IT Data Quality – Can your data be trusted? Where and what are your gaps?

Author: Michael Bottroff


Data quality in IT systems is important because it directly impacts the accuracy and reliability of the information used to make critical business decisions. Poor data quality can lead to incorrect analysis, bad recommendations, and ineffective outcomes, as well as inefficiencies and higher costs. Ensuring data quality helps organizations trust the information they use, minimize risks, and maximize the value of their data assets.



The concept of data quality covers a lot of ground, so in this article we will limit the scope of discussion to object-based data; that is, data about specific objects in the IT environment (such as computers, users, databases, etc.).

If you have been involved in ITAM for even the briefest time, you will know that examples of poor-quality data are everywhere. Knowing where the areas for improvement are is one thing; using the dimensions of data quality to build ongoing good practice is another. These dimensions can be observed within a single system, but also between systems that each hold a record about the same object.

Whether you are trying to forecast your budget requirements for your next hardware roll-out/refresh or attempting to determine an accurate licensing position for your upcoming software renewal, the data you leverage to produce those numbers needs to be trustworthy.

Below we will highlight some common contributors to poor data quality seen in ITAM, some examples, and good data quality practices that can be used to mitigate the issues.

Incomplete Information

Blank or missing fields pose major challenges when the data is used for critical purposes. The degree to which this occurs varies between systems, but it is more common where fields are updated manually, as is often the case in the CMDB.

Example

You need to plan for an upcoming hardware refresh at a department level and rely on your CMDB to provide the figures. However, 20% of devices in your CMDB have a blank ‘department’ field. Additionally, due to network communication issues, your discovery service is not running on all devices, and many systems are missing model or form-factor information.

Good Practice

Completeness refers to the extent to which data includes all relevant and required information.

  • Data can still be deemed ‘complete’ if it is missing optional information.
  • Completeness relates to specific purposes; to look for ‘recently active’ devices, some kind of ‘last seen’ date field is needed. To report on the breakdown of your desktop fleet, fields such as Operating System and Make/Model/Form factor may be required.
  • Depending on your system, setting ‘mandatory’ fields can assist in ensuring completeness, but may not always be practical.
  • Setting up reports or views that highlight where these attributes are missing or blank can assist with maintenance (see the sketch after this list).
  • If there are gaps, are they easily filled?
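
As an illustration of the reporting point above, below is a minimal sketch in Python that flags CMDB records with blank required fields. The CSV export and the column names (‘hostname’, ‘department’, ‘model’) are assumptions for illustration; adjust them to match your own system.

    import csv

    # Required fields for this particular purpose (hardware refresh planning).
    # Column names are assumptions; substitute your own CMDB export's headers.
    REQUIRED_FIELDS = ["department", "model"]

    def find_incomplete(path: str) -> list[dict]:
        """Return records where any required field is blank or missing."""
        incomplete = []
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                gaps = [field for field in REQUIRED_FIELDS if not (row.get(field) or "").strip()]
                if gaps:
                    incomplete.append({"hostname": row.get("hostname", ""), "missing": gaps})
        return incomplete

    if __name__ == "__main__":
        for record in find_incomplete("cmdb_export.csv"):
            print(f"{record['hostname']}: missing {', '.join(record['missing'])}")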

Relevance is the extent to which data is applicable and relevant to the task at hand:

  • Understand what the data will be used for, and which sources have what you need. Not all systems are equal.
  • There will be times when you need to leverage information from multiple places at the same time.

Duplicate Data

This issue occurs when the same information is recorded in multiple ways. It can result from extracting data from various isolated systems that, when combined, produce duplicated records, or from poor life-cycle management processes or manual entry into individual systems. Either way, undetected duplication overstates numbers and can result in incorrect insights.

Example

You have a desktop device that is reimaged or repaired. Due to changes in your IT environment, the device name changes but the serial number stays the same (or vice versa). You may now have the same system recorded twice in various IT systems (depending on their unique identifier).

Good Practice

Uniqueness refers to the extent to which data is distinct and without duplicates.

  • Determine what your ‘primary keys’ are; typically this will be hostname, serial number, asset tag, or a combination of these (a sketch of a simple duplicate check follows this list).
  • If there are cases where duplication is unavoidable, ensure the ‘older’ record is removed or its status is updated appropriately.
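
Below is a minimal sketch of such a duplicate check in Python. It assumes a CSV export with hypothetical ‘hostname’ and ‘serial_number’ columns; group on whichever identifier (or combination) your environment treats as primary.

    import csv
    from collections import defaultdict

    def find_duplicates(path: str, key: str = "serial_number") -> dict[str, list[str]]:
        """Return {key value: [hostnames]} for key values that appear more than once."""
        groups: dict[str, list[str]] = defaultdict(list)
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                value = (row.get(key) or "").strip().upper()
                if value:  # blanks are a completeness issue, handled separately
                    groups[value].append(row.get("hostname", ""))
        return {value: hosts for value, hosts in groups.items() if len(hosts) > 1}

    if __name__ == "__main__":
        for serial, hosts in find_duplicates("cmdb_export.csv").items():
            print(f"Serial {serial} appears {len(hosts)} times: {', '.join(hosts)}")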

Inaccurate Data

This significant data quality issue occurs when data is complete and properly formatted but contains errors or misspellings. The primary challenge is that such errors can be difficult to identify, as in many cases the data ‘looks correct’ to automated checks. Often this is the result of human error.

Examples

  • You have two users named “John Smith” at your company. The wrong John was assigned as the ‘user’ of his device in the CMDB.
  • You have a ‘free text’ field in your CMDB for a device’s location. “12 Main Street” is normally used; however, an employee instead enters “Main St” for a number of new devices. You happen to have a report that specifically references “12 Main Street”, so these new devices do not show up on the report.

Good Practice

Accuracy is the degree to which data is free from errors and mistakes.

  • Is the information correct, and do you have some way to verify this?
  • Is the data automatically populated by a reliable (dynamically updated) source or is it manually entered and prone to human error?
  • Can you cross-reference it against another source?
  • Consider using ‘data-validation’ or pre-selected options to mitigate the problem (a small sketch follows this list).
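
As an illustration of the last two points, here is a minimal sketch in Python that validates a free-text ‘location’ field against an approved list, so that variants such as “Main St” are flagged rather than silently dropping off reports. The column name and the approved values are assumptions for illustration.

    import csv

    # Hypothetical master list of approved location values.
    APPROVED_LOCATIONS = {"12 Main Street", "45 Station Road"}

    def flag_unrecognised_locations(path: str) -> list[tuple[str, str]]:
        """Return (hostname, location) pairs whose location is not on the approved list."""
        flagged = []
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                location = (row.get("location") or "").strip()
                if location and location not in APPROVED_LOCATIONS:
                    flagged.append((row.get("hostname", ""), location))
        return flagged

    if __name__ == "__main__":
        for hostname, location in flag_unrecognised_locations("cmdb_export.csv"):
            print(f"{hostname}: unrecognised location '{location}'")

The better long-term fix is replacing the free-text field with a pre-selected option, as noted above; a report like this simply catches what slips through in the meantime.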

Data-type Inconsistency

This arises when data in a common field is stored in different units or formats (and occasionally languages). This could be an inconsistency across communicating systems, or inconsistencies within a single solution.

Examples

You are a global company looking at the purchase/warranty dates of multiple systems, along with their original purchase price. Some of your dates are MM/DD/YYYY, others are DD/MM/YYYY. You have purchase prices in different currencies and/or the currency is not specified.

In another example, you are attempting to combine a series of numbers from various sources and run calculations on them. However, some of the numbers are stored as a ‘string’, while others are integers, and some are float/decimal. The solution you are using requires them all to be in the same format.
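
A minimal sketch of the kind of normalisation this requires, assuming the values arrive as an untyped mix of strings, integers, and floats:

    from decimal import Decimal, InvalidOperation

    def to_decimal(value) -> Decimal | None:
        """Normalise mixed-type numeric values; return None if a value cannot be parsed."""
        if value is None:
            return None
        try:
            return Decimal(str(value).strip().replace(",", ""))
        except InvalidOperation:
            return None

    # Values as they might arrive from different sources: string, int, float, junk.
    mixed = ["1,299.00", 850, 1042.5, "n/a"]
    prices = [d for d in (to_decimal(v) for v in mixed) if d is not None]
    print(sum(prices))  # 3191.50

Dates are harder: MM/DD/YYYY versus DD/MM/YYYY cannot always be disambiguated after the fact, which is why agreeing on a single format (ideally ISO 8601) at the point of entry matters more than any later clean-up.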

Good Practice

Consistency refers to the degree to which data is uniform and consistent across different sources and systems.

  • Frequent candidates: ‘location’ and ‘date’ fields.
  • The ability to customise may be system dependent.
  • Limit the use of ‘free text’ fields.
  • Encourage data-validation. (Dropdowns, option selection, etc).

Conformity is about the data following a set of standard data definitions such as data size, format, and type.

  • Dates should be stored in a date format; currencies as currencies or decimals; numbers as integers; True/False or Yes/No values as a ‘Boolean’ (or the relevant checkbox/option if the system allows for it); and basic text as a ‘string’.
  • If there is a character limit on a particular field in a particular system (the NetBIOS name is a common one), the lowest common denominator should be followed in other systems to avoid truncation (see the sketch below).
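
For the NetBIOS case, a minimal sketch that flags hostnames longer than the 15-character limit (the hostnames themselves are illustrative):

    # NetBIOS computer names are limited to 15 characters; longer names risk being
    # truncated differently in different systems, which in turn creates duplicates.
    NETBIOS_LIMIT = 15

    def over_limit(hostnames: list[str], limit: int = NETBIOS_LIMIT) -> list[str]:
        """Return hostnames longer than the limit (whitespace stripped first)."""
        return [name for name in (h.strip() for h in hostnames) if len(name) > limit]

    print(over_limit(["FIN-LAPTOP-0042", "MARKETING-DESKTOP-017"]))
    # ['MARKETING-DESKTOP-017']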

Stale/Obsolete records

This occurs over time, usually when a device is decommissioned or retired but there are gaps in the life-cycle process (or it is not followed correctly). The result is stale/obsolete records left in various systems that do not represent real, active devices. These may accumulate in small numbers, but unless they are actively monitored in each system, they will continue to build up over time.

Example

You have recently replaced hundreds of computers in a hardware refresh. The replaced computers have been decommissioned and disposed of; however, due to failures in the life-cycle process (or human error), some of the old systems remain in your CMDB as ‘In-Service’, are still in Active Directory (and not ‘disabled’), and so on.

Good Practice

Currency is the extent to which data is up-to-date and available when needed.

  • Given the dynamic nature of IT, you ideally want to ensure data is updated daily.
  • Determine what your requirements are and ensure your data is available and extractable within this time period.
  • Do you have a reliable ‘last seen’ date in the system?
  • Do you have a process to identify stale records? (A small sketch follows this list.)
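
A minimal sketch of such a stale-record check, assuming a CSV export with hypothetical ‘hostname’ and ‘last_seen’ columns in ISO date format; the 90-day threshold is chosen purely for illustration.

    import csv
    from datetime import datetime, timedelta

    STALE_AFTER = timedelta(days=90)  # illustrative threshold; set per your requirements

    def find_stale(path: str, today: datetime | None = None) -> list[tuple[str, str]]:
        """Return (hostname, last_seen) pairs not seen within the threshold."""
        today = today or datetime.now()
        stale = []
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                raw = (row.get("last_seen") or "").strip()
                if not raw:
                    continue  # blanks are a completeness issue, handled separately
                last_seen = datetime.strptime(raw, "%Y-%m-%d")
                if today - last_seen > STALE_AFTER:
                    stale.append((row.get("hostname", ""), raw))
        return stale

    if __name__ == "__main__":
        for hostname, last_seen in find_stale("cmdb_export.csv"):
            print(f"{hostname}: last seen {last_seen}")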

Missing records

Not to be confused with ‘Incomplete Information’, this refers to data records that should be present in a given system but are missing entirely. It is a significant challenge for system owners, particularly in larger organisations. Each system knows what it knows, but it does not know what is missing without manual and often time-consuming data audits. Absent automation, these audits require a high degree of inter-departmental collaboration to compare sources, and the result is often out of date the day it is completed.

Example

In the image below, imagine you are trying to deploy a new security tool or SAM agent to your entire Windows fleet. How do you know if you have your entire fleet covered effectively? Which is your source of truth?

Source: Jim Schwar (2018 tweet)

Good Practice

There is no one solution to this problem; due to its complexity it is explored further in the next section.

Multiple sources and your source of truth

One of the main challenges with data quality in IT is the multitude of data sources related to a given set of objects, and the difficulty in establishing a single source of truth. This results in a number of issues such as:

  • Data is often entered into systems by different people, leading to inconsistencies and duplications, making it difficult to determine which data is accurate.
  • Integrating data from multiple sources can be complex and time-consuming, especially when different systems use different data formats and structures.
  • It can be challenging to ensure that the data is accurate, up-to-date, and complete, which can negatively impact the quality of insights and decisions.
  • With so much data being stored and shared across different systems, ensuring data security and privacy can be challenging.

So this raises the question: what is your source of truth, and can it be trusted? It is also worth noting that the source of truth may vary for specific attributes, as some details are system specific. Many organisations would suggest their CMDB is their source of truth and the most accurate representation of their environment. While a CMDB usually contains the most comprehensive set of data, it is also the place likely to have the most errors and gaps.

What do you have in place to ensure your CMDB is accurate?

How do you know what is missing from your CMDB (or other sources?)

In order to overcome these challenges, organizations need to have a comprehensive data management strategy in place that includes data governance, data quality control, and data security measures. This will help to ensure that the data used in IT systems is accurate, complete, and secure, enabling organizations to make informed decisions based on the highest quality data.

Determining your gaps

You don’t know what you don’t know. As an earlier image demonstrated, if you were to ask the owner of each solution in your enterprise for the number of X devices in their respective solution, each would give you a different answer. Every solution knows what it has, but it can’t readily tell you what it doesn’t have (but should). To determine this, you need to compare it against all of the other sources. The same applies to your source of truth.

However, comparing data points across multiple IT systems can be a challenging and time-consuming process, for the reasons mentioned above. Chances are that if you have done one too many ‘LOOKUPs’ in Excel on data exports, you have experienced most of the issues highlighted in this article.
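
Those lookups are really approximating a set comparison, which is straightforward to automate. Here is a minimal sketch in Python, assuming each source can be exported to CSV with a ‘hostname’ column (the file and source names are purely illustrative):

    import csv

    def hostnames(path: str) -> set[str]:
        """Load normalised hostnames from a CSV export."""
        with open(path, newline="", encoding="utf-8") as f:
            return {row["hostname"].strip().upper()
                    for row in csv.DictReader(f) if row.get("hostname")}

    sources = {
        "CMDB": hostnames("cmdb_export.csv"),
        "Active Directory": hostnames("ad_export.csv"),
        "Endpoint management": hostnames("endpoint_export.csv"),
    }

    union = set.union(*sources.values())  # every device that any source knows about
    for name, seen in sources.items():
        missing = union - seen
        print(f"{name}: {len(seen)} devices, {len(missing)} known elsewhere but missing here")

Even this only surfaces devices that at least one source knows about; a device missing from every source remains invisible, which is why the process-level controls matter just as much as the comparison.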

All of this is compounded by the fact that IT is a very dynamic environment. This is amplified in larger enterprises. If you are undertaking a manual audit-like approach, chances are once you have completed the task and determined where your ‘gap’ is, your result is already out of date, and you are forever chasing your tail.

To overcome these challenges, organizations can use data integration and data quality tools that automate the process of comparing data points across systems. These tools can help ensure that data is consistent, accurate, and up-to-date, and can save time and reduce the risk and impact of human error. Additionally, implementing data governance and data management best practices can help ensure that data is properly understood, managed, and used effectively. As seen in this article, good practice has many dimensions:

  • Completeness
  • Relevance
  • Uniqueness
  • Accuracy
  • Consistency
  • Conformity
  • Currency
  • Clarity (as to sources of truth)

While ITAM remains a relatively niche field, there are ITAM data quality tools such as AirTrack that provide significant value in this space. However, a tool will not solve all your problems. The IT data in your enterprise needs to be treated as a foundational pillar of your entire environment, not just a series of outputs to be consumed. Only through dedicated governance and processes, in combination with a great tool, will you truly begin to see the positive impact quality data can have on driving better decisions for your enterprise.


If you would like to hear more information about Data Quality in your organisation, and how TMG can help, please reach out to us via email at info@tmg100.com or visit our website https://tmg100.com.
