
I recently attended the IRM Enterprise Data and BI conference held from 4th to 6th November in London. The conference was a great opportunity not only to present but also to gather knowledge from many data practitioners.

One of my favourite sessions was Defining Data Quality Dimensions presented by Nicola Askham and Denise Cook, which gave the audience an opportunity to understand and review their recent white paper covering the six primary dimensions for data quality assessment.

It was interesting that, whilst this was the last slot of the three-day conference, the debate it sparked was unprecedented! I will let you read the white paper at your leisure, but I was keen to put forward my thoughts on the six dimensions, using the definitions provided by DAMA.

1. Completeness, Uniqueness and Validity of data

We start with the three dimensions I think are relatively easy to understand and measure.

Completeness

DAMA definition: The proportion of stored data against the potential of “100% complete”.

My interpretation: We know when a field has a value and when it does not. Completeness easily tells us how much we know about a customer, how identifiable a location is, or how well a product is defined.

Impact if not met: Not having a telephone or mobile number means you cannot call the customer. Not defining product attributes means your customer does not understand enough about what they are trying to buy.

Uniqueness

DAMA definition: Nothing will be recorded more than once based on how that thing is identified.

My interpretation: Uniqueness tells you what makes a data entity one of a kind; when it is not maintained, we get duplicates. People, products and suppliers are all entities that you expect to be unique.

Impact if not met: Data that is not unique can waste time and money. Duplicate data delivers multiple letters to the same customer, creating a negative impact. It hides the true view of inventory held on a product, wreaking havoc on your purchasing strategy.

Validity

DAMA definition: Data is valid if it conforms to the syntax (format, type, range) of its definition.

My interpretation: Data is valid when it conforms to the format, type and range that have been set up as part of its definition. Postcodes complying with a particular format, or drivers not being below a certain age, are examples of validity rules.

Impact if not met: The problem with invalid data is that the impact can vary from simply not adhering to a set “look and feel” to violating a core principle that drives a business process. Issuing a licence to an underage driver can have severe consequences!

I like the above three dimensions as I find them quite easy to understand, and I can relate to the impact of not adhering to them. However, these ‘simple’ dimensions are the ones where data quality can fail quite easily. I find that they get ignored far too often, with no systems or processes to enforce them during data capture. Trying to correct these failures can be quite tricky and often requires businesses to go back to the source to get the right data, which is not always possible.

My recommendation would be to adopt measures that prevent these dimensions from failing in the first place, when the data is captured. Systems can be designed to perform real-time data quality checks on Completeness, Uniqueness and Validity, and to enforce them at critical stages of business processes.
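To make this concrete, here is a minimal sketch in Python of what such capture-time checks could look like. The field names, the simplified UK-style postcode pattern and the rules themselves are illustrative assumptions, not anything prescribed by DAMA or the white paper.

```python
import re

# Illustrative capture-time checks for Completeness, Uniqueness and Validity.
# The required fields, the simplified UK postcode pattern and the rules are
# assumptions made for this example.

REQUIRED_FIELDS = {"customer_id", "name", "postcode", "telephone"}
POSTCODE_PATTERN = re.compile(r"^[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}$", re.IGNORECASE)


def check_record(record, existing_ids):
    """Return a list of data quality failures for a single incoming record."""
    failures = []

    # Completeness: every required field must carry a value.
    missing = sorted(f for f in REQUIRED_FIELDS if not record.get(f))
    if missing:
        failures.append("incomplete: missing " + ", ".join(missing))

    # Uniqueness: the identifier must not already be recorded.
    if record.get("customer_id") in existing_ids:
        failures.append(f"duplicate: customer_id {record['customer_id']} already exists")

    # Validity: the postcode must conform to the expected syntax.
    postcode = record.get("postcode", "")
    if postcode and not POSTCODE_PATTERN.match(postcode):
        failures.append(f"invalid: postcode '{postcode}' does not match the expected format")

    return failures


if __name__ == "__main__":
    existing = {"C001"}  # identifiers already held in the system
    incoming = {"customer_id": "C001", "name": "Jane Doe",
                "postcode": "NOT A CODE", "telephone": ""}
    for failure in check_record(incoming, existing):
        print(failure)
```

The intent is that a failing record is challenged at the point of capture rather than corrected after the fact, which is usually far cheaper than going back to the source later.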

2. Timeliness and Consistency of data

So, let’s tackle the next two dimensions that need more care when setting the context.

Timeliness

DAMA definition: The degree to which (a) data represents reality from the required point in time, or (b) customers have the data they need at the right time.

My interpretation: This dimension is all about the likelihood of data being affected by time and the degree to which data represents reality at a particular point in time. Classic examples include a change of address when a person moves, a change of surname after marriage, the age of a person, or an expired passport. The passage of time does not affect all data in the same way: some data ages naturally, some on a particular trigger, and the impact can vary depending on how the data is used.

Impact if not met: Not being aware of a change of address could result in confidential information being delivered to the wrong recipient.

Consistency

DAMA definition: The absence of difference, when comparing two or more representations of a thing against a definition.

My interpretation: Consistency tells you that information is being captured as expected and that nothing is out of the norm. We expect that a calculated age is derived from the date of birth, and we know that the net price is a combination of gross price, taxes and discounts.

Impact if not met: Inconsistent data stands out once you know what questions to ask. When defining consistency rules, you need to know about the relationships between data. Inconsistencies can reveal fraud, highlight losses and save lives. For example, a difference between the energy supplied to a row of houses and the amount actually used could indicate a leak or fraud.

These two dimensions need adequate knowledge of the impact of data that is not timely or not consistent with business rules. We need more collaboration with business users and proactive modelling of what happens when these dimensions fail.

My recommendation would be to spend time understanding your data: profile it to analyse relationships within and across data entities, uncover the patterns that appear to be normal, and monitor changes to those patterns over time. Work with your business users to truly understand the impact of not having consistent and timely data. These rules require you to be more agile as your business changes how it uses data and grows its consumption of it.
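As an illustration of what such rules might look like once they have been agreed with the business, here is a small Python sketch of consistency and timeliness checks. The field names, the pricing rule (net = gross minus discount plus tax) and the 365-day staleness threshold are assumptions made purely for the example.

```python
from datetime import date

# Illustrative consistency and timeliness checks. The field names, the pricing
# rule and the 365-day threshold are assumptions; real rules should come from
# your own business users.


def check_order(order, today):
    """Return a list of consistency and timeliness issues for one order row."""
    issues = []

    # Consistency: the stored net price should be derivable from its parts.
    expected_net = order["gross_price"] - order["discount"] + order["tax"]
    if abs(order["net_price"] - expected_net) > 0.01:
        issues.append(f"inconsistent: net_price {order['net_price']:.2f} "
                      f"differs from the derived value {expected_net:.2f}")

    # Consistency: the recorded age should agree with the date of birth.
    dob = order["customer_dob"]
    derived_age = today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))
    if order["customer_age"] != derived_age:
        issues.append(f"inconsistent: recorded age {order['customer_age']} "
                      f"vs derived age {derived_age}")

    # Timeliness: data not verified for over a year may no longer reflect reality.
    days_since_verified = (today - order["last_verified"]).days
    if days_since_verified > 365:
        issues.append(f"stale: last verified {days_since_verified} days ago")

    return issues


if __name__ == "__main__":
    sample = {"gross_price": 100.00, "discount": 10.00, "tax": 18.00, "net_price": 110.00,
              "customer_dob": date(1990, 6, 1), "customer_age": 29,
              "last_verified": date(2019, 1, 15)}
    for issue in check_order(sample, today=date(2021, 1, 1)):
        print(issue)
```

In practice, the derivations and thresholds would come out of the profiling work and the business conversations described above.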

3. Accuracy of data!

So we come to the final dimension, and probably the one most often used as a synonym for good data quality.

Accuracy

DAMA definition: The degree to which data correctly describes the “real world” object or event being described.

My interpretation: This is probably one of the toughest dimensions to measure, and it is always subjective, depending on the context in which the data is used. Just because we correct an address and postcode using reference data does not mean it is the right one for the customer. We often need repeated, manual checks to increase our confidence in accuracy, often bolstered by the rest of the dimensions.

Impact if not met: Inaccurate data supports the old adage of garbage in, garbage out. Decisions made on inaccurate data can set your business backwards. Incorrectly billing a customer for their neighbour’s usage not only incurs a loss of business but also negative press and, worse still, a fine from the regulator.

This dimension was the most debated one at the IRM session, probably because it is quite subjective. Accuracy requires a thorough knowledge of your data entity and of what makes it accurate. It may require comparing the electronic information against the real-world example far more often, for instance through identity checks in person during security vetting, or scanning a product barcode to tally up the electronic and real-world item.

My recommendation for gaining confidence in accuracy is to befriend the business users who know the data entity in the real world and are constantly fighting the battle over what is true. These include the customer-facing operators who perform identity checks, the staff in the warehouse who physically handle and move goods, and the delivery van drivers who know which data gets them to the right location. Once you know what works for these users of data, you start to understand the processes the real world uses for accuracy checks, and this can reveal ideas on how they translate into measurable data. One of the best examples from the manufacturing industry is the electronic audit trail of quality checks and their outcomes that determine whether a product is fit for purpose.
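As a sketch of how that real-world feedback could translate into something measurable, the Python example below compares stored addresses against addresses confirmed on delivery and reports an accuracy rate. The data, and the idea of using delivery confirmations as the benchmark, are illustrative assumptions.

```python
# Illustrative accuracy measurement: compare stored values against a trusted
# real-world observation (here, addresses confirmed during delivery). The data
# and the choice of benchmark are assumptions made for this example.

stored_addresses = {
    "C001": "12 High Street, Leeds, LS1 4AB",
    "C002": "7 Mill Lane, York, YO1 8GH",
    "C003": "3 Abbey Road, London, NW8 9AY",
}

# What the delivery drivers actually confirmed at the door.
confirmed_addresses = {
    "C001": "12 High Street, Leeds, LS1 4AB",
    "C002": "9 Mill Lane, York, YO1 8GH",   # stored house number is wrong
    "C003": "3 Abbey Road, London, NW8 9AY",
}


def accuracy_rate(stored, confirmed):
    """Proportion of stored records that match the real-world observation."""
    checked = [cid for cid in stored if cid in confirmed]
    matches = sum(1 for cid in checked if stored[cid] == confirmed[cid])
    return matches / len(checked) if checked else 0.0


if __name__ == "__main__":
    print(f"Address accuracy: {accuracy_rate(stored_addresses, confirmed_addresses):.0%}")
```

The same pattern extends to the manufacturing audit trail: every real-world check that is recorded electronically becomes another benchmark against which accuracy can be measured.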

In summary, I would use the 3-2-1 approach above (the three easy-to-measure dimensions, then the two that need more context, and finally accuracy) when putting these six data quality dimensions into practice. And while DAMA are reviewing and revising the dimensions, this is a great place to start if you are thinking about implementing a data quality programme.

Kick-start measuring your data quality using the most powerful free data profiler on the market (Bloor).