Step 1: Define Your Data Quality Requirements
This first step sounds obvious but it’s a common pitfall of organisations to base their data quality efforts around whatever tools and skills they have available as opposed to what they actually need to deliver a data quality strategy.
For example, a lot of SQL developers can hand-code simple data profiling routines for statistics such as column completeness, foreign key violations and even pattern analysis. When people see this they often question the need for investment in specialist profiling tools.
Data profiling is an important activity but only makes up one piece of the data quality puzzle of course. The purpose of data profiling is to discover data quality rules and this is where SQL fails miserably. SQL only really adds value when you know what columns you wish to deal with and which rules to apply. Data quality tools allow you to discover rules across thousands of attributes, hundreds of tables and many systems.
For example, you might need capabilities such as:
- Discovery of data relationships between 10+ systems
- Scope determination of 50,000+ attributes
- Data profiling of 30+ metrics against 35,000+ attributes and 1,200+ entities
- Data validation rules for 1,000+ entities
- Data cleansing and transformation for 500 identified attributes and 1,000,000+ records
- Data quality monitoring – twice daily on all validation rules
- Shared data entry validation rules for 3+ business applications
- Quality control environment for 2-3 data stewards
- Data lineage management back through 5+ systems
- Development of data quality rules for shared access
- Data quality rule execution across multiple systems simultaneously
Whilst these might appear excessive at the outset you have to ask yourself:
“What are we trying to achieve now and for the long term?”
Don’t just focus on basic checks like simple column statistics. Think about broader data quality tasks because 9 times out of 10 what you actually need just isn’t feasible with SQL.
Step 2: Create a Requirements vs Capability Matrix
Now you need to list your core requirements on one axis and the product options on the other using a spreadsheet. Obviously an SQL approach will be one of the solution columns.
I like to use a scoring system for each feature:
- 0 – Feature does not exist
- 1 – Feature exists but is not complete
- 2 – Feature exists and is complete
You can adopt a colour scheme for each value to create a simple visual tool to observe which solution has the least gaps. I typically use a Red, Amber and Green system:
- Red – 0
- Amber – 1
- Green – 2
This also allows you to score each solution by creating a total at the end of each column. You can adjust the scoring system by introducing a weighted factor for the criticality of each feature.
Step 3: Create a Justification Description for Each Gap
Where you find a feature (e.g. automatic relationship discovery) that can’t be achieved with SQL you can bet that someone will push back with the question:
“Do we really need this requirement?”
I find it helps to prepare for this inevitable rebuttal by building a case for each feature. In the example of automatic relationship discovery you could say something like:
“We have inherited a poorly documented, silo driven IT landscape. We know that our service depends on functions working well between systems, we just don’t know how these functions are linked between systems or how bad the data is that supports them.
Using a relationship discovery function, we can not only find and document these missing links but also quantify their health, helping us find areas for improvement.
What’s more we can automatically monitor the data quality of these links moving forward.
SQL fails to provide any of these functions at a scale and performance that is practical.”
In my experience, when you list out all the functions required of a data quality tool and then score your SQL approach against this matrix you will soon realise just how impractical and ill-equipped SQL is for delivering data quality improvement and management projects.
Everyone knows that SQL is a poor choice for reasons such as:
- Lack of scalability and poor performance
- Huge gap in essential data quality functions
- Over-reliance on technical staff causes frustration with the business
- SQL can’t be used against multiple systems simultaneously
However, the biggest failure I see with SQL is that it relies on the developer knowing roughly where the problems lie. It lacks the ability to measure and improve data quality on a much broader, enterprise scale.
Using my capability matrix idea above you should be able to create a clear justification for dropping SQL for a more practical and long-term solution.
The final proof comes from implementing a pilot project and comparing a data quality tool against the SQL approach. When you do this I guarantee you’ll “put a nail in the SQL coffin” once and for all because it will be a serious mismatch.
The business in particular will not want to adopt an outdated method of hand-coding scripts when they’ve witnessed how much control and productivity they gain using specialist data quality tools.