What is Data Profiling?
Data Profiling is an analytical technique that uses statistical processing to discover the structure, content and relationships in a data source. The findings can be used to identify useful insights as well as potential inaccuracies within the data, and rules can then be set up to act on them. For example:
- The results may indicate that one attribute has 99.9% unique values - a strong indicator that the attribute is a candidate primary key for the table, with the remaining 0.1% worth checking for duplicates.
- Another insight may show that 98% of an attribute's values are 3 characters long and 2% are 4 characters long, suggesting that the 2% may be invalid. A rule could then be set up to flag any value that is not exactly 3 characters long as invalid.
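The two checks above can be sketched in a few lines of Python. This is only an illustrative sketch using the standard library: the function names and the sample `codes` list are hypothetical and do not come from any particular profiling tool.

```python
from collections import Counter


def uniqueness_ratio(values):
    """Return the fraction of non-null values that are distinct."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return 0.0
    return len(set(non_null)) / len(non_null)


def length_distribution(values):
    """Map each value length to the fraction of non-null values with that length."""
    lengths = [len(str(v)) for v in values if v is not None]
    counts = Counter(lengths)
    return {length: n / len(lengths) for length, n in counts.items()}


def flag_unexpected_lengths(values, expected_length):
    """Apply a simple rule: any non-null value of a different length is suspect."""
    return [v for v in values
            if v is not None and len(str(v)) != expected_length]


# Hypothetical sample attribute: mostly 3-character codes, one 4-character outlier.
codes = ["GBR", "USA", "FRA", "DEU", "ESPN"]

print(uniqueness_ratio(codes))            # share of distinct values
print(length_distribution(codes))         # share of values per length
print(flag_unexpected_lengths(codes, 3))  # values that break the length rule
```

A high uniqueness ratio points to a key candidate, while a dominant length with a small minority of other lengths points to a rule worth testing. Whether the minority is actually invalid still needs to be confirmed against the business meaning of the attribute.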
Why is Data Profiling important?
Data Profiling is a vital activity in the data quality lifecycle because it is essential for understanding what the correct data quality rules should be for a given attribute or relationship.
By creating stringent data quality rules you can reduce the amount of incorrect data entering the database and more easily identify the incorrect data already inside it. These rules also accelerate and improve the effectiveness of root-cause analysis. By tracing defects to their source, organisations can begin to understand the original cause of a data quality defect and implement long-term solutions for greater cost benefit.
How can you implement Data Profiling?
Data Profiling is typically carried out with data profiling software, as it can analyse large volumes of data and create meaningful reports. These reports help users understand their data more readily and take appropriate action, such as ongoing data quality improvement and control.
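To give a feel for what such a report contains, here is a minimal sketch of a single-column profile in plain Python. It is not how any specific product works: `value_pattern`, `profile_column` and the sample `postcodes` list are hypothetical names chosen for illustration, and treating empty strings as missing is an assumption.

```python
import re
from collections import Counter


def value_pattern(value):
    """Generalise a value into a pattern: letters become 'A', digits become '9'."""
    text = re.sub(r"[A-Za-z]", "A", str(value))
    return re.sub(r"[0-9]", "9", text)


def profile_column(values):
    """Summarise one column: row count, missing values, distinct values, patterns."""
    # Assumption: both None and the empty string count as missing.
    non_null = [v for v in values if v is not None and v != ""]
    patterns = Counter(value_pattern(v) for v in non_null)
    return {
        "count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),
        "top_patterns": patterns.most_common(3),
    }


# Hypothetical sample column mixing two formats with missing values.
postcodes = ["AB1 2CD", "EF3 4GH", "12345", None, ""]
report = profile_column(postcodes)

for metric, result in report.items():
    print(f"{metric}: {result}")
```

Dedicated tools compute metrics of this kind across every column and table at once, which is where the volume and reporting advantages described above come from.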
The benefits of using dedicated profiling software are:
- Correlation Design - allows instant profiling analysis and relationship discovery.
- Global Search - returns instant results for patterns, metrics and defects across your data landscape.
- Impact Analysis - links financial metrics to data quality metrics.
- Data Prototyping - lets you prototype business rules, data transformations, standardisation, cleansing, enrichment and calculations.
- Relationship Discovery - finds relationships between data held in different systems.
- Automated Defect Detection - discovers thousands of content quality issues.