Can we automate data quality and can it support machine learning?

Machine learning and data quality

I recently blogged on the emergence of the IoT (Internet of Things) and how it’s a new front in the battle against poor quality data. This time I wanted to take a two-pronged approach to another emerging trend – machine learning. With this blog, I want to examine whether machine learning can be used to maintain and improve data quality but also look at the risks to machine learning posed by poor quality data.

What is machine learning?

The terms ‘machine learning’ and ‘artificial intelligence’ are often used interchangeably and are closely tied to other terms such as deep learning, algorithms and natural language processing, to name a few. I’m not going to get into the dictionary meanings and differences. For the purposes of this post, machine learning is a way to use algorithms not only to automate but also to improve a defined process or set of processes. An important point here is that automation has existed for a long time. However, once ‘trained’, a traditional automated system could not learn if the environment or data changed. Machine learning (as the name suggests) allows the automation rules to change over time to adjust to environmental changes.

Artificial intelligence (or AI) is the ability of a computer to simulate or imitate human behaviour – which may or may not be a good thing when considering data quality.

To minimise confusion, I’ll avoid the term AI and focus on algorithms (those sets of rules and calculations that help solve defined problems). They can either support the improvement of data quality or be thrown off by poor quality data, if the possibility of poor data was not considered when they were built.

Using machine learning to improve data quality

This seems quite simple on the face of it but, as with any digital transformation, moving from manual to automated and on to ‘intelligent’ data quality management will require a long-term plan. At Experian, we’ve long spoken about the progression of data management as a four-stage process – the “Data Maturity Curve”.

Our research over a number of years has shown a steady progression up the maturity curve as organisations take their data more seriously. There is still a pretty even spread along the curve but with the advent of the GDPR in 2018, we expect to see more organisations moving away from the lower levels as they look to instigate more advanced data management processes, tools and policies to support their compliance response.

For those organisations already towards the ‘optimized and governed’ end of the scale, we could be seeing the emergence of another level; something that I think could be called ‘intelligently automated’.

Most data quality programmes already contain an element of automation (for example, running a de-dupe script once a month) and test and learn – reviewing how a change to a rule, a new validation step or reference dataset can improve data quality. A simple example to visualise is changing a field on your website sign up form to increase or decrease the number of title options such as adding Mx or removing Lord.

Does this change encourage more people to fill the field in? Does it help your outbound marketing team to better address messaging to customers? If not, could you simply drop the field entirely to speed up the sign-up process and reduce dropouts?

A next step is the use of machine learning to automatically recognise and take action on different types of data. For example, a data management tool that can recognise an address, email, credit card number, national insurance number and so on with little pre-training or rule writing before taking actions such as validating the entry or flagging a compliance issue to a manager.
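To make the idea of recognising data types concrete, here is a minimal sketch of the hand-written baseline that a machine learning approach would aim to replace or extend. It is rule-based, not machine learning, and the function names (luhn_valid, classify_value) and the simplified patterns are my own assumptions for illustration. The National Insurance pattern in particular only checks the general shape, not the official prefix rules.

```python
import re

# Simplified patterns (assumptions for illustration, not full validation rules).
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
# General shape of a UK National Insurance number: two letters, six digits, A-D.
NINO_RE = re.compile(r"^[A-Z]{2}[0-9]{6}[A-D]$")
CARD_RE = re.compile(r"^[0-9]{13,19}$")


def luhn_valid(digits: str) -> bool:
    """Return True if a string of digits passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def classify_value(value: str) -> str:
    """Guess what kind of data a single field value holds."""
    text = value.strip()
    compact = re.sub(r"[\s-]", "", text)  # drop spaces and hyphens
    if EMAIL_RE.match(text):
        return "email"
    if CARD_RE.match(compact) and luhn_valid(compact):
        return "card_number"
    if NINO_RE.match(compact.upper()):
        return "national_insurance_number"
    return "unknown"


for sample in ["jane@example.com", "4111 1111 1111 1111", "QQ 12 34 56 C"]:
    print(sample, "->", classify_value(sample))
```

Each new data type here needs another hand-written rule. The promise of machine learning is a tool that learns these patterns from examples, so that a label such as "card_number" can trigger validation or a compliance flag with far less rule writing.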

The ultimate goal of course is machine learning for data quality that then improves itself over time. A good example of this is company name – is Tesco PLC the same as Tesco Stores Ltd? What about a part of the Tesco group which does not have the word ‘Tesco’ in the company name? Grouping commercial entities together can be as simple as looking for the name or more complex by looking at the detail of company accounts, head office addresses, CEO names, web addresses and other metadata to find linkages around the globe. Iteratively improving that process with each data update and as new data sources are added is a key potential benefit of machine learning.
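The two levels of matching described above can be sketched in a few lines. This is a simple heuristic under stated assumptions, not a learned model: the field names (head_office_postcode, web_domain), the thresholds and the sample records are all invented for illustration. Names are compared with a token overlap coefficient after stripping legal suffixes, and shared metadata is used as a fallback when the names have nothing in common.

```python
import re

# Legal suffixes ignored when comparing names (an illustrative, incomplete list).
LEGAL_SUFFIXES = {"plc", "ltd", "limited", "llp", "inc"}


def name_tokens(name: str) -> set:
    """Lower-case a company name and drop legal suffixes."""
    words = re.findall(r"[a-z0-9]+", name.lower())
    return {w for w in words if w not in LEGAL_SUFFIXES}


def name_overlap(a: str, b: str) -> float:
    """Overlap coefficient: shared tokens divided by the smaller token set."""
    ta, tb = name_tokens(a), name_tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / min(len(ta), len(tb))


def likely_same_group(rec_a: dict, rec_b: dict, name_threshold: float = 0.5) -> bool:
    """Link two records by name, or by shared metadata when names differ."""
    if name_overlap(rec_a["name"], rec_b["name"]) >= name_threshold:
        return True
    shared = sum(
        1
        for key in ("head_office_postcode", "web_domain")  # hypothetical fields
        if rec_a.get(key) and rec_a.get(key) == rec_b.get(key)
    )
    return shared >= 2


# Invented records: the second has no name in common with the first.
parent = {"name": "Acme PLC", "head_office_postcode": "ZZ1 1ZZ", "web_domain": "acme.example"}
subsidiary = {"name": "Northfield Logistics Limited", "head_office_postcode": "ZZ1 1ZZ", "web_domain": "acme.example"}

print(name_overlap("Tesco PLC", "Tesco Stores Ltd"))
print(likely_same_group(parent, subsidiary))
```

The thresholds and the choice of fields are fixed by hand here. The machine learning version of this would adjust them as confirmed matches and mismatches accumulate with each data update.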

These kinds of hypotheses are the business challenges that a strong data strategy can support. However, can we move to a place where we can automate this learning and improve our data quality over time with less manual effort, giving our data people more time to analyse and support the business?

That’s the challenge for machine learning – taking the base business rules for data quality, implementing them and then suggesting improvements as the real-world changes in data become visible as exceptions or outliers.
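As a rough illustration of that loop, the sketch below takes a base business rule (a list of allowed titles, echoing the sign-up form example earlier) and surfaces rejected values that appear often enough to suggest the rule is out of date. It is a plain frequency heuristic, not machine learning, and the function name suggest_rule_updates, the 5% threshold and the sample data are assumptions of mine.

```python
from collections import Counter


def suggest_rule_updates(values, allowed, min_share=0.05):
    """Return rejected values common enough to be worth a human review."""
    # Normalise casing and ignore empty entries before applying the rule.
    cleaned = [v.strip().title() for v in values if v and v.strip()]
    if not cleaned:
        return []
    rejected = Counter(v for v in cleaned if v not in allowed)
    total = len(cleaned)
    return sorted(v for v, count in rejected.items() if count / total >= min_share)


allowed_titles = {"Mr", "Mrs", "Ms", "Dr"}
# Invented sample: 'mx' shows up repeatedly, 'Mrx' looks like a one-off typo.
submitted = ["Mr"] * 20 + ["Mrs"] * 15 + ["mx"] * 4 + ["Mrx"]

print(suggest_rule_updates(submitted, allowed_titles))
```

A frequent exception is put forward as a candidate rule change, while a rare one is left as a likely error. A person still makes the final call, which fits the aim of freeing up data people for analysis.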

It’s an emerging subject and one that I expect to see a great deal of development on in the years ahead.

Could poor data quality create problems for machine learning?

Automating tasks normally done by humans often creates a level of nervousness and has not always resulted in the positive impacts envisaged by those doing the automation.

The 2008 financial crisis is an interesting example of what happens when algorithms are written in a way that fails to take the real world into account, or that adds so many layers of complexity that they become difficult for a human to understand – even the humans who wrote them! This article from the time discusses some of the concerns about such complex automation of financial trades. But what if the data that the algorithms seek to use is wrong?

The use of machine learning has exploded in the last decade, and it is now used for a wide range of tasks, from the advanced to the mundane. That could be anything from recommendations on movie streaming services, to chatbots helping you navigate the London Underground, to the way a supermarket arranges its aisles: placing items that people commonly buy together closer to each other (for convenience), or closer to products they don’t normally buy but could be encouraged to try, in order to create a higher value basket for the shop.

However, what could happen if an algorithm is set to work on poor quality data? Ending up at the wrong end of the Northern Line with a melted tub of ice cream and some crab sticks is unlikely to be your idea of a good Friday night out.

The risks in the future could be far more severe – if we begin trusting machine learning to improve the discovery and testing of pharmaceuticals (as an example), what would happen if a drug were formulated but there were errors in the chemical compound data used to simulate testing? If it were then to reach human trials with that error intact, the impact on those people involved could be severe. Obviously, the potential for the use of machine learning is huge – enabling us to design drugs to work with our genomes, to dispense medicines that are more compatible with already in-use drugs to cut side effects or improve the resolution of MRI scans and spot the kinds of details such as early tumour formation which the human eye would struggle to see. The benefit of the massive amounts of data in the medical arena could also mean rapid improvements to these kinds of diagnostic algorithms. In an industry that already relies on data standards, it’s encouraging to see that data quality is taken so seriously with the standards already available for chemicals data and so on. In less mature industries though, there is likely more work to do.

An emerging application of machine learning that could also be impacted by poor base data is self-driving vehicles. Whether it’s the mundane cars and lorries or more ‘out there’ flying taxis, there are huge reams of data that will be critical to get vehicles from place to place safely. From maps and addresses to how a vehicle reacts to a cyclist or, in the case of a flying taxi, a new building under construction, the data used to teach the machine will be crucial to consumer and regulator adoption.

As we’ve seen over the years with the growth of satellite navigation and news stories of people ending up stuck on beaches (as their sat nav didn’t let them know that they should take a ferry) or lorries trapped under railway bridges, the need is not just for location data but also for ‘context’ data. Organisations involved in the mobility industry are working to improve the data ecosystem with better map data, points of interest, supplementary information and of course, the combination of consumer contributed data (in apps like Waze) and machine learning to improve the data with every journey. When fully autonomous vehicles become the norm, the hope is that road safety can be improved dramatically. Whilst strong global standards already exist for much of this reference data, new sources and the algorithms that use them will need attention to ensure that, for example, a pedestrian crossing is described in the same way on every road.

Setting the strategy

Fundamentally, every use of machine learning is reliant on data that is fit for purpose – if the automation of decisions is based upon poor base data, the decisions can’t be trusted. This all starts with a data strategy – thinking about the reasons for embarking upon machine learning and what you hope to achieve and the outcomes you want to avoid.

From here, an initial assessment of your data should be performed to sense check the quality of what you have already, plan for what you may need to acquire and how this can all be brought together in a manner which improves the quality of the data and the result whilst also ensuring the ethical use of the data.

Machine learning will enable organisations to manage data quality more efficiently as well as help organisations to make better, faster decisions. Whether you’re thinking about investing in machine learning itself or a data management tool with machine learning capabilities, think about the impact of and on the quality of your data and the outcome you wish to achieve.

Whatever stage you’re at, it’s always a good idea to monitor your data quality maturity and identify the areas which may require attention. That way you’ll be in the best possible position to kick off machine learning initiatives in the future, safe in the knowledge that your people, processes and technology are geared up. You can check your own maturity here.