Understanding your data

Data migration projects provide an opportunity to discover how well your data has really been managed over the years.

You can finally move past the assumptions, anecdotes and accusations about your data; the migration will expose the whole truth (and nothing but the truth!). The inevitable uncertainty around legacy data quality is why I always plead with companies to start landscape analysis as soon as possible.

As you start to perform initial data profiling and discovery using tools like Experian’s Pandora free data profiler, you’ll find data quality issues. These problems typically cover the full spectrum of data quality rule violations, forcing you to make many cleansing decisions, but the biggest decision will be:

“At what specific stage and location should we cleanse our legacy data?”
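Answering that question well starts with evidence rather than anecdote: quantify the problems first. As a rough illustration of a first-pass profile (independent of any particular tool), here is a minimal sketch in Python with pandas; the file name, column names and the postcode rule are all invented for the example:

    import pandas as pd

    df = pd.read_csv("legacy_customers.csv", dtype=str)

    # Null and distinct counts give a first view of completeness
    # and cardinality for every column in the extract.
    for col in df.columns:
        nulls = df[col].isna().sum()
        distinct = df[col].nunique(dropna=True)
        print(f"{col}: {nulls}/{len(df)} null, {distinct} distinct")

    # Spot-check one illustrative format rule: postcodes that don't
    # start with the expected letter/digit pattern.
    bad = df[~df["postcode"].str.match(r"^[A-Z]{1,2}[0-9]", na=False)]
    print(f"{len(bad)} rows violate the postcode rule")

Even a crude pass like this shows you which columns are worst affected and roughly how much work each of the policies below would have to absorb.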

The cleansing strategy taken at this point will largely dictate the success of the subsequent migration. Upon finding issues, many project leaders make the mistake of imposing a hardened, project-wide policy without really understanding the implications. Here are some of the project-wide policies I’ve witnessed in the past.

Policy Idea #1: “We will fix it all in the target”

I’ve witnessed numerous projects where the perceived wisdom is: “We don’t have the resources/funding/time for cleansing; we’ll just take the data as-is.” By ignoring data flaws, or at best ‘fixing them in the target’, you simply introduce another flawed approach.

As a project leader you are effectively saying: “We’ll push the business users to improve the data quality after go-live (when they’ve already got their hands full learning the new system and ironing out other operational issues).” Clearly, this is not a sound state of affairs. You want to improve the data in the location that causes the least business disruption.

Some data CAN be corrected in the target environment, but don’t create a policy that spans all data; instead, pick the right treatment for each symptom.

Policy Idea #2: “We will manually fix it all in staging”

A staging area strategy relies on a kind of ‘halfway house’ where data resides and can be pre-processed before migration. It allows you to reshape the legacy data into structures that closely resemble the target environment so that the final load can be validated and executed more easily.
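As a sketch of the kind of reshaping that happens in staging, assume the target wants separate name fields where legacy stores one combined “Surname, Forename” field (the file, table and column names are hypothetical):

    import pandas as pd

    legacy = pd.read_csv("legacy_accounts.csv", dtype=str)

    # Reshape to a target-like structure: rename the key and split the
    # combined name field into the two columns the target schema
    # expects. Nothing in the live legacy system changes.
    names = legacy["FULL_NAME"].str.split(",", n=1, expand=True)
    staging = pd.DataFrame({
        "account_id": legacy["ACCT_NO"].str.strip(),
        "last_name":  names[0].str.strip(),
        "first_name": names[1].str.strip(),
    })
    staging.to_csv("staging_accounts.csv", index=False)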

The advantage of a staging area is that workers can make wholesale changes to the data without impacting the live legacy environment. The dilemma with changing data in the staging area is twofold:

  • How can we track manual changes made by workers?
  • How can we ensure consistency with the legacy systems (or target systems) if data is being migrated in multiple phases?

It has always been important to track changes to legacy data, but with increasing demands for regulatory compliance it is no longer just common sense; there are often legal implications around the lineage and provenance of data to consider too.
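A lightweight answer to both questions is to force every manual fix through a routine that records it. A minimal sketch, where the function, file and field names are all illustrative assumptions:

    import csv
    from datetime import datetime, timezone

    AUDIT_LOG = "staging_audit_log.csv"

    def apply_manual_fix(row, field, new_value, changed_by, reason):
        # Record who changed what, when, and why before applying it,
        # so lineage can be demonstrated later.
        old_value = row[field]
        with open(AUDIT_LOG, "a", newline="") as f:
            csv.writer(f).writerow([
                datetime.now(timezone.utc).isoformat(),
                row["record_id"], field, old_value, new_value,
                changed_by, reason,
            ])
        row[field] = new_value
        return row

The same log doubles as a worklist for re-applying fixes in later migration phases, or back into legacy if the two environments must stay consistent.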

Policy Idea #3: “We will cleanse all data in the transformation code”

This approach requires the migration team to build cleansing routines into the transformation logic of the migration itself: a series of lookups and correction rules that improve the data as it flows through the live migration.

In-flight data quality improvement is a useful tactic for speeding things up because it reduces the need for a separate cleansing pass in a staging area. There is a considerable amount of data processing taking place during the migration anyhow, and some modern data management tools combine data migration and data quality functionality, making life a lot simpler.
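A minimal sketch of the idea, assuming a simple record-at-a-time pipeline (the codes and mapping are invented examples):

    # A lookup table standardises legacy values as each record passes
    # through the transformation step; unknown values pass through
    # unchanged so they can be reported on rather than silently lost.
    COUNTRY_LOOKUP = {"UK": "GB", "ENGLAND": "GB", "USA": "US"}

    def transform(record):
        rec = dict(record)
        rec["country"] = rec["country"].strip().upper()
        rec["country"] = COUNTRY_LOOKUP.get(rec["country"], rec["country"])
        return rec

    legacy_batch = [{"id": "1", "country": "uk"},
                    {"id": "2", "country": "England"}]
    print([transform(r) for r in legacy_batch])

In a real migration the lookup would typically be a reference table maintained with the business, not a hard-coded dictionary.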

Policy Idea #4: “We will cleanse all the data in legacy”

With this policy, the project leader aims to tackle all the data problems in the original systems before migration.

Solving issues in the legacy system can be highly beneficial because you’re not only improving data quality for the planned target system but also giving value back to the current users of the system.

There is one challenge with a policy of legacy cleansing, however. Inadvertently ‘fixing’ the data in existing systems can cause upstream or downstream process failures. I’ve created many knock-on issues in the past by correcting incorrect coding values or de-duplicating equipment masters.

You need to understand fully where your data is accessed and processed in the legacy environment before you start making seemingly beneficial fixes.
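One way to reduce the risk is a simple impact check before each fix. As a sketch, before retiring one of two duplicate equipment records, count how many downstream rows still reference it (sqlite3 is used purely for illustration; the table and column names are hypothetical):

    import sqlite3

    # Tables known (or suspected) to reference equipment records.
    REFERENCING_TABLES = [("work_orders", "equipment_id"),
                          ("maintenance_plans", "equipment_id")]

    def references_to(conn, equipment_id):
        # Count rows in each dependent table that would be orphaned
        # if this equipment record were merged away.
        counts = {}
        for table, column in REFERENCING_TABLES:
            cur = conn.execute(
                f"SELECT COUNT(*) FROM {table} WHERE {column} = ?",
                (equipment_id,))
            counts[table] = cur.fetchone()[0]
        return counts

A non-zero count means the ‘fix’ needs a remapping step, not just a deletion.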

Data migration project leaders – Best practice

In this post, I’ve outlined some of the typical overarching policies I’ve witnessed on migration projects.

Hopefully what this illustrates is that the best approach for project leaders is to avoid blanket, central policy making and instead adopt a ‘best-fit’ strategy for managing data issues, choosing from all the different techniques available based on each specific case.

Creating a best-fit strategy starts with thorough data profiling and discovery. Armed with those findings, you can choose the optimal strategy for moving forward.