Data-first exploration: What and why?

To have effective data quality, you need to know what your data is supposed to look like.

This sounds obvious, but the data-first exploration needed to achieve this is often overlooked during the data quality process.

Why do you need it?

No matter how detailed your existing spec is, and no matter how much effort went into creating it, operational usage inevitably causes the actual data to drift away from the spec. Usually the data ends up more complicated than the spec allows for.

The cause is the human element, even for very well-run businesses. Any system where humans input or modify the data is going to be affected by the unpredictable. Because humans have habits, humans like shortcuts, and humans will use their judgment when no path is clearly “more correct.” And different people’s judgments will be… different.

Users will affect the data.

That includes how the data conforms to the spec.

The National Highway Traffic Safety Administration collects data about fatal car accidents. There are detailed specifications about how to record the data, like the name of the road the accident occurred on: just the street name (no number), spell out the street type, don’t prefix county roads, etc.

It’s very clear about what you’re supposed to do. But does the data conform to this?

No, of course it doesn’t. First responders to the scene of a fatal vehicle accident aren’t putting effort into remembering whether the NHTSA wants “Wisconsin Ave” or “Wisconsin Avenue.”

It’s almost always counterproductive to blame people for not conforming to a data spec. Data entry or modification is usually auxiliary to a person’s actual job, whether that’s customer service, providing medical care, or law enforcement.

If variation from the specified format within a field were the biggest issue, you would probably still be able to use the spec for data quality fairly effectively.

It isn’t, though.

Valuable information can end up in a “Notes” field because a new business need arose after the system was implemented and the software wasn’t updated.

Or new varieties of a product or service come and go without the spec ever being updated. The data about those varieties lingers in the database, undocumented in the spec, as the people who were there and knew what it’s about begin leaving the company.

Or maybe the issue didn’t arise at the point of data entry. In one project, the Great Migration of 2004 had truncated a small percentage of values in a certain field. (This still counts as originating from the human element, since the migration didn’t do itself.)

By looking at the data first, you can build a spec for your data quality program that actually accounts for what you have.

But how do you do it?

There are 3 basic parts to data-first investigation:

Look at your raw data. No, not like sitting and randomly scrolling through it (although if you’re unfamiliar with the dataset you should probably do a bit of that too).

Use an analysis and profiling tool to get an idea of what’s in there and let it point you to records of interest: statistical outliers, uncommon data patterns, values that occur too frequently or not frequently enough. And, yes, variations in form and format.

Match your data. It’s all too easy for the same person, place, or thing to end up represented more than once in the data, even if you only have one system.

Here’s where bigger organizations have an advantage: powerful entity resolution tools make it much easier to match data in this early, unimproved-by-data-quality state.

At smaller organizations, do the best you can with what you have available, whether that’s a lower-powered but still purpose-built match/merge tool or homegrown queries and code. Be creative in investigating how you can reliably match data: don’t stick to ID numbers alone and, if you can, account for non-exact matches (for example, let Wisconsin Ave match Wisconsin Avenue).

Check your data against your spec. See if your assumptions about the processes, policies, and procedures surrounding your data hold up to reality. Does every record have an address? Is this outlier value invalid, or is there something missing from the spec?

Sometimes you might be able to look to your tools or documentation to answer these questions. Sometimes you’ll have to scour the basements (metaphorically, and possibly literally) to find the person who knows the answer.
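
If you are in the "homegrown queries and code" camp, the three parts above can be roughed out in a few lines of Python. This is only a minimal sketch under stated assumptions, not a substitute for a real profiling or entity resolution tool: the sample road names, the abbreviation table, the 0.9 similarity threshold, and the function names are all made up for illustration, and the rules in spec_violations are a loose paraphrase of the road-name example earlier in this article rather than any official validation logic.

```python
import re
from collections import Counter
from difflib import SequenceMatcher

# Hypothetical sample values for a road-name field (not real records).
ROAD_NAMES = [
    "Wisconsin Avenue",
    "Wisconsin Ave",
    "WISCONSIN AVE.",
    "4500 Wisconsin Avenue",
    "",
]

# Assumed abbreviation table; grow it from what profiling turns up.
STREET_TYPES = {"ave": "avenue", "st": "street", "rd": "road", "blvd": "boulevard"}


def shape(value: str) -> str:
    """Reduce a value to its pattern: runs of letters -> A, runs of digits -> 9."""
    collapsed = re.sub(r"[A-Za-z]+", "A", value)
    return re.sub(r"[0-9]+", "9", collapsed)


def profile(values):
    """Part 1: count how often each pattern occurs so the rare ones stand out."""
    return Counter(shape(v) for v in values)


def normalize(value: str) -> str:
    """Lowercase, drop punctuation, and expand known street-type abbreviations."""
    words = re.sub(r"[^\w\s]", "", value.lower()).split()
    return " ".join(STREET_TYPES.get(w, w) for w in words)


def is_match(a: str, b: str, threshold: float = 0.9) -> bool:
    """Part 2: non-exact matching on normalized values (threshold is a guess)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold


def spec_violations(value: str):
    """Part 3: check one value against rules paraphrased from the example spec."""
    problems = []
    if not value.strip():
        problems.append("missing")
    if re.match(r"\s*\d", value):
        problems.append("starts with a number")
    if any(w.strip(".").lower() in STREET_TYPES for w in value.split()):
        problems.append("abbreviated street type")
    return problems


if __name__ == "__main__":
    # Most common patterns first; the tail of this list is where to look.
    for pattern, count in profile(ROAD_NAMES).most_common():
        print(f"{count:>3}  {pattern!r}")
    print(is_match("Wisconsin Ave", "Wisconsin Avenue"))
    for name in ROAD_NAMES:
        print(repr(name), spec_violations(name))
```

The pattern counts are the part that tends to surprise people: a shape that shows up only a handful of times is either bad data or something the spec never mentioned, and the code can't tell you which. That question still goes to the documentation or the person in the basement.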

To get the best result, work through these steps as a very interactive and iterative process. It’s better to get some results quickly and use those to inform your next try than to spend a long time on each step trying to get everything you possibly can the first time through.

The importance of the data-first approach grows with how much data you have. If you have relatively few systems, and people with a lot of direct experience with the data, you might be able to get away with combining that direct experience with your spec plus maybe a pass or two of the data-first process.

For large enterprises, data-first work is extremely important. Large amounts of historical data, siloed operations, and a large variety of systems multiply the opportunities for hidden types, complexities, and workarounds to manifest in your data.