Data First (Part 2): The Spec Is In The Data

Jordan Barrette

If you’re getting ready to start a new integration, application, or analysis project, you’re probably doing all kinds of requirements gathering. Unfortunately, unless you’re considering data first, the copybooks are probably wrong, the state diagrams are probably incomplete, and the business processes are probably too simplistic.

These problems are extremely prevalent in data-centric projects.

Traditionally, you use a data last method—you build up the specification a priori, and only use the full, real data set when you try to put your project in production.

But the specifications, even when made with an intense effort from smart people, still end up in conflict with the data. While the project will eventually operate on the data at the micro level, neither the tools nor the methods used to build it emphasize the data truth.

So your specifications will be wrong. And you won’t know it until it’s already cost your project team time, money, motivation, and sanity.

Even though intuition may tell you that a well-run business will naturally have system implementations that strictly reflect high level specifications and processes, there is still the human element.

Your project is almost certainly going to have users, either humans or other systems built by humans. And if you’re using existing data, it will come from other systems with their own (human) users.

Users are unpredictable: humans have habits, humans like shortcuts, and humans are forced to make decisions when no path is clearly the correct one.

Users will affect the data directly.

For instance, the National Highway Traffic Safety Administration (NHTSA) makes data about fatal accidents, partly sourced from a form that police officers fill out, publicly available. NHTSA includes a lengthy specification for the files, and most of their data adheres to it.

But look at the specification for the field called “TWAY_ID”. It identifies the name of the road, or roads, that the accident occurred on. There are all sorts of rules for this field’s data: use just the street name instead of the full address, spell out the street type (e.g. “street”), use “SR-” as the prefix for a state highway, don’t use a prefix on county roads, etc.

But if you profile, analyze, or even just look at the data, you’ll see that people have entered the data in all kinds of ways, and many don’t conform to the NHTSA’s meticulous spec.

For instance, a simple token analysis of the TWAY_IDs for fatal accidents shows lots of street type abbreviations (e.g. “ST”, “AVE”). For accidents in Wisconsin, we see a lot of “CR”s, meaning “County Road”.
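A token analysis of this kind can be sketched in a few lines of Python. This is only a minimal illustration: the sample values are invented and are not real NHTSA records, and the tokenizing rule (split on whitespace and hyphens) and the list of suspect tokens are assumptions made for the example.

```python
from collections import Counter
import re

# Hypothetical TWAY_ID values, invented for illustration (not real NHTSA records).
tway_ids = [
    "SR-29",
    "MAIN ST",
    "CR K",
    "OAK AVE",
    "ELM STREET",
    "CR B",
    "1ST ST",
]

def token_counts(values):
    """Count whitespace- or hyphen-separated tokens across all values."""
    counts = Counter()
    for value in values:
        counts.update(t for t in re.split(r"[\s\-]+", value.upper()) if t)
    return counts

counts = token_counts(tway_ids)

# Assumed list of tokens that the spec, as described above, says should not appear:
# abbreviated street types and a county road prefix.
suspect = {"ST", "AVE", "CR"}
violations = {tok: n for tok, n in counts.items() if tok in suspect}

for tok, n in sorted(violations.items(), key=lambda kv: -kv[1]):
    print(f"{tok}: {n}")
```

On a real file, the interesting part is usually the tokens you did not think to put in the suspect list, so it is worth reading the full frequency table as well.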

Although both these examples are wrong according to the specification, it’s usually counterproductive to blame the user. We have to keep in mind that data entry is often auxiliary to the user’s primary duties, such as customer service. This is even more true in a fatal accident situation, where there are far more important tasks for the police officers than remembering the NHTSA’s street name formatting preferences.

With data last methods, however, you blindly follow the specification to perfection and you expect the users to do the same. But the real specifications need to come from the data. And it’s not just because users put data in strange formats.

You will likely have data that semantically differs from the spec. It might be a simple case where valuable information is shoved into the “notes” field because there was no other place to put it. Or it could be a more complex case where structures and values contradict the specifications’ understanding of a field.

Either way, this suggests that the business processes, policies, or procedures (we’ll call them measures, a term from the healthcare world to describe best practices for patient interaction) differ from what was initially modeled by the existing system’s designers.

If you find that the specification measures are not supported by the data, the data can suggest what measures the users are actually following.

For example, consider a data migration for a state department of transportation. We did this as a professional services project using MIOedge.

We were given tons of specifications for what existing data looked like, so that we could integrate it into a shiny new single database. Specification issues come in all shapes and sizes; in one project we did, the Great Migration of 2004 had caused a small percentage of businesses’ tax IDs to be truncated. In this case, the spec didn’t explain how to decode the status flag for some registrations, among other issues.

The bigger issues came from measure problems: it turns out there were certain edge cases, some only valid for brief periods of time in the history of the department, that weren’t included in the spec. For example, there were instances where the data showed that a particular registrant and vehicle had a certain type of license plate, but the specification claimed that the plate type wasn’t allowed for that vehicle and registrant.

These types of measure problems are common across organizations and data sets. But you can’t ignore these inconvenient contradictions: you risk leaving out important, real business cases and having all types of issues, from skewed analytics to improper billing, go unnoticed.

Imagine being one of those customers with a license plate the specification claims isn’t allowed. Imagine being on the phone or at the DMV, arguing with an agent who’s trying to convince you that there’s no possible way for you to have the plate type that you have.

Both the customer and the business lose.

The agent loses as well, because they’re forced to work within the constraints of the over-simplified system, where business measures are inaccurately modeled. Even worse, the measure could contain invalid or unreachable states.

In our RealinfoQA healthcare quality offering, which is built entirely on our platform, we calculate if best practices were followed for a patient during treatment. These best practices can come from internal initiatives, but many also come from outside agencies, especially the measures that are required by the government for determining Medicare and Medicaid reimbursement.

Our secret is that we represent the measure algorithms themselves, with their steps and paths, as data. This data first approach always retains the exact spec of the algorithm in the data, allows our users to design their own best practices, and lets us update regulatory measures quickly.

Just as important is that this lets us verify that a new process is even possible, which we need to do: surprising as it might be, we’ve found impossible states and paths in new algorithms created by well-respected agencies.
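To make the idea concrete, here is a minimal sketch of what checking a measure for impossible states could look like once the measure is stored as data. It is not how RealinfoQA works internally; the step names, the dictionary representation, and the `check_measure` helper are all hypothetical. The technique is plain graph reachability: a step is unreachable if no path leads to it from the start, and it is a dead end if no path leads from it to the end.

```python
from collections import deque

# A hypothetical measure represented as data: each step maps to the steps
# that may follow it. Step names are invented for illustration.
measure = {
    "start": ["assess"],
    "assess": ["treat", "refer"],
    "treat": ["done"],
    "refer": ["waiting"],   # dead end: no path from here to "done"
    "waiting": [],
    "audit": ["done"],      # unreachable: nothing leads to "audit"
    "done": [],
}

def reachable(graph, source):
    """Return every step reachable from `source` using breadth-first search."""
    seen = {source}
    queue = deque([source])
    while queue:
        step = queue.popleft()
        for nxt in graph.get(step, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def check_measure(graph, start, end):
    """Find steps that can never be reached and steps that can never finish."""
    from_start = reachable(graph, start)
    # Reverse the edges to find every step that can reach `end`.
    reverse = {step: [] for step in graph}
    for step, nexts in graph.items():
        for nxt in nexts:
            reverse.setdefault(nxt, []).append(step)
    to_end = reachable(reverse, end)
    unreachable = set(graph) - from_start
    dead_ends = from_start - to_end
    return unreachable, dead_ends

unreachable, dead_ends = check_measure(measure, "start", "done")
print("unreachable:", sorted(unreachable))
print("dead ends:", sorted(dead_ends))
```

Because the measure is just data, the same check can run every time a user edits a best practice or a regulatory measure is updated.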

So in a world of incomplete specs and misleading measures, how do you efficiently tackle data-centric projects with a data first mindset?

First, look at the raw data. This sounds so simple, but nothing is better than getting your hands dirty in the actual data your application, analytics, or integration project will rely on.

Because this is the stage where you are looking for micro-level data problems, you should strive for a good understanding of what the values in your data represent. To do this, you should make sure your tool allows you to explore all the data, not just samples. Surprisingly, it’s only recently that we’ve seen data curation and data preparation tools that emphasize looking at the actual data rather than treating it as a black box.

“That sounds great,” you say, “but I have a really big data set… inspecting the actual values of the data is way too time-consuming.”

Remember Data First Part 1? Your tools should let you analyze and profile the data to verify the initial specifications at a high level, and then let you move between the statistics and the data itself to investigate individual groups of problems.
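As a rough sketch of moving between statistics and the data itself, the following Python profiles a column by the "shape" of each value and keeps the row indices for every shape, so that any suspicious group can be opened up and inspected. The sample tax IDs, the expected pattern `99-9999999`, and the helper names are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical tax ID column, invented for illustration.
tax_ids = ["12-3456789", "98-7654321", "12-34567", "45-1234567", "9876543", "77-123456"]

def shape(value):
    """Reduce a value to its pattern: digits become 9, letters become A."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)

def profile(values):
    """Group row indices by shape so each statistic links back to its rows."""
    groups = defaultdict(list)
    for i, value in enumerate(values):
        groups[shape(value)].append(i)
    return dict(groups)

groups = profile(tax_ids)
expected = "99-9999999"  # assumed format taken from the (hypothetical) spec

# High-level view: one line per pattern, most common first.
for pattern, rows in sorted(groups.items(), key=lambda kv: -len(kv[1])):
    flag = "" if pattern == expected else "  <- investigate"
    print(f"{pattern:12} {len(rows)}{flag}")

# Drill down: the actual values behind each unexpected pattern.
outliers = {p: [tax_ids[i] for i in rows] for p, rows in groups.items() if p != expected}
print(outliers)
```

The profile runs over every row rather than a sample, which is what lets a small percentage of truncated values show up at all.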

It can also be helpful to fix micro data quality issues during this step, as you’re looking at the data, so it’s good if your tools also support prescribing, previewing, and documenting those transformations directly.

Second, synthesize data. Even if you have a perfect primary data source (and can point to one as primary), other supplementary, non-structured, semi-structured, and open data sources are sure to present integration issues or inadvertently skew your ultimate understanding of the world.

It’s important to have tools that help bring data together in less-than-ideal situations, since datasets scattered between systems and organizations are rarely designed for compatibility. Make sure your tools have features to help bring data without perfect keys together (we call this step clustering), as well as features to decide between conflicting attributes in the data when it’s brought together (we call this step synthesis).

To truly work data first, make sure your tools support investigating how the data came together and why a particular attribute set was chosen. This is especially important if your tools leverage advanced statistical and machine learning techniques for the clustering and synthesis steps. 
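The two steps can be illustrated with a deliberately small Python sketch. It is not the clustering or synthesis that any particular product performs: the records are invented, name similarity from the standard library's `difflib` stands in for real matching logic, and "most recently updated record wins" stands in for a real conflict-resolution policy. The one design point worth noting is that the synthesized entity keeps its provenance, so you can later ask why a value was chosen.

```python
from difflib import SequenceMatcher

# Hypothetical records from two systems that share no key.
records = [
    {"source": "billing", "name": "Jon Smith", "city": "Milwaukee", "updated": "2015-03-01"},
    {"source": "crm", "name": "John Smith", "city": "Madison", "updated": "2016-07-15"},
    {"source": "crm", "name": "Ann Lee", "city": "Green Bay", "updated": "2016-01-10"},
]

def similar(a, b, threshold=0.85):
    """Assumed matching rule: case-insensitive string similarity above a threshold."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def cluster(rows):
    """Greedy clustering: a record joins the first cluster holding a similar name."""
    clusters = []
    for rec in rows:
        for group in clusters:
            if any(similar(rec["name"], other["name"]) for other in group):
                group.append(rec)
                break
        else:
            clusters.append([rec])
    return clusters

def synthesize(group):
    """Resolve conflicts by taking attributes from the most recently updated
    record, and record where they came from."""
    winner = max(group, key=lambda r: r["updated"])  # ISO dates sort as strings
    return {
        "name": winner["name"],
        "city": winner["city"],
        "chosen_from": winner["source"],
        "sources": [r["source"] for r in group],
    }

entities = [synthesize(group) for group in cluster(records)]
for entity in entities:
    print(entity)
```

Greedy matching on a single field is far too naive for production data, but it is enough to show where the two decisions (which records belong together, and which attribute wins) are made.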

Third, add and analyze measures. After you bring things together you can quickly analyze if your assumptions about the business measures (the processes, policies, and procedures being followed) are correct. Most often, they are almost correct.

You will likely find examples where the specification and the data do not align, most likely because the measures are incorrectly modeled. In these cases, you should look to your tools to help identify the types of cases, but you may have to scour the basements to find the people who can tell you if the cases you found are actually valid within the business.

One of the great things about having a context-based system is how easy and efficient it is to discover, on a contextual level, that something is wrong because the known business measures and the data differ. Finding these problems is a task we call contextual data quality.

It’s helpful at this stage to have tools that support logic collocated with the data model. You also need strong analysis capabilities for identifying potential conflicts and unique scenarios in what is likely a very large pool of data.
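A minimal sketch of this kind of check, using the license plate example from earlier, might look like the following. The rule table, the vehicle classes, and the registrations are all hypothetical. The code compares each record against the measures as the spec states them, then groups the conflicts by case so that each distinct case can be taken to someone who knows whether it is valid.

```python
from collections import Counter

# Hypothetical measures from a spec: plate types allowed per vehicle class.
allowed = {
    "passenger": {"standard", "specialty"},
    "truck": {"standard", "commercial"},
    "motorcycle": {"standard"},
}

# Hypothetical registrations, invented for illustration.
registrations = [
    {"id": 1, "vehicle": "passenger", "plate": "standard"},
    {"id": 2, "vehicle": "truck", "plate": "commercial"},
    {"id": 3, "vehicle": "motorcycle", "plate": "specialty"},
    {"id": 4, "vehicle": "motorcycle", "plate": "specialty"},
    {"id": 5, "vehicle": "passenger", "plate": "commercial"},
]

def find_conflicts(rows, rules):
    """Return rows whose (vehicle, plate) combination the spec does not allow."""
    return [r for r in rows if r["plate"] not in rules.get(r["vehicle"], set())]

conflicts = find_conflicts(registrations, allowed)

# Group by case: a repeated conflict is more likely a missing rule than a typo.
by_case = Counter((r["vehicle"], r["plate"]) for r in conflicts)
for (vehicle, plate), n in by_case.most_common():
    print(f"{vehicle} with {plate} plate: {n} record(s)")
```

The output is a list of questions for the business, not a list of records to delete: some of these cases will turn out to be real measures that the spec forgot.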

In all three steps, the data first methodology works best using tight iterative loops. It’s important that the tools you use provide fast feedback. We’ve found the best way is to visualize everything (the data itself, the effects of transformations, and the synthesis of information) as the data itself and the tool’s parameters are changed.

The usefulness of data first will not be ephemeral. The types and complexity of data and processes modeled in systems continue to grow. As that happens, the traditional approaches will continue to break down, and data first principles will continue to become more important.


Stop working with data last. Start working data first. It’s the only way you can be sure you’re working with the data truth.

- Make sure you follow the data first methodology:

  1. Play with the raw data
  2. Synthesize the data
  3. Add and analyze the processes

- Take a look at your current toolset: if it doesn’t include data first features, ask your vendor(s) if they have a plan to include them.