Our Software
Solutions

Entity resolution in MIOvantage: An explainer

Entity resolution is an essential technology for deriving meaningful insight and use from your data. While the basic concept is relatively straightforward, implementing it is considerably more complex.

Over time, your data will change and evolve. In order to keep meeting your operational and analytical needs, MIOvantage equips you with the following features:

With MIOvantage, you can carry out entity resolution that's maintainable over time and is high-performance for immediate entity resolution activities, while retaining a detailed data lineage.

Representing Entities: Modeling

MIOvantage's entity resolution is driven by a central business model. Unlike traditional data models, our model is highly flexible and allows you to leverage industry-, domain-, and company-specific knowledge.

A key differentiator: our model uses a business-driven perspective of the entities. Its primary lens is the business, not the physical or logistical structure of a data storage system. This often results in a multi-entity, multi-domain model.

The model defines first-order concepts about the entities that you're interested in. In addition to the base data about the entities that is already being held, this also can include computations like profitability, favorite locations, historical buying patterns, or upsell opportunities.

This is possible because our business model derives from the idea of the object model, and has many of the same properties, including logic, polymorphism, and hierarchical composition.

Model boundaries

The model also represents information about the relationships between entities. These are established by what MIOvantage calls context boundaries.

Context boundaries are the one aspect of MIOvantage's modeling for entity resolution that has technical considerations: while business needs should be a major factor in the placing context boundaries, they also have an important role in determining how MIOvantage partitions and stores data. This has ramifications for MIOvantage's performance.

Representing Data: Fragments

Fragments

As data arrives, MIOvantage breaks it down into pieces that are relevant at the entity level using the model.

In most scenarios, a single piece of data—generally, when data is incoming initially, this "piece" is a single record from a single system—contains only a part of the information that the business knows about an entity.

MIOvantage uses each record to fill out the model to the extent possible. This partial instantiation of the model is what MIOvantage calls a fragment.

MIOvantage does not restrict how complete or incomplete a fragment must be on a technical level.

From a practical perspective, a fragment needs to include a certain level of detail in order to be useful. Generally, sources that cannot consistently provide fragments with enough information should be identified and ruled out during data preparation.

EXAMPLE

Telecom Inc.'s incoming data contains fragments A and B, which both represent a customer.

There is some overlapping information in the fragments:

But each fragment also contains information which the other fragment does not contain.

Sample Fragments for Telecom Inc

In these example fragments, looking at the overlapping data suggests that these fragments could represent the same person, since they share a SSN, last name, and first and middle initials.

Depending on the context in which you are using the data, this may be an acceptable amount of evidence. In other situations, you may need additional information in order to consider the two fragments to be about the same person.

This is the type of deduction and decision-making that must be encoded in the system: having a person review each fragment pair is not feasible, and human review will have imperfect consistency of decision-making. Automated decision-making, on the other hand, is scalable, consistent, and reviewable.

Finding Entities: Clustering

As data volumes increase, it becomes impossible to maintain a strategy of comparing each fragment for potential matches with every other fragment while also maintaining an acceptable level of performance.

To avoid this, MIOvantage uses an approach where it sorts fragments into smaller pools as soon as they enter the system. Fragments within these pools are then grouped together with other fragments that are likely to match each other.

This "likelihood" is defined by MIOvantage, which lets you define the piece(s) of data used for fragment pool sorting on a per-project basis.

The item or combination of items that you use to define the fragment pools is called the fragment index. Each pool is made up of fragments which have identical fragment index values.

As MIOvantage indexes each fragment and adds it to the pool, the pool retains a copy of the fragment in its current state. This copy will be retained even after the fragment has passed further into the entity resolution process.

Once in the pool, the fragments in the pool are clustered, identifying which fragments are actually about the same entity. Clustering is approximately equivalent with matching in match-merge-link terminology.

The resulting clusters are groups of fragments that are about the same entity; each group (cluster) is then passed through the entity resolution process. There's no limit on the size of a fragment group; it could contain hundreds of fragments, dozens, or just one.

The entity is constructed by considering all of the fragments in the cluster.

Using a fragment pool reduces the amount of comparison processing needed to establish clusters, which allows MIOvantage to maintain the speeds needed for high-performance high-volume entity resolution.

Technical discussion: Index value selection

Selecting values for the fragment index is partly driven by the business perspective of the entities: what data is both valuable and available for identifying an entity, and what isn't.

But you also need to consider the performance needs of the solution: more closely the fragment pools correspond to actual entities, the better the solution will perform. The more important high performance is, the more important it is to select an effective fragment index.

However, even for a well-selected fragment index, it's possible that at some point a disproportionately large fragment pool will be created. Often, though not always, this happens because one of the pieces of data included in the index has a default value in a large segment of records.

The too-large fragment pool can impact MIOvantage's performance. To combat this, MIOvantage allows you to specify a maximum pool size. Any fragment pool that exceeds this size will be dissolved, and the index value that was used to create the pool will no longer be considered a valid index value, for existing or for incoming fragments. The index will continue to work normally for all other values that occur for the index.

Fragment Clustering

In production systems—i.e., in circumstances that involve real data—it's rare to find a one-size-fits-all approach that works to create effective fragment clusters. The wide variety in fragments and the different ways that matches can be identified necessitates an entity resolution solution that can use many different strategies.

MIOvantage makes this possible this by using unsupervised machine learning to permit fragment matching based on a wide variety of functions.

Fragment matching in MIOvantage can be based on individual pieces of data, using:

Concrete examples of matching function types that are commonly employed in MIOvantage are shown in Table 1. These functions are specified at the level of individual pieces of data within fragments, allowing very complex definitions of how two fragments are allowed to match.

Sample Cluster Function Types

Function Type Allows matching
Ignore To ignore elements such as special characters and leading or trailing 0s.
Phonetic Based on phonetic similarity using algorithms such as Soundex, Cologne, Phonix, Metaphone, QGram, and more.
Substring Based on similar substrings, prefixes, or postfixes.
Error To take place even with a certain amount of typographical errors, transpositions, or errors per characters or tokens.
Numeric Based on closeness of numeric values or other arithmetic functions.
Data-Specific Based on known reference data.
Data-Type-Specific Based on data-type specific functions, such as temporal or geospatial proximity.

EXAMPLE (CONT.)

In the current example scenario, Telecom Inc. decides that for any two fragments similar to A and B in structure, the fragments can be considered to match if their SSNs are identical and their names are very similar, or if the SSNs differ by one typographical error and all name fields are identical.

In practice there is almost always more than one way in which two fragments can be considered a match to each other. To accommodate this, MIOvantage expresses match rules as matrices.

A rule matrix is a group of individual rules.

Each individual rule represents a different way in which two fragments should be allowed to match. For fragments to match by a rule, all of the rule's conditions must be true for the fragment pair.

The matrix as a whole represents that the match should be allowed at all. If fragments match by at least one of the rules within a matrix, they are considered to match by the matrix as a whole.

In other words, a match takes place if the boolean AND of all conditions for a rule are true, for any rule (boolean OR) in the matrix.

EXAMPLE

Shown below is an example cluster matrix for Telecom Inc.

Rule 1 Rule 2 Rule 3 Rule 4
Social Security Number Equal Equal Equal 1 Typo
Date of Birth Equal Equal
First Name First Initial First Initial Equal Equal
Middle Name First Initial Equal
Last Name Phonetic Similarity 2 Typos

In actual deployments, cluster matrices can be much more extensive and complex than shown in this basic example.

MIOvantage's cluster rule implementation supports these more involved rules in multiple ways, including:

Transitive Matching

MIOvantage uses a transitive match strategy as part of its clustering, which allows it to support a greater degree of nuance than direct-match-only strategies do.

With transitive matching, fragments are considered to match each other—and therefore, can belong to the same cluster—under the following circumstances:

Transitive matching allows, theoretically, for two fragments to match even if they have no data whatsoever in common. In practice, fragments which match will often have something in common, although in the absence of the other fragments this commonality would not be definitive.

Transitive matching is particularly useful in cases where fragments originate from sources which have very different purposes, and therefore have relatively little data in common even though they are about the same entity.

Transitive matching is also useful in scenarios involving spatial data, like geographical events. Any two given events may have occurred too far apart to be considered directly related. However, an event at an intermediate point might make a relationship between those distant points more probable; events at multiple intermediate points make it even more so.

Finally, transitive matching also plays an important role in making entity resolution possible long-term. Because fragment matching in MIOvantage is available even for resolved entities, transitive matching can be used to combine and split existing resolved entities as new fragments bring additional information about the entity or entities.

Building Entities: Synthesis

An entity will usually be created using data from more than one fragment, although MIOvantage does not require this.

The more important to the business an entity is, the more likely it is that multiple systems know about the entity, creating a multi-fragment cluster. However, even comparatively minor entities are often created from multiple fragments.

Multi-fragment entities introduce the possibility that fragments which are about the same entity can have conflicting data about that entity.

There can be a variety of underlying reasons for these conflicts. Common ones include:

When these conflicts are present, they must be resolved (so that a consistent entity can be available for analytics, operations, and other uses), and/or identified (so that the source of the conflicts can be addressed).

After MIOvantage matches fragments, it uses a process called synthesis to determine the resolved picture of the entity.

Synthesis resolves conflicts between two or more values by examining not just the values themselves, but their metadata and the sources they originated from. Synthesis is approximately equivalent to merging in match-merge-link technology.

The values selected by the synthesis rules are used to populate the final representation of the entity. Generally, it is this resolved entity that is made available to end users, applications, and databases.

Note that MIOvantage does not require all of the selected values to come from the same fragment(s), or to come from any particular number or combination of fragments.

Synthesis rule definition

MIOvantage provides extensive facilities for defining the synthesis process.

At the base level, MIOvantage can create automatic synthesis rules using a rule wizard. The wizard automatically detects the attributes and relationships of the entity, then creates a default synthesis rule for each one.

In most situations, you will want to modify the wizard-generated rules in order to tailor the synthesis behavior more precisely to your needs. You can also create rules from scratch.

Metadata that synthesis rules can evaluate include:

Although the synthesis process produces a resolved entity, MIOvantage also retains the original, pre-synthesis fragments. These are maintained and accessible as long as the entity exists in MIOvantage.

Typically, fragments are only accessed directly for non-operational purposes like validating the entity resolution process. However, the fragment data can be used instead of or in addition to the resolved entity data operationally, if needed.

Entity Provenance

Even after an entity has been resolved, MIOvantage maintains all of the original fragments that were used to create it, even those that did not contribute any data the resolved entity.

This allows MIOvantage to maintain the provenance of a cluster and its associated entity. This is primarily leveraged in two major ways: entity re-evaluation, and entity investigation.

Entity re-evaluation

Consider two fragments which do not match directly, and do not have any mutual matches between them. These fragments would belong to distinct entities in MIOvantage.

However, operational systems routinely have new data enter the system.

New fragments, when combined with the existing fragments, may create a mutual match (or chain of matches) between fragments that previously had no connection. This new match indicates that two (or more) previously-formed entities are actually the same entity.

But this discovery of a new match is only possible if all original fragments are kept, since some or even all data from a fragment may not be retained to the final entity during the resolution process.

If this originally-unused fragment is essential to the new match, discarding it precludes the ability to ever make that new match. This places significant limitations on the ability to re-evaluate entities: particularly, data from any system that is added after the initial matching takes place is less useful.

Since MIOvantage does retain all original fragments and always works from the fragment level, new data can be continually evaluated in context with the existing data at the fragment level. Any data contributed to the system can contribute equally to the entity resolution, even data added after the initial deployment or data that was not used the first time an entity was resolved.

For example, consider a pair of fragments A and B. When these fragments first enter MIOvantage, there is no direct or mutual match between them, so they are placed in separate clusters.

MIOvantage creates separate entities, Entity 1 and Entity 2, based off of the clusters containing fragments A and B. The entities are then made available for operational use.

Later, a new fragment C enters the system. This fragment matches both A and B.

When this match is made, Entity 1 and Entity 2 needs to be re-evaluated.

Fragments A, C, and B now belong to the same cluster (match group). The cluster is passed through the synthesis process, resulting in Entity 3.

MIOvantage removes Entity 1 and Entity 2, and adds Entity 3. This process is relatively efficient because MIOvantage maintains the previously-created data structures.

Although this example illustrates the essential behavior of what is taking place during entity resolution, note that in practice it would be common for the linked entities to be resolved from multiple fragments each. It is also possible for more than two entities to be re-evaluated as a single fragment group.

Entity investigation

Since MIOvantage retains the original fragments that were used to create an entity, these fragments can be retrieved and examined even while the resolved entity is also being used operationally.

This allows the entity resolution process to be investigated and analyzed long-term, not just during initial development. Fragment access post-synthesis can be used for purposes like continuous improvement, error correction, and root cause analysis.

These types of long-term improvement is only possible when the original fragments remain accessible; if only the resolved entities are retained, data is technically lost at every step.

For example, consider a situation in which a customer service representative enters a person's name as "Cate."

This fragment then goes through the entity resolution process.

If the existing entity—which matches the fragment in every other respect—spells the name "Kate," and no other information is retained, incorporating the new fragment into the entity is not well-informed: is the entity name misspelled, or is the new fragment's name misspelled?

But if the fragments behind the existing entity are retained, there can be additional evidence regarding which value is correct, or possibly evidence that the correct value is not known.

Consider a situation where fragments used to create the existing entity are evenly split between "Cate" and "Kate." Depending on the MIOvantage configuration, the entity can be flagged because the name choice cannot be made confidently, or synthesis rules may use an alternate form of decision-making where value frequency is not decisive.

For instance, in the top Entity 25 of the diagram, perhaps "Kate" was selected because the fragment that was created earliest used "Kate," or because a generally-trustworthy system used "Kate," because "Kate" is more common than "Cate," or perhaps "Kate" was simply selected arbitrarily.

But consider a situation where the majority of the fragments use the name "Kate," with occasional occurrences with other spellings. In this situation, it's possible to be relatively confident that the new fragment contains the misspelling, and that the entity should retain the spelling "Kate." This same logic would hold no matter which spelling was most common, allowing for occasional data entry errors without constantly changing the entity's resolved name.

This kind of assessment is only possible when all fragments are retained.

Connecting Entities: Relationships

Once an entity exists, its own data may suggest potential relationships that it could have with other entities.

For instance, a customer of an insurer has a relationship to the policy that they hold. They also have a relationship to the beneficiaries of that policy, and a relationship to address where they live. If other customers live at that same address, or insure property at that address, the original customer is likely have a familial or social relationship with those other customers.

This type of relationship can be represented by MIOvantage, then automatically identified and created during synthesis. This relationship discovery in MIOvantage is approximately equivalent with linking in match-merge-link terminology.

MIOvantage's model uses an explicit relationship modeling approach. Entity relationships can be categorized as either "snipped" or "unsnipped."

An "unsnipped" relationship means that the two entities, as defined in the model, should be considered as one entity from a business perspective. Essentially, an unsnipped relationship is an intra-entity relationship.

For example, for various reasons, the MIOvantage model may represent a personal name as an entity with an unsnipped relationship to a customer entity. There are reasons to make this distinction in the model, but those distinctions are not important from the perspective of the resolved "customer" entity.

A "snipped" relationship means that the two entities, as defined in the model, should not be considered as one entity. A snipped relationship is an inter-entity relationship.

This is the kind of relationship that a customer has to a policy, or a customer to a residence; the two entities are connected, but from a business perspective they are not one.

In other words, a snipped relationship is a context boundary, as introduced in the modeling discussion.

MIOvantage's process for automatically discovering snipped relationships between different entities is particularly powerful because it can use the fragment clustering mechanisms to identify and create these relationships even when they are not explicitly denoted in the data.

One of these primary mechanisms is the data that identifies a potential relationship with another entity. This data is used to created a special fragment of that other entity's type, called a grappling fragment.

The grappling fragment maintains a reference to the entity that created it, but then undergoes the normal clustering process for an entity of its type.

Once the grappling fragment has been clustered and resolved into an entity, that entity will "grapple" back to its source entity. This creates the snipped relationship, which is immediately visible to both entities.

With a snipped relationship established, each entity can push summary data about itself to the other entity. This allows both entities to have immediate, up-to-date access to the most critical information about the entity, without manual data denormalization, and while maintaining the performance benefits of MIOvantage's approach to entity storage.

As an example of grappling, consider a Policy. In addition to the policyholder, who is a known Customer, the policy also has a beneficiary.

MIOvantage can be configured to recognize the name (and any other information about the beneficiary that is known) as potential Customer* information. Therefore, in addition to the Policy fragment, each record of a policy also generates grapple Customer fragments for the policyholder and the beneficiary.

If the beneficiary of the policy is also a customer, with their own accounts, the beneficiary grapple fragment will be clustered with the beneficiary's other fragments. MIOvantage has now made it possible to connect the policyholder with the beneficiary, even if the only place where their information appears together is on the record of the policy.

This also means that second- (and more) degree connections can be identified in MIOvantage: for example, the policyholder now has a relationship with the beneficiaries of the original beneficiary's other policies. This is particularly useful for applications of entity resolution such as fraud detection.

* To better accommodate the business reality of situations like this, actual MIOvantage implementations may use a "Person" entity instead, which recognizes customers, beneficiaries, etc. under its umbrella and allows individuals to hold multiple roles. The example relationship models shown, like all those discussed here, have been simplified for illustrative purposes.

Consider a situation in which separate systems hold Policy and Customer information. A policy record has the customer ID number of the policyholder, but a customer record does not have any information about a person's policies.

In the model, you establish a snipped relationship between Customer and Policy entities.

When records are loaded from the Customer system, there is no way to connect any of them during entity resolution to any Policy records that might exist, except that every Policy has created a Customer grapple fragment from its Policyholder ID.

While the Policyholder ID is the only data that grapple fragment has, it's enough to match the grapple fragment to another Customer fragment. When entity resolution for Customers takes place, the grapple fragment is incorporated into a Customer entity and contributes its knowledge of the relationship to a specific policy to the final Customer entity, so this relationship can be represented in the data.

In this example, it was simple to match the Customer fragments for illustrative purposes. With real-life data, which often does not have reliable (or any) keys, the grapple fragment is essential for building the inter-entity relationships, since it cannot be replicated with a lookup.

Ongoing Use of Entities

Performance

Performance and data volume

A key benefit of MIOvantage is that its entity resolution performance is independent of the amount of existing data and the number of existing resolved entities.

Instead, MIOvantage's entity resolution performance is related to the amount of incoming data.

This is the case because, when a new fragment is received, it undergoes the normal indexing and clustering process. But only the fragment groups to which new fragments are actually added continue with the entity resolution process. The fragment groups which are unchanged, and the associated entities, do not change state.

Consider the example entity re-evaluation scenario. When fragment C entered the system, fragments A and B, and entities 1 and 2, were activated as a result.

But any operational enterprise would also have many other fragments and entities present in the system at the same time.

These fragments and entities were not activated by the entry and matching of fragment C, so they remained dormant.

This behavior is critical to MIOvantage's scalability. Not only does it ensure that the amount of processing needed for entity resolution is independent of the amount of stored data, but it allows MIOvantage to work effectively with data sources where data creation and updates are frequent.

Consider, for instance, streaming, transaction queue, and sensor data sources. These sources can generate large amounts of new and updated data on a continual basis.

If all data is activated during the entity resolution process, performance has two limiting factors: the amount of incoming data and the amount of existing data. With MIOvantage's approach, performance faces only one limiter: the amount of incoming data.

The re-evaluation and retrieval process is efficient because MIOvantage stores all data that belongs to the same entity together on the disk, as described in Representing Entities: Modeling. As a result, while the database as a whole can be highly distributed, each individual entity's data is not.

Storing all data about an entity together significantly reduces search and retrieval time when compared to a strategy that distributes an entity across multiple storage locations.

As a result of the entity-based storage approach, MIOvantage can take advantage of the speed of the speed of in-memory computing and the economy of disk-based storage at the same time. The value of this is particularly evident at the petabyte-plus scale, where in-memory storage is impractical with current technology.

Performance and model boundaries

For purposes of storing and re-activating entities, entities that have unsnipped relationships with each other are considered to be one entity. Thus, in the example unsnipped relationship shown previously, the Customer entity would include, and be stored with, the Name and Address entities.

The presence of a snipped relationship indicates separate entities. Thus, again considering the previous example, the Policy entity is not automatically retrieved or updated just because the Customer entity was.

Since the snipped relationships and context boundaries determine the size of entities, placement of the context boundaries needs to balance two competing factors:

When placing context boundaries, you must balance these factors along with your performance needs for MIOvantage. Under-snipping can create entities that are too large for the solution to be performant; over-snipping can create a situation where the number of entities required to do anything also reduces performance.

MIOvantage alleviates some of the compromises required for this by allowing summary information to be pushed across snipped relationships. This summary is designed to contain the most commonly-needed information about the entity, so that the related entity can know this information without having to retrieve the entire entity.

For example, in the previous Customer-Policy unsnipped relationship, the policy's number and value might be pushed to the Customer. The customer's policy numbers and their values can then be known while looking at the Customer entity without retrieving the corresponding Policy entities.

This summary data remains current because MIOvantage automatically pushes updates (if any) to all extant summaries whenever the originating entity is updated.

Summary

The basic tenets of entity resolution are relatively straightforward. But when you closely examine the actual implementation needs for entity resolution in terms of operations, analytics, and performance, you reveal more complex issues.

MIOvantage entity resolution is designed to allow flexible, multi-entity and multi-domain entity resolution, using a business model. This also allows the entity resolution to discover relationships between entities, even when these relationships are not explicitly identified in the data.

The MIOvantage solution is high-performance as data arrives, regardless of whether the data is creating new entities or updating existing entities. By maintaining the lineage of the entities with their original fragments, MIOvantage entity resolution also delivers maximum meaning over time.

As a result, the MIOvantage solution to entity resolution is both high-performance and high-quality, making it ideally suited for a wide range of applications.