Entity resolution is
the process of merging multiple files (or duplicate records within
a single file). Once merged, the records referring to the same physical
object are treated as a single record. Records are matched based on
the information that they have in common. Records that are candidates
for merging might appear to be different but actually refer to the same
person or thing. Entity resolution, record matching, and the selection of
surviving records are supported in DataFlux Data Management Studio.
SAS Master Data Management is a more powerful
and specialized application that performs similar functions.
The SAS match engine
is designed to identify duplicate records both within a single data
source and across multiple sources.
The rules-based matching engine uses a combination of parsing rules,
standardization rules, phonetic matching, and token-based weighting
to strip the ambiguity out of source information. After applying hundreds
of thousands of rules to each and every field, the engine outputs
a match key. This match key
is an accurate representation of all versions of the same data generated
at any point in time. A sensitivity level
enables you to define the closeness of the match in order to support
both high-confidence and low-confidence match sets.
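The idea of a match key whose strictness depends on a sensitivity level
can be illustrated with a deliberately small Python sketch. This is not
the SAS algorithm: the function name toy_match_key, the ABBREVIATIONS
table, the vowel-dropping "phonetic" step, and the mapping from
sensitivity to key length are all assumptions made for illustration.

```python
import re

# Hypothetical standardization table; a real engine uses far larger rule sets.
ABBREVIATIONS = {"street": "st", "avenue": "ave", "road": "rd", "north": "n"}


def toy_match_key(value: str, sensitivity: int = 85) -> str:
    """Return a simplified match key for a free-text value.

    A higher sensitivity keeps more characters of each token, so fewer
    values collapse onto the same key (a stricter match).
    """
    # Parse: lowercase, replace punctuation with spaces, split into tokens.
    tokens = re.sub(r"[^a-z0-9 ]", " ", value.lower()).split()
    # Standardize: map known variants to a single form.
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]
    # Crude phonetic step: drop vowels after the first character.
    reduced = [t[0] + re.sub(r"[aeiou]", "", t[1:]) for t in tokens]
    # Sensitivity controls how many characters of each token survive.
    keep = max(1, sensitivity // 20)
    # Sort tokens so that word order does not affect the key.
    return "".join(t[:keep] for t in sorted(reduced)).upper()


key_a = toy_match_key("123 North Main Street")
key_b = toy_match_key("123 N. Main St")
```

In this sketch, the two spellings of the address produce the same key,
and lowering the sensitivity argument makes more distinct values share a
key.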
As well as offering
pre-defined match rules, SAS has designed an extremely customizable
match engine that allows your organization to define custom matching
rules. These match rules can include any number of fields, as well
as any number of match conditions coupled together using Boolean rules
(AND/OR). Matching records are then assigned a single group ID that
can be persisted and maintained over time. Duplicate records are grouped
based on linkage rules or automatically consolidated into a single best
record.
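A match rule that combines fields with AND/OR conditions, followed by
the assignment of a group ID, might be sketched as follows. The record
layout, the field names (name_key, addr_key, phone), the RULE structure,
and the choice of the lowest record ID as the group ID are hypothetical;
the sketch only shows the general technique, not how the SAS engine is
implemented.

```python
from itertools import combinations

records = [
    {"id": 1, "name_key": "SMTHJ", "addr_key": "12MNST", "phone": "555-0100"},
    {"id": 2, "name_key": "SMTHJ", "addr_key": "12MNST", "phone": ""},
    {"id": 3, "name_key": "SMTHJ", "addr_key": "98KST", "phone": "555-0100"},
    {"id": 4, "name_key": "DOEJ", "addr_key": "12MNST", "phone": "555-0199"},
]

# (name AND address) OR (name AND phone): each inner list is an AND
# condition, and the inner lists are joined with OR.
RULE = [["name_key", "addr_key"], ["name_key", "phone"]]


def matches(a, b, rule=RULE):
    # Blank values never match, so two empty phone numbers do not link records.
    return any(all(a[f] and a[f] == b[f] for f in cond) for cond in rule)


def assign_group_ids(recs):
    # Union-find: every record starts in its own group.
    parent = {r["id"]: r["id"] for r in recs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in combinations(recs, 2):
        if matches(a, b):
            ra, rb = find(a["id"]), find(b["id"])
            parent[max(ra, rb)] = min(ra, rb)  # lowest ID becomes the group ID
    return {r["id"]: find(r["id"]) for r in recs}


groups = assign_group_ids(records)
```

Records 1, 2, and 3 end up in one group because record 1 links to
record 2 through name and address and to record 3 through name and
phone. Comparing every pair is quadratic, so a practical implementation
would first block the records on a match key.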
Note that the SAS engine
always generates the exact same match key for similar data generated
at any point in time across the enterprise. SAS is the only data quality
vendor that generates a single match key for an entity. These keys
can be generated in real time. They can also be persisted to a data
source to facilitate cross-system matching in batch or real-time environments.
After eliminating pattern
differences, the match engine applies a series of string manipulations
to each token. Then, after applying hundreds of thousands of rules,
it outputs a 15-character match key. The match key can be used to
identify the address relationship across any number of disparate data
sources. When other fields such as business name and postal code are
combined in the match process, the records can be rationalized as
a single entity.
Following this stage,
the engine can be configured to perform one or all of the following
tasks:
-
displaying the matches in report
format
-
automatically consolidating the
records into one best record
-
appending grouping keys to each
record and persisting the keys over time
-
writing match keys to an index
or a cross-reference table
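As an illustration of the last task, a cross-reference table can be
thought of as a mapping from each match key to the source records that
share it. The sketch below assumes that match keys have already been
generated; the source names, record IDs, key values, and the function
name build_cross_reference are hypothetical.

```python
from collections import defaultdict


def build_cross_reference(keyed_records):
    """Map each match key to the (source, record_id) pairs that share it."""
    xref = defaultdict(list)
    for source, record_id, match_key in keyed_records:
        xref[match_key].append((source, record_id))
    return dict(xref)


keyed = [
    ("crm", 101, "123MNNST"),
    ("billing", "B-7", "123MNNST"),
    ("crm", 102, "45KST"),
]
xref = build_cross_reference(keyed)

# Keys that appear in more than one source identify cross-system matches.
cross_system = {
    key: refs for key, refs in xref.items()
    if len({source for source, _ in refs}) > 1
}
```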
The SAS engine does
not require a change in the source or presentation data
to match the records. The SAS match algorithm applies data parsing,
standardization, and other algorithms during the match process. Some
data quality vendors require adherence to a strict methodology: parse
first, standardize second, and match third. The SAS engine applies
all of these processes during the actual match process.
The SAS engine supports
matching several types of information, including the following:
-
City, State/Province, Post Code
Record consolidation,
or duplicate elimination, merges multiple records into a single best
record. The match engine supports user-defined record-level and
field-level rules.
These rules enable you to use the engine to pick and choose information
from multiple records to compile a single version of the entity.
This record consolidation
includes both basic and advanced rules such as the following:
-
Field is not null or field is null
-
Field is not equal to X or is equal
to X
-
Field has the highest occurring
value within the duplicate record set
-
Field contains the highest value
within the duplicate record set
-
Most or least recent create date
-
Source is equal to specified
field or string
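A few of the rule types above can be sketched in Python to show how a
record-level rule and several field-level rules might combine into one
best record. The sample cluster, the field names, and the helper
functions (most_recent, first_not_null, most_frequent, highest,
consolidate) are assumptions for illustration and do not reflect the
product's actual rule syntax.

```python
from collections import Counter
from datetime import date

cluster = [
    {"source": "crm", "created": date(2021, 3, 1), "name": "J. Smith",
     "address": None, "city": "Cary", "balance": 120},
    {"source": "billing", "created": date(2019, 6, 15), "name": "John Smith",
     "address": "12 Main St", "city": "Cary", "balance": 300},
    {"source": "web", "created": date(2020, 1, 9), "name": "John Smith",
     "address": None, "city": "Raleigh", "balance": 50},
]


def most_recent(recs):
    """Record-level rule: most recent create date."""
    return max(recs, key=lambda r: r["created"])


def first_not_null(recs, field):
    """Field-level rule: field is not null."""
    return next((r[field] for r in recs if r[field] is not None), None)


def most_frequent(recs, field):
    """Field-level rule: highest occurring value in the duplicate set."""
    values = [r[field] for r in recs if r[field] is not None]
    return Counter(values).most_common(1)[0][0] if values else None


def highest(recs, field):
    """Field-level rule: highest value in the duplicate set."""
    values = [r[field] for r in recs if r[field] is not None]
    return max(values) if values else None


def consolidate(recs):
    best = dict(most_recent(recs))  # start from the surviving record
    if best["address"] is None:     # fill a gap from another record
        best["address"] = first_not_null(recs, "address")
    best["name"] = most_frequent(recs, "name")
    best["balance"] = highest(recs, "balance")
    return best


best = consolidate(cluster)
```

Here the most recent record survives, but its missing address, its name,
and its balance are taken from other records in the duplicate set.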
The SAS engine supports
persisted key clustering (or householding). This type of clustering
enables the assignment of a single unique key to any record that conforms
to user-defined match rules. The engine uses sophisticated SAS match
keys to group records and assign an integer-based unique identifier.
If a new record that matches an existing cluster enters the system, the
SAS clustering engine assigns the existing integer ID to the new record
and then logs the record into the cluster table.
Sometimes, a new record
enters the system that causes two or more unique households to collapse
into one household. In that case, the SAS engine assigns one of the
existing cluster IDs to all of the affected records and logs each old
household ID in the grouping table. This engine supports the business need to track the
lifetime activity of any entity from supplier to end consumer and
can be run in both batch and real time.
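The behavior described above (reusing an existing ID, and collapsing
households while keeping a log of the retired IDs) can be sketched with
a small in-memory table. The class name ClusterTable, its attributes,
the sample match keys, and the choice to keep the lowest cluster ID when
households merge are all assumptions; the source does not specify which
ID survives.

```python
class ClusterTable:
    """Toy persisted-key clustering table (in memory only)."""

    def __init__(self):
        self.key_to_cluster = {}  # match key -> cluster ID
        self.members = {}         # cluster ID -> list of record IDs
        self.retired = {}         # old cluster ID -> surviving cluster ID
        self._next_id = 1

    def add(self, record_id, match_keys):
        # Find every existing cluster that one of the record's keys points to.
        hits = sorted({self.key_to_cluster[k]
                       for k in match_keys if k in self.key_to_cluster})
        if hits:
            cluster_id = hits[0]  # assumption: the lowest ID survives
        else:
            cluster_id = self._next_id
            self._next_id += 1
            self.members[cluster_id] = []
        # Collapse any other matched households into the surviving one.
        for old in hits[1:]:
            self.members[cluster_id].extend(self.members.pop(old))
            self.retired[old] = cluster_id  # log the old household ID
            for key, cid in list(self.key_to_cluster.items()):
                if cid == old:
                    self.key_to_cluster[key] = cluster_id
        self.members[cluster_id].append(record_id)
        for key in match_keys:
            self.key_to_cluster[key] = cluster_id
        return cluster_id


table = ClusterTable()
a = table.add("r1", ["NAME:SMTHJ|ADDR:12MN"])
b = table.add("r2", ["PHONE:5550100"])
c = table.add("r3", ["NAME:SMTHJ|ADDR:12MN"])                   # joins r1
d = table.add("r4", ["NAME:SMTHJ|ADDR:12MN", "PHONE:5550100"])  # merges both
```

Record r4 carries keys that point to two different households, so the
second household is folded into the first and its old ID is recorded in
the retired mapping.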
For example, during record consolidation, a record-level
rule might call for the preservation of a record with the most recent
edit or create date. However, this record might not include accurate
address information. If the address exists in another record, field-level
rules can be used to extract the address from the secondary record.
Then the address in the primary record can be replaced with this trusted
content.