Organizations are aware of the risks associated with poor data quality and the devastating impact it can have across various business operations. As a result, considerable time and resources are spent every week on data cleaning techniques such as data standardization and data deduplication.
Although a reactive approach that finds and fixes data quality issues may produce results, it does not scale. Companies want a more proactive approach – a framework that looks for data quality problems on an ongoing basis and ensures that data is kept clean most of the time.
In this blog, we will look specifically at the problem of resolving entities (also known as record linkage) and discuss a comprehensive framework that can help solve it.
What Is Entity Resolution?
Entity resolution means matching different records to find out which ones belong to the same individual, company, or thing (usually termed an entity).
Entity resolution solves one of the biggest data issues faced by most companies: attaining a single view of all entities across different data assets. This means having a single record for each customer, product, employee, and other such entities.
This problem usually occurs when duplicate records of the same entity are stored within one dataset or across different datasets. There are many reasons why a company’s dataset may end up with duplicate records, such as a lack of unique identifiers, incorrect validation checks, or human errors.
How to Resolve Entities?
The process of resolving entities can be complex in the absence of uniquely identifying attributes, since it is difficult to tell which information belongs to the same individual. Below are the steps that are usually followed to match and resolve entities.
- Collect and profile scattered data
Entity resolution can be performed using records in the same dataset or across datasets. Either way, the first step is to collect all records that need to be processed and unify them in one place for identifying and merging entities. Once done, run data profiling checks on the collected data to highlight potential data cleansing opportunities so that such errors can be resolved up front.
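To make profiling concrete, here is a minimal sketch in Python using pandas; the file name and column names (name, email) are placeholders for whatever your consolidated extract actually contains:

```python
import pandas as pd

# Load the collected customer records (file and column names are hypothetical)
df = pd.read_csv("customers_combined.csv")

# Basic profiling checks: completeness and uniqueness per column
profile = pd.DataFrame({
    "missing_values": df.isna().sum(),
    "distinct_values": df.nunique(),
})
print(profile)

# Rough signal of potential duplicates on a few identifying attributes
print("Possible duplicates:", df.duplicated(subset=["name", "email"], keep=False).sum())
```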
- Perform data cleansing and standardization
Before we can match two records, it is important that their fields are in a comparable shape and format. For example, one record may have a single address field, while another record may have multiple fields that store the address, such as street name, street number, area, city, country, etc.
You must apply data cleansing and standardization techniques to parse a column into multiple fields, merge multiple columns into one, transform the format or pattern of data fields, fill in missing values, and so on.
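As a rough illustration, the sketch below parses a single free-form address field into separate components and normalizes a common abbreviation. It is a naive, comma-based parse for demonstration only; a real standardization step would rely on far more robust parsing rules or a dedicated address library:

```python
import re

def standardize_address(record: dict) -> dict:
    """Split a single free-form 'address' field into street, city, and country.
    Naive comma-based parsing, for illustration only."""
    parts = [p.strip() for p in record.get("address", "").split(",")]
    street = parts[0] if len(parts) > 0 else ""
    city = parts[1] if len(parts) > 1 else ""
    country = parts[2] if len(parts) > 2 else ""
    # Normalize casing and a common abbreviation so fields become comparable
    street = re.sub(r"\bSt\b\.?", "Street", street, flags=re.IGNORECASE).title()
    return {**record, "street": street, "city": city.title(), "country": country.upper()}

print(standardize_address({"name": "Beth", "address": "12 main st., springfield, us"}))
```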
- Match records to resolve entities
Now that you have your data together – clean and standardized – it is time to run data matching algorithms. In the absence of unique identifiers, complex data matching techniques are used because you may need to perform fuzzy matching in place of exact matching.
Fuzzy matching techniques output the likelihood of two fields being similar. For example, you may want to know if two customer records belong to the same customer; one record may show the customer's name as Elizabeth while the other shows Beth. An exact data matching technique may not be able to catch such discrepancies but a fuzzy matching technique can.
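A minimal sketch of this idea in Python: the standard library's difflib gives a character-level similarity ratio, and a small (hypothetical) nickname lookup handles cases like Elizabeth vs. Beth that pure string similarity would miss:

```python
from difflib import SequenceMatcher

# A tiny nickname lookup; a real system would use a much larger dictionary
NICKNAMES = {"beth": "elizabeth", "liz": "elizabeth", "bob": "robert"}

def name_similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score between two names."""
    a = NICKNAMES.get(a.lower(), a.lower())
    b = NICKNAMES.get(b.lower(), b.lower())
    return SequenceMatcher(None, a, b).ratio()

print(name_similarity("Elizabeth", "Beth"))       # 1.0 thanks to the nickname lookup
print(name_similarity("Elisabeth", "Elizabeth"))  # high, but not exact (~0.89)
```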
- Merge records to create a single source of truth
Once records are matched and match scores are computed, you can decide to either merge two or more records together or discard the matches as false positives. In the end, you are left with a list of reliable, information-rich records where each record is complete and refers to a single entity.
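For illustration, here is a simple survivorship rule sketched in Python: given a group of records that matched above your threshold, keep the first non-empty value for each field. Real merge-purge rules are usually more nuanced, for example preferring the most recent or most trusted source:

```python
def merge_records(records: list[dict]) -> dict:
    """Merge duplicate records into one 'golden' record by keeping, for each
    field, the first non-empty value (a simple survivorship rule)."""
    golden = {}
    for record in records:
        for field, value in record.items():
            if field not in golden or not golden[field]:
                golden[field] = value
    return golden

matched_group = [
    {"name": "Beth Smith", "email": "", "phone": "555-0100"},
    {"name": "Elizabeth Smith", "email": "beth@example.com", "phone": ""},
]
print(merge_records(matched_group))
# {'name': 'Beth Smith', 'email': 'beth@example.com', 'phone': '555-0100'}
```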
Designing a Comprehensive Framework for Entity Resolution
In the previous section, we looked at a simple way to resolve entities. But when your organization is constantly producing new records or updating existing ones, it gets more difficult to fix such data issues. In these cases, implementing an end-to-end data quality framework that consistently takes your data from assessment to execution and monitoring can be very useful.
Such a framework includes four stages, explained below:
- Assessment
In this stage, you want to assess the current state of your unresolved entities. For resolving customer entities, you may want answers to questions such as: How many datasets contain customer information? How many distinct customers do we have compared to the total number of customer records stored in our database?
These questions will help you to gauge the current state and plan what needs to be done to solve the issue.
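If your data lives in flat files or a warehouse extract, a quick estimate can be as simple as the sketch below. It uses a normalized email address as a stand-in for a unique customer identifier, which is an assumption that may not hold for your data:

```python
import pandas as pd

df = pd.read_csv("customers_combined.csv")  # hypothetical consolidated extract

total_records = len(df)
# Without a unique identifier, a normalized email is a rough proxy for a distinct customer
distinct_customers = df["email"].str.strip().str.lower().nunique()

print(f"{total_records} records for roughly {distinct_customers} customers")
print(f"Estimated duplication rate: {1 - distinct_customers / total_records:.0%}")
```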
- Design
During this stage, you need to design two things:
- The entity resolution process
This involves designing the four-step process explained above, but for your specific case. You need to select the data quality processes that are necessary to solve your data quality issues. Moreover, this step will help you decide which attributes to use while matching records, which data matching algorithms to use, and the merge-purge rules that will help achieve the single source of truth.
- Architectural consideration
In this stage, you also need to decide how this process will be implemented architecturally. For example, you may want to resolve entities before a record is stored in the database, or resolve them later on by querying data from the database and loading the results to a destination system.
- Implementation
This is the stage where the execution happens. You can resolve entities manually or use entity resolution software. Nowadays, there are vendors that offer self-service data quality tools that can identify and fix duplicates, as well as expose data quality APIs that can act as a data quality firewall between the data entry system and the destination database.
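The exact API depends on the vendor, but the idea behind such a firewall can be sketched in a few lines of Python, reusing the name_similarity function from the matching step above; the threshold and field names here are assumptions:

```python
MATCH_THRESHOLD = 0.85  # assumed cutoff; tune for your data

def quality_gate(new_record: dict, existing_records: list[dict]) -> dict:
    """Act as a simple 'firewall' in front of the database: flag the incoming
    record if it looks like a duplicate of something already stored."""
    for existing in existing_records:
        score = name_similarity(new_record["name"], existing["name"])
        if score >= MATCH_THRESHOLD:
            return {"action": "review", "possible_duplicate_of": existing, "score": score}
    return {"action": "insert"}
```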
- Monitoring
Once the execution is in place, it's time to monitor the results. This is usually done by creating weekly or monthly reports to ensure that no duplicates are present. If you do find multiple records for the same entity again in your dataset, it is best to iterate by going back to the assessment stage and fixing any gaps in the process.
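Such a report can be as lightweight as a scheduled script. The sketch below (again assuming a pandas-friendly extract and an email column, both hypothetical) flags entities that have picked up more than one record:

```python
import pandas as pd

df = pd.read_csv("customers_master.csv")  # hypothetical "single source of truth" table

# Group records that share a normalized email and report any group larger than one
dupes = (
    df.assign(email_norm=df["email"].str.strip().str.lower())
      .groupby("email_norm")
      .size()
      .loc[lambda s: s > 1]
)
print(f"{len(dupes)} entities have more than one record this week")
```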
Wrap Up
Companies that spend a considerable amount of time ensuring the quality of their data assets experience promising growth. They recognize the value of good data and encourage people to maintain good data quality so that it can be utilized to make the right decisions.
Having a central, single source of truth that is widely used across all operations is definitely a benefit you don’t want to deprive your business of.