Data Matching is the task of finding records that refer to the same entity. Normally, these records come from multiple data sets and have no common entity identifiers, but data matching techniques can also be used to detect duplicate records within a single database.
Identifying and matching records across multiple data sets is a very challenging task for many reasons. First of all, the records usually have no attribute that makes it straightforward to identify the ones that refer to the same entity, so it is necessary to analyze attributes that provide partial identification, such as names and dates of birth (for people) or titles and brands (for products). As a result, data matching algorithms are very sensitive to data quality, and the data being linked must be pre-processed to ensure a minimum quality standard, at least for the key identifying attributes.
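As a rough illustration of this kind of pre-processing, the sketch below normalizes a name before comparing it. The `normalize` and `similarity` helpers are hypothetical names introduced here, and the cleaning rules (lowercasing, removing accents and punctuation, collapsing whitespace) are only assumptions about what a minimum quality standard might involve. The comparison uses `difflib.SequenceMatcher` from the Python standard library, which is just one of many possible string similarity measures.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(value: str) -> str:
    """Lowercase, strip accents, replace punctuation with spaces, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", value)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in no_accents)
    return " ".join(cleaned.lower().split())


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two attribute values after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


if __name__ == "__main__":
    # Identical after normalization, so the score is 1.0.
    print(similarity("José Silva", "jose   SILVA"))
    # A one-letter difference still gives a high, but not perfect, score.
    print(similarity("John Smith", "Jon Smith"))
```

In practice the score would be compared against a threshold chosen for the data at hand, and several attributes would usually be combined rather than relying on a single one.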
Another factor that adds complexity to this problem is that data can change over time. For example, if two databases of personal information are being matched against each other, it is not rare to find cases where the same person has different addresses (since people occasionally move) or even different names (since people can get married or divorced).
Data matching problems generally have no training data available (that is, a data set of matches across the analyzed databases that are known to be valid), so it is not easy to approach the problem with a supervised learning algorithm, as would be possible in many machine learning applications.
The databases that are the subject of data matching analysis can be very large. In the worst case, each record of one database has to be compared with every record of the other. Because this approach has quadratic complexity, it can be extremely computationally expensive and hard to complete in a feasible time. To deal with this situation and make data matching scalable, some kind of indexing technique can be applied to reduce the number of record pairs that will be compared, but it is not always easy to design an index that removes most non-matches without also discarding true matches.
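One simple indexing idea is to group records by a cheap key and compare only records that share that key. The sketch below assumes records are dictionaries with hypothetical `surname` and `birth_date` fields, and the key itself (first letter of the surname plus the birth year) is chosen purely for illustration.

```python
from collections import defaultdict


def blocking_key(record: dict) -> str:
    """Hypothetical key: first letter of the surname plus the birth year."""
    return record["surname"][:1].lower() + record["birth_date"][:4]


def candidate_pairs(left: list, right: list):
    """Yield only the pairs of records that fall into the same block."""
    blocks = defaultdict(list)
    for rec in right:
        blocks[blocking_key(rec)].append(rec)
    for rec in left:
        for other in blocks.get(blocking_key(rec), []):
            yield rec, other


if __name__ == "__main__":
    left = [
        {"id": 1, "surname": "Smith", "birth_date": "1980-01-02"},
        {"id": 2, "surname": "Jones", "birth_date": "1975-05-06"},
    ]
    right = [
        {"id": "a", "surname": "Smyth", "birth_date": "1980-01-02"},
        {"id": "b", "surname": "Brown", "birth_date": "1980-03-03"},
        {"id": "c", "surname": "Jones", "birth_date": "1975-05-06"},
    ]
    # Only 2 of the 6 possible pairs are produced: (1, a) and (2, c).
    for l, r in candidate_pairs(left, right):
        print(l["id"], r["id"])
```

The trade-off described above is visible here: a typo in the first letter of a surname, or a wrong birth year, would place two records of the same person in different blocks, and that true match would never be compared.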
Data Matching examples
All of these factors, to name only a few, make data matching very challenging. Nevertheless, several application areas now recognize data matching as an important tool. The list below contains a few relevant examples.
- E-Commerce: this area has grown dramatically in the past few years, and comparison shopping platforms commonly have to link products from multiple online stores that describe them differently. For example, Store A can list a product as Notebook - X35 8GB RAM 13" LED while Store B refers to the same notebook as 13" X35 Notebook 8GB RAM 500GB HD.
- Business mailing lists: many businesses have large customer mailing lists they use to advertise their products or services. These lists often contain dirty data and multiple records for the same customer. Data matching can be used to merge such records, which prevents the same offer from being sent more than once to the same customer and helps to keep the database organized and up to date.
- Computing: by detecting duplicate chunks of data, deduplication algorithms can reduce storage utilization and network data transfer.
- Healthcare: medical records of patients can be used to study drug effects and reactions to treatments. Obviously, privacy is a major concern and therefore sometimes it is necessary to discard attributes that explicitly identify patients and perform matches using only anonymous data.
- Online fraud detection: the number of people conducting online financial transactions increases year after year and, as a result, criminals are constantly trying to use someone else's information, such as bank accounts or credit cards. Identifying these people is particularly difficult because the incorrect data are not accidental: criminals deliberately provide wrong information in order not to be identified.
- National Census: the government collects a wide range of information about the population, most often through different agencies that use different standards and store the data in separate databases. Matching up these data enables the government to produce statistical reports and better understand many aspects of the country.
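To make the E-Commerce example above more concrete, the sketch below compares the two product descriptions as sets of tokens using the Jaccard coefficient (size of the intersection divided by size of the union). The `tokens` and `jaccard` helpers are hypothetical names, and the tokenization rule (lowercase runs of letters and digits) is an assumption; a real system would likely weight tokens such as model numbers more heavily.

```python
import re


def tokens(title: str) -> set:
    """Split a product title into a set of lowercase alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def jaccard(a: str, b: str) -> float:
    """Jaccard coefficient between the token sets of two titles."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    store_a = 'Notebook - X35 8GB RAM 13" LED'
    store_b = '13" X35 Notebook 8GB RAM 500GB HD'
    # 5 shared tokens out of 8 distinct tokens overall.
    print(jaccard(store_a, store_b))  # 0.625
```

Because the comparison ignores token order, the different word order used by the two stores does not lower the score; only the tokens that appear in one description and not the other (LED, 500GB, HD) do.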
This was a brief overview of this large subject, its major challenges and some examples of application areas. There is much more that could be said, and each real-world data matching problem has its own complications, because each type of data comes from different sources and has different formats, ages, errors and inconsistencies.