Data Fingerprinting

Accelerating Data Mapping and Unification using Fingerprints

Data Fingerprinting

Data fingerprinting treats the values in a column as a signature, or fingerprint. By identifying the data values in one column and determining which other columns share the same fingerprint, we can map columns of data to one another.

In this process, column values are compared across different tables and a hash code is generated for each column. Whatever the column is labelled in each table, if the columns share the same data, an algorithm generates a score from 0 to 1 that indicates how much of the data matches. The columns are then mapped to one another and their data is merged.
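The hashing and scoring steps can be sketched as follows. This is a minimal illustration, not the algorithm the text refers to, which is not specified: it assumes one SHA-256 hash per distinct value and uses Jaccard similarity as the 0 to 1 score. The function names `fingerprint` and `match_score` and the sample values are hypothetical.

```python
import hashlib


def fingerprint(values):
    """Return a set of short hashes, one per distinct normalized value."""
    hashes = set()
    for value in values:
        # Normalization (trim, lowercase) is an assumption of this sketch.
        normalized = str(value).strip().lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        hashes.add(digest[:16])
    return hashes


def match_score(left, right):
    """Jaccard similarity of two fingerprints: 0.0 (disjoint) to 1.0 (identical)."""
    if not left and not right:
        return 0.0
    return len(left & right) / len(left | right)


# Two hypothetical columns that share three of five distinct values.
a = fingerprint(["A100", "A101", "A102", "A103"])
b = fingerprint(["a101", "A102 ", "A103", "A104"])
print(match_score(a, b))
```

Jaccard similarity is only one possible choice; a containment measure (shared values divided by the size of the smaller column) would score a column that is a subset of another more generously.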

For example, suppose different tables label a column “col”, “column”, and “col1”, but the data in those columns is the same. The data is checked, a hash is generated for each column, a score between 0 and 1 is produced, and the columns are then mapped to one another and merged.
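The example above might look like this in code. It is a sketch under stated assumptions: the table names, the sample values, the extra "city" column, the `THRESHOLD` cutoff, and the Jaccard-based `score` function are all hypothetical and chosen only for illustration.

```python
import hashlib
from itertools import combinations


def fingerprint(values):
    """Hash every distinct value in a column (assumed fingerprint form)."""
    return {hashlib.sha256(str(v).encode("utf-8")).hexdigest() for v in values}


def score(fp_a, fp_b):
    """Share of hashed values the two columns have in common, from 0 to 1."""
    union = fp_a | fp_b
    return len(fp_a & fp_b) / len(union) if union else 0.0


# Hypothetical tables: the same data under three different column labels,
# plus one unrelated column.
columns = {
    ("orders", "col"): [101, 102, 103, 104],
    ("invoices", "column"): [104, 103, 102, 101],
    ("shipments", "col1"): [101, 102, 103, 104],
    ("shipments", "city"): ["Oslo", "Lima", "Pune", "Kyiv"],
}

THRESHOLD = 0.8  # assumed cutoff for treating two columns as the same

fingerprints = {key: fingerprint(vals) for key, vals in columns.items()}

# Keep every pair of columns whose score reaches the threshold.
mapping = []
for a, b in combinations(columns, 2):
    s = score(fingerprints[a], fingerprints[b])
    if s >= THRESHOLD:
        mapping.append((a, b, s))

# Merge the matched columns into one list of distinct values.
matched = {key for a, b, _ in mapping for key in (a, b)}
merged = sorted({v for key in matched for v in columns[key]})

for a, b, s in mapping:
    print(a, "<->", b, s)
print("merged:", merged)
```

The column names never enter the comparison, so “col”, “column”, and “col1” are mapped to one another while the unrelated column is left out.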

Machine Learning to accelerate fingerprinting

Even if the condition comes back positive, the machine treats the pair as not matching: # id and Sid have similar fingerprints, whereas Study ID and Sid do not pass the threshold test. In this way the machine learns by itself and becomes more intelligent.
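The threshold test itself can be illustrated with a short sketch. It does not show the learning step, only the pass/fail decision. The sample values for # id, Sid, and Study ID, the `THRESHOLD` value, and the `overlap_score` function are assumptions made for illustration, and the sketch compares raw values rather than hashes to stay brief.

```python
def overlap_score(a, b):
    """Fraction of distinct values shared by two columns, from 0 to 1."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


THRESHOLD = 0.7  # assumed cutoff; the source does not state a value

# Hypothetical sample data for the three columns named in the text.
sid = ["S001", "S002", "S003", "S004"]
candidates = {
    "# id": ["S001", "S002", "S003", "S004", "S005"],
    "Study ID": ["ST-9", "ST-10", "ST-11", "ST-12"],
}

scores = {name: overlap_score(sid, vals) for name, vals in candidates.items()}
passes = {name: s >= THRESHOLD for name, s in scores.items()}

for name in candidates:
    print(name, "vs Sid:", scores[name], "pass" if passes[name] else "fail")
```

With these sample values, "Study ID" fails the test against "Sid" even though the two names look related, while "# id" passes because its values overlap.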

Why is data fingerprinting useful?

Because the comparison is made on column values rather than column names, fingerprinting can map and merge columns even when they are labelled differently across tables, for example “col”, “column”, and “col1”. The score from 0 to 1 shows how much of the data matches, so the mapping rests on the data itself.