Data Fingerprinting

Accelerating Data Mapping and Unification using Fingerprints

In big data, we try to standardise column names. The usual approach is fuzzy matching applied to the metadata, that is, to the column names themselves. However, we cannot rely entirely on the metadata; we also have to drill down into the data. To standardise column names and unify the data into the right columns, data fingerprinting becomes significantly important.
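
Metadata-only matching, for instance, can be as simple as comparing the name strings, which shows both why it is quick and why it is not enough on its own. The names and cutoff below are illustrative examples, not part of any specific system:

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Return a 0-1 similarity score between two column names (metadata only)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Metadata alone can be enough...
print(name_similarity("subject_id", "subject id"))   # ~0.9, clearly the same field
# ...but it can also mislead: these names look unrelated even if the data matches.
print(name_similarity("sid", "study_identifier"))    # ~0.3, a likely false negative
```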

In data fingerprinting, we compute a hash code over the values of each column, typically over a sample of them. We then compare a column's hash code against other columns' hash codes, and pairs whose similarity clears a threshold are treated as matching columns. Once the matching columns are found, the data is unified into the respective columns.
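
A minimal sketch of one way this can look in Python, assuming we fingerprint a sample of values with MD5 and compare fingerprints by their overlap; the sample size and hashing scheme here are illustrative choices, not a prescribed implementation:

```python
import hashlib

def fingerprint(values, sample_size=1000):
    """Hash a sample of normalised column values into a set of digests."""
    sample = list(values)[:sample_size]
    return {hashlib.md5(str(v).strip().lower().encode()).hexdigest() for v in sample}

def fingerprint_similarity(fp_a, fp_b):
    """Jaccard overlap between two fingerprints, in the range 0 to 1."""
    if not fp_a or not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)
```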

For instance, suppose different tables label the same field as “col”, “column”, “col1” and so on, but the data held in these columns is the same. By comparing the hash values, the system recognises that the columns are the same: the comparison yields a matching score between 0 and 1, and the data is then mapped by merging the columns.
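
Reusing the fingerprint helpers sketched above on made-up sample data, the overlap score directly reflects how much of the data matches, whatever the labels say:

```python
table_a = {"col":    ["S001", "S002", "S003"]}
table_b = {"column": ["S001", "S002", "S003"]}
table_c = {"col1":   ["S002", "S003", "S004"]}

fp_a = fingerprint(table_a["col"])
fp_b = fingerprint(table_b["column"])
fp_c = fingerprint(table_c["col1"])

print(fingerprint_similarity(fp_a, fp_b))  # 1.0 -> identical data, columns can be merged
print(fingerprint_similarity(fp_a, fp_c))  # 0.5 -> only partial overlap
```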

Machine Learning to accelerate fingerprinting

Even if the metadata comparison comes back positive, the machine may still treat two columns as not matching. For example, “#id” and “Sid” have similar fingerprints, whereas “Study ID” and “Sid” do not pass the threshold test, so only the first pair is matched. The machine learns from these outcomes by itself and becomes more accurate over time.
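
One way to sketch this behaviour, with made-up scores, an assumed 0.7 cutoff, and the assumption that every decision is recorded so the matcher has labelled examples to learn from later:

```python
# Each decision becomes a labelled example the matcher can later learn from.
decisions = []

def decide(pair, name_score, fingerprint_score, threshold=0.7):
    """The fingerprint test has the final say, even when the name match looks positive."""
    matched = fingerprint_score >= threshold
    decisions.append({"pair": pair, "name_score": name_score,
                      "fingerprint_score": fingerprint_score, "matched": matched})
    return matched

# "#id" vs "Sid": the names look unrelated, but the fingerprints agree -> match.
print(decide(("#id", "Sid"), name_score=0.2, fingerprint_score=0.9))       # True
# "Study ID" vs "Sid": the names look related, but the data fails the test -> no match.
print(decide(("Study ID", "Sid"), name_score=0.6, fingerprint_score=0.4))  # False
```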

Why is data fingerprinting useful?

In this process, the column values are compared across different tables and a hash code is generated for each column. Irrespective of how a column is labelled in different tables, if two columns share the same data, an algorithm generates a score between 0 and 1 indicating how much of the data matches; the data is then mapped and merged accordingly.
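
A minimal sketch of that final mapping step, assuming pandas is available and assuming the matching step has already tied the three labels to one canonical name; the mapping and the frames are illustrative:

```python
import pandas as pd

# Assumed output of the matching step: all three labels point to one canonical name.
canonical = {"col": "subject_id", "column": "subject_id", "col1": "subject_id"}

frames = [
    pd.DataFrame({"col":    ["S001", "S002"]}),
    pd.DataFrame({"column": ["S003", "S004"]}),
    pd.DataFrame({"col1":   ["S005", "S006"]}),
]

# Rename each matched column to its canonical name, then merge the tables.
unified = pd.concat([f.rename(columns=canonical) for f in frames], ignore_index=True)
print(unified)  # a single "subject_id" column holding the data from all three tables
```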
