Question

Sometimes you need to check whether a data set contains duplicate records. It is possible to compare records column by column to find rows with identical values, but this approach can be slow. What is the optimal way to perform the comparison?

Answer

Use hash keys to perform the comparison quickly and efficiently!

What is a hash key?

A hash key is a small, fixed-size value that a hash function produces to represent a larger piece of data.
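
For illustration, here is a plain-Python sketch using the standard hashlib module (the article itself uses PySpark; this only shows the idea): no matter how large the input is, the digest always has the same fixed size.

import hashlib

# SHA-256 maps input of any size to a fixed 256-bit digest (64 hex characters).
print(hashlib.sha256(b"short").hexdigest())
print(hashlib.sha256(b"a much, much longer record payload ...").hexdigest())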

Why use a hash key?

With a hash, you read the data once and compute a short 128-bit or 256-bit value for each record; comparing those short values is far cheaper than comparing every column of every record.
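
For example, once each record carries a hash key, comparing two snapshots of a table reduces to comparing the hash columns. A minimal sketch, assuming two hypothetical DataFrames df_old_sha and df_new_sha that already carry a cols_sha2 column built as shown below:

# Rows in the new snapshot whose hash does not appear in the old snapshot
# are new or changed records; matching hashes mean the full records agree.
changed = df_new_sha.join(df_old_sha, on="cols_sha2", how="left_anti")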

How to create a hash key

Use pyspark.sql.functions.concat_ws() to concatenate your columns and pyspark.sql.functions.sha2() to get the SHA256 hash.

Add a hash key column: 

from pyspark.sql.functions import sha2, concat_ws

# Concatenate all columns with a "||" delimiter, then hash the result to 256 bits.
df_sha = df.withColumn("cols_sha2", sha2(concat_ws("||", *df.columns), 256))
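
One caveat: concat_ws() skips NULL values, so rows such as ("a", NULL) and (NULL, "a") can concatenate to the same string and produce the same hash. A hedged variant that keeps NULL positions distinguishable by substituting an explicit token (the "<NULL>" marker is just an illustrative choice):

from pyspark.sql.functions import coalesce, col, lit, sha2, concat_ws

# Replace NULLs with an explicit token so NULL positions remain visible.
safe_cols = [coalesce(col(c).cast("string"), lit("<NULL>")) for c in df.columns]
df_sha = df.withColumn("cols_sha2", sha2(concat_ws("||", *safe_cols), 256))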

Find records with the same hash

To determine whether any records share the same hash key, group by the hash column and keep the groups that contain more than one row:

df_dups = df_sha.groupBy("cols_sha2").count().filter("count > 1")
incorta.show(df_dups)
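
If the goal is to remove the duplicates rather than just list them, the same hash column works with PySpark's dropDuplicates(); a minimal sketch:

# Keep one row per hash key and discard the rest.
df_dedup = df_sha.dropDuplicates(["cols_sha2"])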

