Sometimes you need to verify whether a data set contains duplicate records. You can compare records column by column to find rows with identical values, but this approach is slow and cumbersome. What is a better way to perform the comparison?
Use hash keys to make the comparison quickly and efficiently!
What is a hash key?
A hash key is a small, fixed-size value that a hash function produces to represent a larger piece of data.
Why use a hash key?
With hashing, you read the data once and create a short 128- or 256-bit digest for each record; comparing these fixed-size values is much faster than comparing every column of every record.
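As a quick illustration of the idea (this sketch uses Python's standard hashlib module rather than Spark, and the sample record strings are made up):

import hashlib

# Two records flattened to strings; they differ only in the last column
record_a = "1001||Alice||2023-05-01"
record_b = "1001||Alice||2023-05-02"

# SHA-256 always produces a fixed-size 256-bit digest (64 hex characters),
# no matter how long the input record is
print(hashlib.sha256(record_a.encode()).hexdigest())
print(hashlib.sha256(record_b.encode()).hexdigest())

# Equal inputs always hash to equal digests, so a single string comparison
# stands in for a column-by-column comparison
assert (hashlib.sha256(record_a.encode()).hexdigest()
        == hashlib.sha256(record_a.encode()).hexdigest())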
Use pyspark.sql.functions.concat_ws() to concatenate your columns and pyspark.sql.functions.sha2() to compute the SHA-256 hash.
Add a hash key column:
from pyspark.sql.functions import sha2, concat_ws

# Concatenate every column with a "||" separator, then hash the result.
# sha2(..., 256) returns the SHA-256 digest as a 64-character hex string.
df_sha = df.withColumn("cols_sha2", sha2(concat_ws("||", *df.columns), 256))
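One caveat worth noting: concat_ws() skips null values, so two rows that differ only in which column is null (for example, ("a", null) versus (null, "a")) concatenate to the same string and therefore the same hash. A minimal way to guard against that is to coalesce nulls to a sentinel string first; the sentinel "~NULL~" below is just an illustrative choice, not anything Spark requires:

from pyspark.sql.functions import sha2, concat_ws, coalesce, col, lit

# Replace nulls with an explicit sentinel before concatenating, so rows
# that differ only in null placement no longer collapse to the same string
null_safe_cols = [coalesce(col(c).cast("string"), lit("~NULL~")) for c in df.columns]
df_sha = df.withColumn("cols_sha2", sha2(concat_ws("||", *null_safe_cols), 256))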
To determine whether any duplicate records exist, group by the hash key and keep the keys that occur more than once:
# Any hash key that appears more than once marks a duplicate record
df_sha.groupBy("cols_sha2").count().filter("count > 1").show()
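If you want to remove the duplicates rather than just find them, the same hash key can drive deduplication. A short sketch, assuming it is acceptable to keep an arbitrary row from each duplicate group:

# Keep one row per hash key, then drop the helper column
df_dedup = df_sha.dropDuplicates(["cols_sha2"]).drop("cols_sha2")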