Sometimes you need to verify whether a data set contains duplicate records. You can compare records column by column to find rows with identical values, but this approach is slow and cumbersome. What is a better way to perform the comparison?
Use hash keys to make the comparison quickly and efficiently!
What is a hash key?
A hash key is a small, fixed-size value that a hash function produces to represent a larger piece of data.
Why use a hash key?
With hashing, you read the data once and create a short 128- or 256-bit digest for each record; comparing these fixed-size values is much faster than comparing every column of every record.
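As a quick illustration of the idea (this sketch uses Python's standard hashlib module rather than Spark, and the sample record strings are made up):

import hashlib

# Two records flattened to strings; they differ only in the last column
record_a = "1001||Alice||2023-05-01"
record_b = "1001||Alice||2023-05-02"

# SHA-256 always produces a fixed-size 256-bit digest (64 hex characters),
# no matter how long the input record is
print(hashlib.sha256(record_a.encode()).hexdigest())
print(hashlib.sha256(record_b.encode()).hexdigest())

# Equal inputs always hash to equal digests, so a single string comparison
# stands in for a column-by-column comparison
assert (hashlib.sha256(record_a.encode()).hexdigest()
        == hashlib.sha256(record_a.encode()).hexdigest())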
Use pyspark.sql.functions.concat_ws() to concatenate your columns and pyspark.sql.functions.sha2() to compute the SHA-256 hash.
Add a hash key column:
from pyspark.sql.functions import sha2, concat_ws

# Concatenate every column with a "||" separator, then hash the result.
# sha2(..., 256) returns the SHA-256 digest as a 64-character hex string.
df_sha = df.withColumn("cols_sha2", sha2(concat_ws("||", *df.columns), 256))
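One caveat worth noting: concat_ws() skips null values, so two rows that differ only in which column is null (for example, ("a", null) versus (null, "a")) concatenate to the same string and therefore the same hash. A minimal way to guard against that is to coalesce nulls to a sentinel string first; the sentinel "~NULL~" below is just an illustrative choice, not anything Spark requires:

from pyspark.sql.functions import sha2, concat_ws, coalesce, col, lit

# Replace nulls with an explicit sentinel before concatenating, so rows
# that differ only in null placement no longer collapse to the same string
null_safe_cols = [coalesce(col(c).cast("string"), lit("~NULL~")) for c in df.columns]
df_sha = df.withColumn("cols_sha2", sha2(concat_ws("||", *null_safe_cols), 256))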
To determine whether any duplicate records exist, group by the hash key and keep the keys that occur more than once:
# Any hash key that appears more than once marks a duplicate record
df_sha.groupBy("cols_sha2").count().filter("count > 1").show()
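If you want to remove the duplicates rather than just find them, the same hash key can drive deduplication. A short sketch, assuming it is acceptable to keep an arbitrary row from each duplicate group:

# Keep one row per hash key, then drop the helper column
df_dedup = df_sha.dropDuplicates(["cols_sha2"]).drop("cols_sha2")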