on 07-05-2023 10:07 AM - edited on 07-05-2023 10:08 AM by Tristan
Sometimes you need to check whether a data set contains duplicate records. You could compare records column by column to find rows with the same values, but that approach can be slow. What is a faster way to perform the comparison?
Use hash keys to make the comparison quickly and efficiently!
What is a hash key?
A hash key is a small, fixed-size value that represents a larger piece of data. It is produced by a hash function, which maps input of any size to an output of a fixed length.
Why use a hash key?
With a hash, you read each record once and compute a short 128-bit or 256-bit digest for it. Comparing those fixed-size digests is much faster than comparing every column of every record.
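The idea can be sketched in plain Python with the standard hashlib module (the `record_hash` helper and the `"||"` separator are illustrative, not part of any library):

```python
import hashlib

def record_hash(record):
    """Join a record's fields with a separator and return a SHA-256 hex digest."""
    joined = "||".join(str(field) for field in record)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

a = ("Alice", "NY", 30)
b = ("Alice", "NY", 30)
c = ("Bob", "LA", 25)

# Identical records produce identical hash keys; different records
# produce (with overwhelming probability) different keys.
assert record_hash(a) == record_hash(b)
assert record_hash(a) != record_hash(c)
```

Each digest is 64 hex characters (256 bits) no matter how wide the record is, so one string comparison replaces a comparison of every column.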
How to create a hash key
Use pyspark.sql.functions.concat_ws() to concatenate your columns into a single string, then pass that string to pyspark.sql.functions.sha2() with a bit length of 256 to get the SHA-256 hash.