.png) dylanwan
		
			dylanwan
		
		
		
		
		
		
		
		
	
			
		
		
			
					
		
			on 
    
	
		
		
		07-05-2023
	
		
		10:07 AM
	
	
	
	
	
	
	
	
	
	
	
	
	
	
 - edited on 
    
	
		
		
		07-05-2023
	
		
		10:08 AM
	
	
	
	
	
	
	
	
	
	
	
	
	
	
 by 
				
		.png) Tristan
		
			Tristan
		
		
		
		
		
		
		
		
	
			
		
Sometimes, you need to verify if there are duplicate records in a data set. It is possible to compare column by column to find records that have the same values, but this approach could be slow. What is the optimal way to perform the comparison?
Use hash keys to make the comparison quickly and efficiently!
What is a hash key?
A hash key is a small value that is used to represent a large piece of data in a hash system.
Why use a hash key?
With a hash, you read each file once and create a short 128-bit or 256-bit string for each record that can then be used for comparisons.
Use pyspark.sql.functions.concat_ws() to concatenate your columns and pyspark.sql.functions.sha2() to get the SHA256 hash.
Add a hash key column:
from pyspark.sql.functions import sha2, concat_ws
df_sha = df.withColumn("cols_sha2", sha2(concat_ws("||", *df.columns), 256))To determine if there are any duplicate records, use:
incorta.show if distinct("type2_cols_sha2") > 1