Fuzzy Text Matching in PySpark UDF

suxinji · ‎04-18-2022

Overview

This recipe will teach you how to use the FuzzyWuzzy package to match strings in Incorta Notebook. String matching is the most common problem in business. For example, if you have a customer master in your enterprise apps and also have a customer roster from a third-party system, you might need to match the customer data from the third-party system with the customer master. Both tables have the address fields. You can find the closest matching records from the customer master to determine if those customers are new or existing customers based on their addresses.

FuzzyWuzzy is a Python package used for string matching, which can compute the similarity score between two strings.

Solution

First, you need to install the FuzzyWuzzy package in your environment.

# install fuzzywuzzy using pip
pip install fuzzywuzzy

#Import Lib
from fuzzywuzzy import fuzz
from pyspark.sql.functions import col
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from pyspark.sql.functions import current_date

#Read Files
df1 = read("FUZZY.Roster")
df2 = read("FUZZY.Customer")

# Get Column Names in Data Frame
df1.columns
df2.columns

# Create Temp Views
df1_temp = df1.createOrReplaceTempView("df1")
df2_temp = df2.createOrReplaceTempView("df2")

# Join
df = spark.sql("select df1.Gpoid as Matched_id, df1.Address1, df2.* from df2 CROSS JOIN df1 ON df1.State_Province = df2.Region_State_Province_County AND df1.Postal_Code = df2.Postal_Code")
incorta.show(df)

# Apply Fuzzy Function
def matchstring(s1,s2):
    return fuzz.token_sort_ratio(s1,s2)
MatchUDF = udf(matchstring,StringType())

# Add columns and filter score more than 80
df = df.withColumn("similarity_score", MatchUDF(col("Address1"),col("Street_and_House_Number"))).withColumn("run_date", current_date())
df = df.filter(df.similarity_score > 80)
df = df.drop(col("Address1"))
incorta.show(df)

# Save Data Frame
save(df)

Screen Shot 2022-04-06 at 1.31.24 PM.png

Screen Shot 2022-04-06 at 2_mosaic.png

Screen Shot 2022-04-06 at 3_mosaic.png