01-04-2023 01:29 PM - edited 01-04-2023 01:39 PM
When using a tag (word) cloud, you may find a whole text field being applied to the insight. That can work when the text field contains only one or two words, but longer-form text usually requires extra preparation.
In the above insight, each row in the dummy customer complaint data is likely unique. Therefore, we want to transform our data from full reviews into the most frequently mentioned words to find any "themes."
Before reading this article, a basic understanding of creating materialized views will help you follow along.
Attached is the sample data. Please feel free to use it to follow along.
Once you have imported the attached file into a schema, add a materialized view. Instead of reading the whole table, I've queried the Consumer_complaint_narrative column directly from Complaint_Data and limited the result to 1000 reviews.
df = incorta_sql("""
    SELECT Community_Data.Complaint_Data.Consumer_complaint_narrative
    FROM Community_Data.Complaint_Data
    WHERE Community_Data.Complaint_Data.Consumer_complaint_narrative IS NOT NULL
    LIMIT 1000
""")
Next, we will go through each review and break it apart so that every word lands on its own row.
import pyspark.sql.functions as f

# Split each narrative on spaces, then explode the array so each word gets its own row
tokenize = df.withColumn('word', f.explode(f.split(f.col('Consumer_complaint_narrative'), ' ')))
As a result, the data will appear as follows:
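If you'd like to sanity-check this step without a Spark environment, here is a minimal plain-Python sketch of the same split-and-explode logic. The sample reviews are hypothetical, not taken from the attached data:

```python
# Hypothetical sample narratives, for illustration only
reviews = [
    "charged twice for one payment",
    "payment never posted",
]

# f.split(...) breaks each narrative on spaces; f.explode(...) then emits
# one (review, word) row per word in the array
rows = [(review, word) for review in reviews for word in review.split(' ')]

for review, word in rows:
    print(word)
```

Each original row fans out into as many rows as it has words, which is exactly what the exploded `word` column looks like in the materialized view.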
Next, let's summarize to see how many mentions of each word we have.
# Count occurrences of each word, most frequent first
word_count = tokenize.groupBy("word").count().sort('count', ascending=False)
Result:
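The groupBy/count/sort pattern is just a word-frequency count. A minimal plain-Python sketch of the same idea, using a hypothetical exploded word column:

```python
from collections import Counter

# Hypothetical exploded word column, one entry per row
words = ["payment", "charged", "payment", "fee", "payment", "charged"]

# groupBy("word").count().sort('count', ascending=False) boils down to a
# frequency count sorted from most to least common
word_count = Counter(words).most_common()
print(word_count)  # [('payment', 3), ('charged', 2), ('fee', 1)]
```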
Chances are that once you view your data, you will want to make some edits: filtering out punctuation, removing uninsightful words ("I," "a"), or perhaps supplying a list of specific words you want to find. In this instance, I've removed some words that give no insight into the complaints. More often than not, this step can get fairly complex when the data has several quality issues.
# Filter out rows whose word appears in this stop list
li=["the","to","I","and","have", "for", "is", "this", "me", "XXXX", "a", "my", "that", "of", "was", "in", "on", "XXXX,", "it", "as", "are", "by", "or", "been", "would", "(", ")","", "XX/XX/XXXX", "XXXX."]
filtered_word_count = word_count.filter(~word_count.word.isin(li))
save(filtered_word_count)
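When the quality issues pile up (mixed case, trailing punctuation, redaction tokens like "XXXX"), normalizing each word before counting keeps the stop list short. Here is a minimal plain-Python sketch of that idea, with hypothetical tokens; in the materialized view itself, the equivalent normalization could be done with `f.lower` and `f.regexp_replace` before the groupBy:

```python
import re
from collections import Counter

# Hypothetical raw tokens with case and punctuation noise
raw_words = ["Payment,", "payment.", "PAYMENT", "fee!", "XXXX,", "the"]

stop_words = {"the", "xxxx"}

def normalize(word):
    # Lowercase, then strip anything that isn't a letter
    return re.sub(r'[^a-z]', '', word.lower())

cleaned = [w for w in (normalize(word) for word in raw_words)
           if w and w not in stop_words]

print(Counter(cleaned).most_common())  # [('payment', 3), ('fee', 1)]
```

Normalizing first means "Payment," "payment.", and "PAYMENT" all collapse into one entry, so the stop list only needs one lowercase form per word.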
Even after the materialized view is loaded and saved, you still might have too much data to build a meaningful insight. For example, 1000 complaints will still have ~10,000 words if each complaint is only ten words!
In this case, limit this insight's Max # of rows. Go to the insight settings and set it to 100.
Much better!!