cancel
Showing results for 
Search instead for 
Did you mean: 
JoeM
Community Manager
Community Manager

Introduction

When using a tag (word) cloud, you may find a whole text field being applied to the insight. That may be useful in some cases if the text fields contain one or two words, but extra preparation could be required when working with longer-form text.

JoeM_0-1672860091446.png

In the above insight, each row in the dummy fake customer complaint data is likely unique. Therefore, we want to transform our data from reviews to the most frequently mentioned words to find any "themes."

What you need to know before reading this article

Before reading this article, a basic understanding of creating materialized views will help you follow along. 

Let's Go

Attached is the sample data. Please feel free to use it to follow along

Step 1: Import the field for text analysis

Once importing the attached file to a schema, add a materialized view. Instead of reading the whole table, I've queried directly against the Complaint_Data and limited it to 1000 reviews. 

 

df = incorta_sql("SELECT Community_Data.Complaint_Data.Consumer_complaint_narrative FROM Community_Data.Complaint_Data WHERE Community_Data.Complaint_Data.Consumer_complaint_narrative IS NOT NULL LIMIT 1000")

 

Step 2: Tokenize

Next, we will go through each row and break each word into its own row.

 

import pyspark.sql.functions as f
tokenize = df.withColumn('word', f.explode(f.split(f.col('Consumer_complaint_narrative'), ' ')))

 

 As a result,  the data will appear as follows:

JoeM_1-1672860824767.png

Step 3: Count Words

Next, let's summarize to see how many mentions of each word we have.

 

word_count = tokenize.groupBy("word").count().sort('count', ascending=False)

 

Result:

JoeM_2-1672860964091.png

Step 4: Preparation and Discovery

Chances are that once you view your data, you will want to include some edits. Maybe filter punctuations, remove uninsightful words ("I," "a"), or perhaps provide a list of words you want to find. In this instance, I've removed some words that give no insight into the complaints. More often than not, this process could be pretty complex if we notice several data quality issues. 

 

#Filter IS NOT IN List values
li=["the","to","I","and","have", "for", "is", "this", "me", "XXXX", "a", "my", "that", "of", "was", "in", "on", "XXXX,", "it", "as", "are", "by", "or", "been", "would", "(", ")","", "XX/XX/XXXX", "XXXX."]
filtered_word_count = word_count.filter(~word_count.word.isin(li))

 

Step 5: Save the MV

 

save(filtered_word_count)

 

Step 6: Play around with the Tag Cloud Insight

Even after the materialized view is loaded and saved, you still might have too much data to build a meaningful insight. For example, 1000 complaints will still have ~10,000 words if each complaint is only ten words!

Not helpfulNot helpful

In this case, limit this insight's Max # of rows. Go to the insight settings and set it to 100.

JoeM_4-1672861505998.png

Much better!!

Related Materials

Best Practices Index
Best Practices

Just here to browse knowledge? This might help!

Contributors
Version history
Last update:
‎01-04-2023 01:39 PM
Updated by: