01-04-2023 01:29 PM - edited 01-04-2023 01:39 PM
When using a tag (word) cloud, you may find a whole text field being applied to the insight. That can work when the text field contains only one or two words, but longer-form text usually requires extra preparation.
In the above insight, each row in the dummy customer complaint data is likely unique. Therefore, we want to transform our data from full reviews into the most frequently mentioned words to find any "themes."
Before reading this article, a basic understanding of creating materialized views will help you follow along.
Attached is the sample data. Please feel free to use it to follow along.
Once you have imported the attached file into a schema, add a materialized view. Instead of reading the whole table, I've queried the Consumer_complaint_narrative column directly from Complaint_Data and limited the result to 1000 reviews.
df = incorta_sql("""
    SELECT Community_Data.Complaint_Data.Consumer_complaint_narrative
    FROM Community_Data.Complaint_Data
    WHERE Community_Data.Complaint_Data.Consumer_complaint_narrative IS NOT NULL
    LIMIT 1000
""")
Next, we will go through each review and break it apart so that every word lands on its own row.
import pyspark.sql.functions as f

# Split each narrative on spaces, then explode the array so each word gets its own row
tokenize = df.withColumn('word', f.explode(f.split(f.col('Consumer_complaint_narrative'), ' ')))
As a result, the data will appear as follows:
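If you'd like to sanity-check this step without a Spark environment, here is a minimal plain-Python sketch of the same split-and-explode logic. The sample reviews are hypothetical, not taken from the attached data:

```python
# Hypothetical sample narratives, for illustration only
reviews = [
    "charged twice for one payment",
    "payment never posted",
]

# f.split(...) breaks each narrative on spaces; f.explode(...) then emits
# one (review, word) row per word in the array
rows = [(review, word) for review in reviews for word in review.split(' ')]

for review, word in rows:
    print(word)
```

Each original row fans out into as many rows as it has words, which is exactly what the exploded `word` column looks like in the materialized view.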
Next, let's summarize to see how many mentions of each word we have.
# Count occurrences of each word, most frequent first
word_count = tokenize.groupBy("word").count().sort('count', ascending=False)
Result:
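The groupBy/count/sort pattern is just a word-frequency count. A minimal plain-Python sketch of the same idea, using a hypothetical exploded word column:

```python
from collections import Counter

# Hypothetical exploded word column, one entry per row
words = ["payment", "charged", "payment", "fee", "payment", "charged"]

# groupBy("word").count().sort('count', ascending=False) boils down to a
# frequency count sorted from most to least common
word_count = Counter(words).most_common()
print(word_count)  # [('payment', 3), ('charged', 2), ('fee', 1)]
```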
Chances are that once you view your data, you will want to make some edits: filtering out punctuation, removing uninsightful words ("I," "a"), or perhaps supplying a list of specific words you want to find. In this instance, I've removed some words that give no insight into the complaints. More often than not, this step can get fairly complex when the data has several quality issues.
# Filter out rows whose word appears in this stop list
li=["the","to","I","and","have", "for", "is", "this", "me", "XXXX", "a", "my", "that", "of", "was", "in", "on", "XXXX,", "it", "as", "are", "by", "or", "been", "would", "(", ")","", "XX/XX/XXXX", "XXXX."]
filtered_word_count = word_count.filter(~word_count.word.isin(li))
save(filtered_word_count)
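When the quality issues pile up (mixed case, trailing punctuation, redaction tokens like "XXXX"), normalizing each word before counting keeps the stop list short. Here is a minimal plain-Python sketch of that idea, with hypothetical tokens; in the materialized view itself, the equivalent normalization could be done with `f.lower` and `f.regexp_replace` before the groupBy:

```python
import re
from collections import Counter

# Hypothetical raw tokens with case and punctuation noise
raw_words = ["Payment,", "payment.", "PAYMENT", "fee!", "XXXX,", "the"]

stop_words = {"the", "xxxx"}

def normalize(word):
    # Lowercase, then strip anything that isn't a letter
    return re.sub(r'[^a-z]', '', word.lower())

cleaned = [w for w in (normalize(word) for word in raw_words)
           if w and w not in stop_words]

print(Counter(cleaned).most_common())  # [('payment', 3), ('fee', 1)]
```

Normalizing first means "Payment," "payment.", and "PAYMENT" all collapse into one entry, so the stop list only needs one lowercase form per word.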
Even after the materialized view is loaded and saved, you still might have too much data to build a meaningful insight. For example, 1000 complaints will still have ~10,000 words if each complaint is only ten words!
In this case, limit this insight's Max # of rows. Go to the insight settings and set it to 100.
Much better!!