In this article, you will learn two ways to run SQL queries from PySpark in an Incorta Notebook.
The first way is to use the sql() function on the SparkSession within a %pyspark paragraph. The sql() function runs the SQL query and returns the result as a DataFrame.
%pyspark
# load data
df = read("Titanic_PySpark.Titanic_train")
# create a temp view so the DataFrame can be queried with SQL
# (createOrReplaceTempView returns None, so there is no result to assign)
df.createOrReplaceTempView('temp')
# run sql queries
sql_df = spark.sql("SELECT Survived, Pclass, Name FROM temp")
# show data
incorta.show(sql_df)
# save data frame
save(sql_df)
You can register the resulting DataFrame as a temp view so that it can be queried from another SQL statement.
%pyspark
sql_df.createOrReplaceTempView("DF1")
In this way, you can build a multi-step data pipeline within a Materialized View (MV) and show the result of each step within the Incorta Notebook, as in the sketch below.
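For example, a second spark.sql() step can query the DF1 view registered above. The following is a minimal sketch; the aggregation and the passenger_count alias are illustrative, assuming the Titanic columns shown earlier:
%pyspark
# query the view registered in the previous step and aggregate by class
sql_df2 = spark.sql("SELECT Pclass, COUNT(*) AS passenger_count FROM DF1 GROUP BY Pclass")
# show the intermediate result in the notebook
incorta.show(sql_df2)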
The second way is to create a temp view in a %pyspark paragraph and run the SQL query in a separate %sql paragraph. The difference is that you cannot save the result as a DataFrame. You can, however, explore your data quickly this way and use Incorta Notebook's built-in visualizations to analyze it.
Code Sample:
%pyspark
# load data
df = read("Titanic_PySpark.Titanic_train")
# create a temp view so the DataFrame can be queried from the %sql paragraph
df.createOrReplaceTempView('temp')
%sql
select Survived, Pclass, Name from temp
%pyspark
# an MV must still save a DataFrame; save the original df, since the %sql result cannot be saved
save(df)