In this article, you will learn two ways to run SQL queries from PySpark in an Incorta Notebook.
The first way is to use the sql() function on the SparkSession within a %pyspark paragraph. The sql() function runs the SQL query and returns the result as a DataFrame.
%pyspark
# load data
df = read("Titanic_PySpark.Titanic_train")
# create a temp view so the DataFrame can be queried with SQL
# (createOrReplaceTempView returns None, so there is no result to assign)
df.createOrReplaceTempView('temp')
# run sql queries
sql_df = spark.sql("SELECT Survived, Pclass, Name FROM temp")
# show data
incorta.show(sql_df)
# save data frame
save(sql_df)
You can register the resulting DataFrame as a temp view so that it can be queried from another SQL statement.
%pyspark
sql_df.createOrReplaceTempView("DF1")
In this way, you can build a multi-step data pipeline within a Materialized View (MV) and show the result of each step within the Incorta Notebook, as in the sketch below.
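For example, a second spark.sql() step can query the DF1 view registered above. The following is a minimal sketch; the aggregation and the passenger_count alias are illustrative, assuming the Titanic columns shown earlier:
%pyspark
# query the view registered in the previous step and aggregate by class
sql_df2 = spark.sql("SELECT Pclass, COUNT(*) AS passenger_count FROM DF1 GROUP BY Pclass")
# show the intermediate result in the notebook
incorta.show(sql_df2)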
The second way is to create a temp view in a %pyspark paragraph and run the SQL query in a separate %sql paragraph. The difference is that you cannot save the result as a DataFrame. You can, however, explore your data quickly this way and use Incorta Notebook's built-in visualizations to analyze it.
Code Sample:
%pyspark
# load data
df = read("Titanic_PySpark.Titanic_train")
# create a temp view so the DataFrame can be queried from the %sql paragraph
df.createOrReplaceTempView('temp')
%sql
select Survived, Pclass, Name from temp
%pyspark
# an MV must still save a DataFrame; save the original df, since the %sql result cannot be saved
save(df)