09-25-2023 08:36 AM - edited 10-02-2023 01:33 PM
Zeppelin Notebooks (Materialized View interactive notebooks) have gained popularity as a versatile and interactive environment for data analysis and visualization. One of the standout features that sets Zeppelin apart is its support for dynamic forms. Dynamic forms empower data scientists and analysts to create more user-friendly and customizable notebooks by providing interactive widgets that allow users to modify variables, parameters, and inputs on the fly. In this knowledgebase article, we will delve into the world of dynamic forms in Zeppelin Notebooks, exploring their functionality, and follow on with an example. Whether you are a seasoned Zeppelin user or just getting started, understanding dynamic forms can significantly enhance your notebook-building capabilities, making your data-driven narratives more engaging and accessible.
The interactive notebook can afford data scientists and analysts quite a few capabilities to investigate, discover, and query data. However, when updating your scripts, you might find that many revisions are needed to iterate through different scenarios.
Wouldn't it be much easier to create a simple form that will, in turn, update your script?
Once you make your script dynamic, you can enhance exploration with:
For this use case, we will use the OnlineStore Data that defaults with the Incorta cloud. We want to do some in-notebook investigation and graph generation to understand our sales trends. However, we want to create a script allowing a user to select Sales realization by Order Date, Due Date, or Ship Date.
If you're good, you can start programming a placeholder variable into your script, but I prefer to choose one sample variable to create a working script. This way, I can quickly identify something wrong with my script logic vs. how forms have been implemented.
%pyspark
#Read in table
import pandas as pd
from pyspark.sql.functions import year, month
df = read("OnlineStore.salesorderheader")
#Create a temp view to query against using SparkSQL
df.createOrReplaceTempView('Preaggregate')
# Aggregate sales by day for August 2011
# Group by Month
df_item_with_data = spark.sql("""
select OrderDate, sum(TotalDue) as TotalDue
from Preaggregate
where OrderDate >= '2011-08-01' and OrderDate <= '2011-08-31'
Group by
OrderDate
Order by
OrderDate
""")
#Select fields based on the dynamic form
x_data = df_item_with_data.select("OrderDate").rdd.flatMap(lambda x: x).collect()
y_data = df_item_with_data.select("TotalDue").rdd.flatMap(lambda x: x).collect()
#Plot data
import matplotlib.pyplot as plt
import pandas as pd
plt.figure(figsize=(8, 10))
plt.plot(x_data, y_data, marker='o', linestyle='-', color='b', label='Line Chart')
plt.xticks(rotation=45)
plt.xlabel("OrderDate")
plt.ylabel("Y-axis Label")
plt.title("Line Chart")
plt.legend()
plt.grid(True) # Add grid lines
plt.show()
Output:
I find it easier to run basic aggregations in SQL, but more experienced Python pros will undoubtedly see more optimal methods than creating the temp table, as I did above.
Regardless, I have a working output. You, however, might want to view what this looks like to see the total revenue shipped by ShipDate. If interested, you'll have to update every mention of 'OrderDate' in this short script (7 places). We can make this so you can toggle a predefined set of fields.
Create a dynamic form for a drop-down by declaring the form label and each value pair. Each pair is actual value vs. display value.
selected_col = incorta.noteSelect("Select Date Column",[("OrderDate","Order Date"),("DueDate","Due Date"),("ShipDate","Ship Date")],"OrderDate")
When running this snippet, you'll see a drop-down at the top of the notebook.
Selecting the drop-down and then running the snippet will map any of these three values to the variable selected_col.
Now, we can use selected_col in our script and it will dynamically update based on our drop-down selection.
selected_col = incorta.noteSelect("Select Date Column",[("OrderDate","Order Date"),("DueDate","Due Date"),("ShipDate","Ship Date")],"OrderDate")
import pandas as pd
from pyspark.sql.functions import year, month
df = read("OnlineStore.salesorderheader")
df.createOrReplaceTempView('Preaggregate')
# Group by Month
df_item_with_data = spark.sql("""
select """+ selected_col +""", sum(TotalDue) as TotalDue
from Preaggregate
where """+ selected_col +""" >= '2011-08-01' and """+ selected_col +""" <= '2011-08-31'
Group by
"""+ selected_col +"""
Order by
"""+ selected_col +"""
""")
#Select fields based on the dynamic form
x_data = df_item_with_data.select(selected_col).rdd.flatMap(lambda x: x).collect()
y_data = df_item_with_data.select("TotalDue").rdd.flatMap(lambda x: x).collect()
#Plot data
import matplotlib.pyplot as plt
import pandas as pd
plt.figure(figsize=(8, 10))
plt.plot(x_data, y_data, marker='o', linestyle='-', color='b', label='Line Chart')
plt.xticks(rotation=45)
plt.xlabel(selected_col)
plt.ylabel("Y-axis Label")
plt.title("Line Chart")
plt.legend()
plt.grid(True) # Add grid lines
Output:
⚠️ Note that just selecting the drop-down is not enough. You'll need to run the dynamic form snippet to assign the value to selected_col. Then, run the impacted paragraphs using selected_col.
In this example, we've shared a method that allows the notebook user to 'select' an item from a drop-down. There are three other options available in dynamic forms:
Method | Description |
noteCheckbox(self, name, options, defaultChecked=[]) | Create a checkbox selection in Notebook. Valid for multiple parameter changes. |
noteSelect(self, name, options, defaultValue='') | Create a radio selection in Notebook. It helps allow only one parameter change at a time. |
noteTextbox(self, name, defaultValue='') | Create a text box form in Notebook. Useful for defining specified values. |
Learn more about the other forms here in the Zepplin documentation.