Introduction
This article discusses how to unlock complex data questions with Spark SQL Views and streamline Data Access with Real-World Examples.
What you should know before reading this article
We recommend that you be familiar with these Incorta concepts before exploring this topic further.
Applies to
These concepts apply to Incorta versions 2024.7.3 and above
Pre-requisites
- Work with the cloud support team to enable the Premium feature and the Advanced SQL Interface in a Incorta cloud cluster
- Enable Advanced SQL option from the CMC->Cluster Configuration->Incorta Labs
- Enable Spark SQL Views option from the CMC->Cluster Configuration->Incorta Labs
Let’s Go
- Advantages of using a Spark SQL View:
- Dynamic View created in Business Schema using SQL
- Can refer to any object in physical or business schema . (Exception: SQL view can not refer to SQL view and Analyzer View)
- Uses Incorta DDM and Spark
- Supports all kind of Joins
- Output of view is not materialized
- Interacts with dashboard using presentation variable
- Full power of Spark SQL
Here are some real world examples of how Spark SQL views can be used. Keep in mind that with Spark SQL views you can use any sql functions supported in Spark 3.x.
Show me data > than some Percentile in a Dashboard
- Use Case: You have customers sales orders and you want to find out a list of customers whose total order amounts are greater than Nth percentile where Nth percentile will be a user input, for example 90, 80, 70, 60 etc...
- Make sure you have a verified base business view in a business schema with columns as Customer, Order Amount. If it is verified the view will have a green check mark
- Create a new business view of type Spark SQL View . In the screenshot below that view is called Percentile with the sql below
- Notice it uses a getPresVar() function, this allows a dashboard presentation variable p_var to be passed to the Spark SQL View
- It also uses the spark sql’s PERCENTILE() function
data:image/s3,"s3://crabby-images/94c84/94c8474f555ce38a82b81c529e8bd2204e8713b0" alt="amit_kothari_1-1733781802376.png amit_kothari_1-1733781802376.png"
WITH Customer_Aggregates AS
(SELECT Customer,
SUM(Order_Amount) AS total_order_amount
FROM OrderToCashCopilot.Order2Cash
WHERE Order_Amount > 1000
GROUP BY Customer
),
Percentile_Calculation AS
(SELECT case
when getPresVar("p_var", "90") = 90
then PERCENTILE(total_order_amount, 0.9)
when getPresVar("p_var", "90") = 80
then PERCENTILE(total_order_amount, 0.8)
when getPresVar("p_var", "90") = 70
then PERCENTILE(total_order_amount, 0.7)
when getPresVar("p_var", "90") = 60
then PERCENTILE(total_order_amount, 0.6)
else PERCENTILE(total_order_amount, 0.5)
end percentile_n
FROM
Customer_Aggregates
)
SELECT c.Customer,
round(c.total_order_amount) total_order_amount
FROM Customer_Aggregates c
JOIN
Percentile_Calculation p
ON c.total_order_amount > p.percentile_n
ORDER BY
c.total_order_amount DESC
- Now create a dashboard on top of this Business View with a presentation variable p_var which has the percentile values of 90, 80, 70, 60 and 50
data:image/s3,"s3://crabby-images/311a2/311a2e8a0f2398fc1e5a9253e854013b27436b41" alt="amit_kothari_2-1733781802393.png amit_kothari_2-1733781802393.png"
Show me how to report outliers based on Standard Deviation
- Use Case: You have customers sales orders and you want to calculate the standard deviation of the amounts and report the outliers. We will use the STDDEV spark sql function.
- Make sure you have a verified base business view in a business schema . If it is verified the view will have a green check mark
- Create a new business view of type Spark SQL View in a new or existing business schema . In the screenshot below that view is called sales_data with the sql below , it defines outliers to be 3 standard deviation from the mean
- Create a dashboard on top of this Business View
WITH stddev_per_customer AS (
SELECT
AVG(sales_amount) AS sales_avg,
STDDEV(sales_amount) AS sales_stddev
FROM sales.sales_data
)
SELECT
s.customer,
s.sales_amount,
sp.sales_avg,
sp.sales_stddev
FROM sales.sales_data s
JOIN stddev_per_customer sp
ON ( s.sales_amount < (sp.sales_avg - 3*sp.sales_stddev)
OR s.sales_amount > (sp.sales_avg + 3*sp.sales_stddev)
)
Showing percent of total in Sub Total
- Please refer to this video , check below for additional videos
Related Material