dylanwan
Employee

Overview

Sparkflows provides a no-code/low-code alternative to writing an Incorta materialized view by hand. This article discusses how to configure Sparkflows to submit Spark jobs to Incorta Cloud and how to deploy a Sparkflows workflow as PySpark.

What is Chidori?

Chidori is both a technology and a product service from Incorta that enables Incorta to run Spark jobs without affecting regular Incorta operations. To learn more about the Chidori service from Incorta, please watch this video.


Incorta and Sparkflows Integration

dylanwan_0-1715866539629.png

Below are the steps to integrate Incorta and Sparkflows.

Configure the Connections

Sparkflows supports two types of connections:

  • Compute connection
  • Data connection

By default, Sparkflows jobs are submitted on the local machine itself. Sparkflows can also be configured to submit jobs to a cluster via a compute connection. Chidori is Incorta's AI/ML cluster management service; by connecting to Incorta Chidori, Sparkflows jobs are submitted to the Incorta Kubernetes (K8s) Spark cluster.
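The two connection kinds can be pictured as configuration objects. The field names below are illustrative assumptions, not Sparkflows' actual connection schema:

```python
# Illustrative only: the two Sparkflows connection kinds described above.
# Field names are assumptions, not the actual Sparkflows configuration schema.
compute_connection = {
    "name": "incorta-chidori",
    "type": "chidori",      # submits jobs to the Incorta K8s Spark cluster
    "extra_files": [],      # jar/python files required by the jobs
}
data_connection = {
    "name": "incorta-object-store",
    "type": "incorta",      # used by nodes such as "Read Incorta"
}
```

A compute connection decides *where* a job runs; a data connection decides *what* the job reads or writes.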

Connect to Incorta Chidori as a compute connection from Sparkflows

In Sparkflows, navigate to Administration -> Configuration -> Global Connection.

dylanwan_0-1715842774580.png

Pick "Chidori" as the connection type.

dylanwan_1-1715842831796.png
You can add the jar or python files required for your Sparkflows jobs.
dylanwan_2-1715842869107.png

Create a Project and Workflow in Sparkflows

Connect to Incorta as a data source from Sparkflows

Create a Dataset with an Incorta object store connection:

dylanwan_3-1715843031586.png

dylanwan_4-1715843046347.png

Read Incorta Data in a Sparkflows Workflow

Use the "Read Incorta" node in a Sparkflows workflow to ingest data from Incorta.

dylanwan_5-1715843118914.png
"Read Incorta" lets you use data that has been extracted, transformed, and enriched in Incorta for further processing in Sparkflows.
In the example below, Demand_Forecasting_Usecase is an Incorta schema name, and Demand_forecasting_Train is a table or materialized view in Incorta.
 
dylanwan_6-1715843140512.png

After you refresh the schema from Incorta, the columns and their types are refreshed in Sparkflows and become available to the rest of the workflow.
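Under the hood, a node like "Read Incorta" resolves a schema-qualified name to data Spark can load. A minimal sketch, assuming Incorta tables are reachable as parquet under a shared storage root (the storage root and path layout here are assumptions, not Incorta's documented layout):

```python
def incorta_table_path(storage_root: str, schema: str, table: str) -> str:
    """Build a parquet path for an Incorta table (assumed layout)."""
    return f"{storage_root}/{schema}/{table}"

path = incorta_table_path(
    "gs://incorta-tenant",          # hypothetical shared-storage root
    "Demand_Forecasting_Usecase",   # Incorta schema from the example
    "Demand_forecasting_Train",     # table / materialized view from the example
)
# With an active SparkSession, the workflow would then load it with:
# df = spark.read.parquet(path)
```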

Save and Share data with Incorta from Sparkflows

Use the Save Parquet node to persist the output data in a GCS bucket that is accessible to Incorta.

dylanwan_7-1715843240816.png

 

dylanwan_8-1715843270418.png
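The Save Parquet step amounts to writing the workflow's output DataFrame to a shared GCS location. A hedged sketch, where the bucket name and directory layout are hypothetical:

```python
def gcs_output_path(bucket: str, dataset: str) -> str:
    """Build the GCS destination the Save Parquet node writes to (hypothetical layout)."""
    if not bucket.startswith("gs://"):
        raise ValueError("expected a GCS bucket URI, e.g. gs://my-bucket")
    return f"{bucket}/sparkflows_output/{dataset}"

dest = gcs_output_path("gs://shared-bucket", "cltv_predictions")
# In the running workflow, Spark would persist the result with:
# df.write.mode("overwrite").parquet(dest)
```

Because the bucket is readable by Incorta, the saved parquet files can then be picked up on the Incorta side.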

ML Model Registry

Save an ML Model to Incorta

dylanwan_9-1715843297052.png

 

dylanwan_10-1715843304884.png

View Model Registry in Sparkflows

In the Sparkflows Model Registry, the model summary, hyper-parameters, performance metrics, feature importances, and model path are stored for each executed model, allowing users to compare different models.

dylanwan_0-1715868365652.png
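Comparing models from the registry can be sketched as picking the best entry by a metric. The entries and field names below are assumptions for illustration, not the actual Sparkflows registry schema:

```python
# Hypothetical registry entries: fields mirror what the article lists
# (hyper-parameters, metrics, model path), but names and values are invented.
registry = [
    {"model": "rf_v1",  "params": {"numTrees": 100}, "rmse": 12.4,
     "path": "gs://shared-bucket/models/rf_v1"},
    {"model": "gbt_v2", "params": {"maxIter": 50},   "rmse": 10.9,
     "path": "gs://shared-bucket/models/gbt_v2"},
]
best = min(registry, key=lambda m: m["rmse"])   # lower RMSE is better
```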

Create an Incorta MV using the PySpark generated by Sparkflows

Generate PySpark Code

PySpark code can be generated for a workflow and executed in any Spark environment by following the steps below:

dylanwan_1-1715866694684.png

Click the Copy to Clipboard button to copy the generated code.

dylanwan_2-1715866710263.png
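The copied code can then be pasted into an Incorta materialized view. A minimal sketch of the wiring, with stand-in `read`/`save` helpers (inside a real Incorta MV, reading and saving are provided by the platform; the stubs below exist only to keep the sketch self-contained and runnable):

```python
def build_mv(read, save, transform):
    """Run a Sparkflows-generated transform as the body of an Incorta MV."""
    df = read("Demand_Forecasting_Usecase.Demand_forecasting_Train")
    save(transform(df))

# Stand-ins so the sketch runs without Incorta; a real MV would not define these.
store = {}
fake_read = lambda name: [1, 2, 3]              # pretend DataFrame
fake_save = lambda df: store.update(result=df)  # pretend persistence
build_mv(fake_read, fake_save, lambda df: [x * 2 for x in df])
```

The `transform` argument is where the Sparkflows-generated PySpark logic would go.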

Below is a dashboard in Incorta with Order Summary details and the predicted CLTV for each customer.

Image 21-05-2024 at 18.23.jpeg

Last update: 06-04-2024 02:50 PM