Sparkflows provides a no-code/low-code alternative to writing an Incorta materialized view yourself. This article discusses how to configure Sparkflows to submit Spark jobs to Incorta Cloud and how to deploy a Sparkflows workflow as PySpark code.
Chidori is both a technology and a product service from Incorta that lets Incorta run Spark jobs without affecting regular Incorta operations. To learn more about the Chidori service from Incorta, please watch this video.
Sparkflows supports two types of connections:
By default, Sparkflows jobs run on the local machine itself. Sparkflows can also be configured to submit jobs to a cluster via its compute connection. Chidori is Incorta's AI/ML cluster management service; by connecting to Incorta Chidori, Sparkflows submits its jobs to the Incorta Kubernetes (K8s) Spark cluster.
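For context, the compute connection handles cluster targeting on your behalf. A minimal sketch of what submitting to a Kubernetes Spark cluster looks like in plain PySpark, assuming entirely placeholder endpoint and image values (Sparkflows and Chidori manage this automatically; none of these values come from the article):

```python
from pyspark.sql import SparkSession

# Illustrative only: a hand-written equivalent of pointing a Spark job at a
# Kubernetes cluster. The API server URL and container image are placeholders.
spark = (
    SparkSession.builder
    .appName("sparkflows-job")                      # hypothetical job name
    .master("k8s://https://<incorta-k8s-api>:443")  # placeholder cluster endpoint
    .config("spark.kubernetes.container.image", "<spark-image>")  # placeholder image
    .getOrCreate()
)
```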
In Sparkflows, navigate to Administration -> Configuration -> Global Connection
Pick "Chidori" as the connection type.
Create a Dataset with an Incorta object store connection:
Use the 'Read Incorta' node in a Sparkflows workflow to ingest data from Incorta.
After you refresh the schema from Incorta, the columns and their types are refreshed in Sparkflows and become available to the rest of the workflow, as sketched below.
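Under the hood, Incorta persists table data as Parquet in shared storage, which is what a Read Incorta node ultimately reads. A minimal hand-written sketch of the equivalent PySpark, assuming a hypothetical tenant bucket and schema/table path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-incorta-example").getOrCreate()

# Placeholder path: the actual location of a table's Parquet files depends
# on your Incorta deployment and tenant storage layout.
df = spark.read.parquet("gs://<incorta-tenant-bucket>/parquet/SALES/ORDERS")

# Columns and types, as reflected in Sparkflows after a schema refresh.
df.printSchema()
```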
Use the Save Parquet node to persist the output data in a GCS bucket that is accessible to Incorta.
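A rough equivalent of the Save Parquet node in plain PySpark; the bucket name, path, and sample columns below are placeholder assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("save-parquet-example").getOrCreate()

# Placeholder stand-in for the output of the upstream workflow nodes.
predictions_df = spark.createDataFrame(
    [("C001", 1250.0), ("C002", 310.5)],
    ["customer_id", "predicted_cltv"],
)

# Persist to a GCS bucket that the Incorta cluster can also read
# (bucket and path are placeholders).
(
    predictions_df.write
    .mode("overwrite")
    .parquet("gs://<shared-gcs-bucket>/sparkflows/cltv_predictions")
)
```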
In the Sparkflows Model Registry, the model summary, hyperparameters, performance metrics, feature importances, and model path are stored for each executed model, allowing users to compare different models.
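To illustrate the kind of information the registry captures per run, here is a small Spark ML sketch that produces hyperparameters, a performance metric, feature importances, and a model path. The training data, column names, and storage path are hypothetical, not taken from the article:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator

spark = SparkSession.builder.appName("model-registry-example").getOrCreate()

# Placeholder training data standing in for the workflow's feature table.
train_df = spark.createDataFrame(
    [(5, 200.0, 1250.0), (2, 80.0, 310.5), (9, 450.0, 2890.0)],
    ["order_count", "total_spend", "cltv"],
)

assembler = VectorAssembler(
    inputCols=["order_count", "total_spend"], outputCol="features"
)
assembled = assembler.transform(train_df)

rf = RandomForestRegressor(featuresCol="features", labelCol="cltv", numTrees=20)
model = rf.fit(assembled)

# The kinds of values a model registry would record for this run:
hyper_params = {"numTrees": rf.getNumTrees(), "maxDepth": rf.getMaxDepth()}
rmse = RegressionEvaluator(
    labelCol="cltv", predictionCol="prediction", metricName="rmse"
).evaluate(model.transform(assembled))
feature_importances = model.featureImportances  # per-feature contribution
model_path = "gs://<shared-gcs-bucket>/models/cltv_rf"  # placeholder path
model.write().overwrite().save(model_path)
```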
PySpark code can be generated for a workflow and executed on any Spark environment by following the steps below:
Click on the Copy to Clipboard button to copy the generated code.
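The generated code depends on the nodes in your workflow, but its general shape is read, transform, write. A hypothetical sketch with placeholder paths and columns (not the actual Sparkflows output):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("generated-workflow").getOrCreate()

# Read Incorta node -> read the source table's Parquet files (placeholder path).
orders = spark.read.parquet("gs://<incorta-tenant-bucket>/parquet/SALES/ORDERS")

# Transform nodes -> e.g. aggregate order history per customer.
summary = orders.groupBy("customer_id").agg(
    F.count("*").alias("order_count"),
    F.sum("amount").alias("total_spend"),
)

# Save Parquet node -> persist output where Incorta can read it (placeholder path).
summary.write.mode("overwrite").parquet(
    "gs://<shared-gcs-bucket>/sparkflows/order_summary"
)
```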
Below is a dashboard in Incorta showing Order Summary details and the predicted CLTV (customer lifetime value) for each customer.