Sparkflows provides a no-code/low-code alternative to writing an Incorta materialized view yourself. This article discusses how to configure Sparkflows to submit Spark jobs to Incorta Cloud and how to deploy a Sparkflows workflow as PySpark code.
Chidori is both a technology and a product service from Incorta that lets Incorta run Spark jobs without affecting regular Incorta operations. To learn more about the Chidori service from Incorta, please watch this video.
Sparkflows supports two types of connections:
By default, Sparkflows jobs run on the local machine itself. Sparkflows can also be configured to submit jobs to a cluster via its compute connection. Chidori is Incorta's AI/ML cluster management service; by connecting to Incorta Chidori, Sparkflows submits its jobs to the Incorta Kubernetes (K8s) Spark cluster.
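For context, the compute connection handles cluster targeting on your behalf. A minimal sketch of what submitting to a Kubernetes Spark cluster looks like in plain PySpark, assuming entirely placeholder endpoint and image values (Sparkflows and Chidori manage this automatically; none of these values come from the article):

```python
from pyspark.sql import SparkSession

# Illustrative only: a hand-written equivalent of pointing a Spark job at a
# Kubernetes cluster. The API server URL and container image are placeholders.
spark = (
    SparkSession.builder
    .appName("sparkflows-job")                      # hypothetical job name
    .master("k8s://https://<incorta-k8s-api>:443")  # placeholder cluster endpoint
    .config("spark.kubernetes.container.image", "<spark-image>")  # placeholder image
    .getOrCreate()
)
```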
In Sparkflows, navigate to Administration -> Configuration -> Global Connection
Pick "Chidori" as the connection type.
Create a Dataset with an Incorta object store connection:
Use the 'Read Incorta' node in a Sparkflows workflow to ingest data from Incorta.
After you refresh the schema from Incorta, the columns and their types are refreshed in Sparkflows and become available to the rest of the workflow, as sketched below.
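Under the hood, Incorta persists table data as Parquet in shared storage, which is what a Read Incorta node ultimately reads. A minimal hand-written sketch of the equivalent PySpark, assuming a hypothetical tenant bucket and schema/table path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-incorta-example").getOrCreate()

# Placeholder path: the actual location of a table's Parquet files depends
# on your Incorta deployment and tenant storage layout.
df = spark.read.parquet("gs://<incorta-tenant-bucket>/parquet/SALES/ORDERS")

# Columns and types, as reflected in Sparkflows after a schema refresh.
df.printSchema()
```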
Use the Save Parquet node to persist the output data in a GCS bucket that is accessible to Incorta.
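A rough equivalent of the Save Parquet node in plain PySpark; the bucket name, path, and sample columns below are placeholder assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("save-parquet-example").getOrCreate()

# Placeholder stand-in for the output of the upstream workflow nodes.
predictions_df = spark.createDataFrame(
    [("C001", 1250.0), ("C002", 310.5)],
    ["customer_id", "predicted_cltv"],
)

# Persist to a GCS bucket that the Incorta cluster can also read
# (bucket and path are placeholders).
(
    predictions_df.write
    .mode("overwrite")
    .parquet("gs://<shared-gcs-bucket>/sparkflows/cltv_predictions")
)
```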
In the Sparkflows Model Registry, the model summary, hyperparameters, performance metrics, feature importances, and model path are stored for each executed model, allowing users to compare different models.
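To illustrate the kind of information the registry captures per run, here is a small Spark ML sketch that produces hyperparameters, a performance metric, feature importances, and a model path. The training data, column names, and storage path are hypothetical, not taken from the article:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator

spark = SparkSession.builder.appName("model-registry-example").getOrCreate()

# Placeholder training data standing in for the workflow's feature table.
train_df = spark.createDataFrame(
    [(5, 200.0, 1250.0), (2, 80.0, 310.5), (9, 450.0, 2890.0)],
    ["order_count", "total_spend", "cltv"],
)

assembler = VectorAssembler(
    inputCols=["order_count", "total_spend"], outputCol="features"
)
assembled = assembler.transform(train_df)

rf = RandomForestRegressor(featuresCol="features", labelCol="cltv", numTrees=20)
model = rf.fit(assembled)

# The kinds of values a model registry would record for this run:
hyper_params = {"numTrees": rf.getNumTrees(), "maxDepth": rf.getMaxDepth()}
rmse = RegressionEvaluator(
    labelCol="cltv", predictionCol="prediction", metricName="rmse"
).evaluate(model.transform(assembled))
feature_importances = model.featureImportances  # per-feature contribution
model_path = "gs://<shared-gcs-bucket>/models/cltv_rf"  # placeholder path
model.write().overwrite().save(model_path)
```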
PySpark code can be generated for a workflow and executed on any Spark environment by following the steps below:
Click on the Copy to Clipboard button to copy the generated code.
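The generated code depends on the nodes in your workflow, but its general shape is read, transform, write. A hypothetical sketch with placeholder paths and columns (not the actual Sparkflows output):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("generated-workflow").getOrCreate()

# Read Incorta node -> read the source table's Parquet files (placeholder path).
orders = spark.read.parquet("gs://<incorta-tenant-bucket>/parquet/SALES/ORDERS")

# Transform nodes -> e.g. aggregate order history per customer.
summary = orders.groupBy("customer_id").agg(
    F.count("*").alias("order_count"),
    F.sum("amount").alias("total_spend"),
)

# Save Parquet node -> persist output where Incorta can read it (placeholder path).
summary.write.mode("overwrite").parquet(
    "gs://<shared-gcs-bucket>/sparkflows/order_summary"
)
```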
Below is a dashboard in Incorta showing Order Summary details and the predicted CLTV (customer lifetime value) for each customer.