Build Machine Learning Models using Incorta Materialized View
An Incorta Materialized View (MV) can run PySpark, Scala, and R code, which makes it suitable for building machine learning models.
PySpark, Scala, and R are discussed separately.
Here are the best practices for using Incorta ML for model training and testing:
- Incorta ML requires you to save a dataframe as the result. This can be any dataframe, for example the result of applying the model to the training or testing data.
- Model building and the actual inference using the model can be in separate MVs, and those MVs can be placed in different schemas.
- Use the incorta_ml package, which can simplify the model building process. Currently, incorta_ml is available in PySpark.
Set the property spark.dataframe.sampling.enabled to false for any Incorta MV that is used to build an ML model. By default, an Incorta MV uses data sampling when saving the MV.
All models should be saved to the same location on disk in an on-premises environment. A good choice is a folder under <Incorta Tenant Folder>/data/model. Putting the models in the shared tenant folder makes them accessible from every node in a multi-node environment.
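One way to keep every model under a single location is a small path helper. The sketch below is only an illustration: the function name `model_path` and the model name `churn_model` are hypothetical, and a temporary directory stands in for the real tenant folder.

```python
import tempfile
from pathlib import Path


def model_path(tenant_folder: str, model_name: str) -> Path:
    """Return <tenant_folder>/data/model/<model_name>.

    The shared parent folder is created if it is missing. The model folder
    itself is left for the ML library's own save call to create.
    """
    path = Path(tenant_folder) / "data" / "model" / model_name
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


# A temporary directory stands in for the Incorta tenant folder here.
tenant_folder = tempfile.mkdtemp()
churn_path = model_path(tenant_folder, "churn_model")
```

Routing every save and load through one helper like this means the training MV and the inference MV resolve the same path.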
Different ML libraries may provide different ways of saving models. Use the native API of the corresponding ML library to save the models.
The load of an ML job may be very different from that of regular data refresh jobs. Test with a small data set first and assess the impact before you run or deploy the model building MVs.