PySpark Regressions using pyspark.ml Library

mkrieger · ‎01-28-2024

I am developing a pipeline for some regression modeling I am experimenting with, and I have a working script and output that I am reasonably happy with. However, I am unable to write new scripts using the ml library. I can't even copy and paste my working code into a new materialized view and run it.

If I copy and paste it into a new materialized view, I start hitting errors after all my data cleaning, when I try to fit my regression here:

# Importing libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
import pyspark.sql.functions as F
from pyspark.sql import Row
from pyspark.sql.types import ArrayType, DoubleType
# ML library
# documentation: https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml import Pipeline
from pyspark.ml.linalg import Vectors
# [...]
# Skipping my data cleaning process for sake of simplicity
# [...]

lr = LinearRegression(featuresCol='features', labelCol='target')
lr_model = lr.fit(training_data)

I get the following error message:

Error An error occurred while calling o605.fit.
: java.util.NoSuchElementException: next on empty iterator
Py4JJavaError : An error occurred while calling o605.fit.
: java.util.NoSuchElementException: next on empty iterator

This exact script works fine in the materialized view I developed it in. However, if I copy it to a new materialized view to alter it (for example, to test different modeling methods such as decision trees or time-lag modeling), I receive the above error.

How can I reliably use the ml library in Incorta?

dylanwan · ‎05-06-2025

Incorta MV execution succeeds in the Incorta Notebook but fails to save

The "next on empty iterator" error is probably caused by a lack of data.

When an MV is saved for the first time, we apply sampling logic to improve performance.

This can become an issue if the code assumes that data exists, for example when a model is fit on a DataFrame that turns out to be empty.
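
One way to guard against this is to check that the training DataFrame actually has rows before calling fit. The following is only a minimal sketch: `fit_if_data` and `min_rows` are hypothetical names introduced here for illustration, while `lr` and `training_data` are the variables from the question. It uses only the standard DataFrame `limit()` and `count()` calls and the estimator's `fit()`.

```python
def fit_if_data(estimator, df, min_rows=1):
    """Fit `estimator` on `df` only if `df` has at least `min_rows` rows.

    Returns the fitted model, or None when there is too little data
    (for example, when the input has been sampled down to nothing).
    """
    # limit() keeps the check cheap: at most min_rows rows are counted.
    available = df.limit(min_rows).count()
    if available < min_rows:
        print(f"Skipping fit: {available} row(s) available, need {min_rows}")
        return None
    return estimator.fit(df)


# Usage with the names from the question (assumed to be defined already):
# lr_model = fit_if_data(lr, training_data)
# if lr_model is None:
#     ...  # handle the no-data case instead of calling lr.fit() directly
```

What the MV should return when the fit is skipped depends on the rest of the script and is not covered in this thread; the point is only to avoid calling fit on an empty DataFrame.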