PySpark Regressions using pyspark.ml Library

mkrieger · ‎01-28-2024

I am developing a pipeline for some regression modeling I am experimenting with and I've got a working script and output that I am reasonably happy with. However I am unable to write new scripts using the ml library. I'm not even able to copy and paste my working code into a new materialized view and run it.

If I copy and paste into a new materialized view I start hitting errors after all my data cleaning when I try to fit my regression here

# Importing libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
import pyspark.sql.functions as F
from pyspark.sql import Row
from pyspark.sql.types import ArrayType, DoubleType
# ML library
# documentation: https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml import Pipeline
from pyspark.ml.linalg import Vectors
# [...]
# Skipping my data cleaning process for sake of simplicity
# [...]

lr = LinearRegression(featuresCol='features', labelCol='target')
lr_model = lr.fit(training_data)

I return the following error message

Error An error occurred while calling o605.fit.
: java.util.NoSuchElementException: next on empty iterator
Py4JJavaError : An error occurred while calling o605.fit.
: java.util.NoSuchElementException: next on empty iterator

This exact script works fine in the materialized view I developed it in. However if I copy it to a new materialized view to alter (for example if I want to test out some different modeling methods like decision trees or time lag modeling) then I receive the above error.

How can I reliably use the ml library in Incorta?