PySpark Regressions using pyspark.ml Library

mkrieger — Sun, 28 Jan 2024 23:20:13 GMT

I am developing a pipeline for some regression modeling I am experimenting with, and I have a working script and output that I am reasonably happy with. However, I am unable to write new scripts using the ml library. I can't even copy and paste my working code into a new materialized view and run it.

If I copy and paste it into a new materialized view, I start hitting errors after all my data cleaning, when I try to fit my regression here:

# Importing libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
import pyspark.sql.functions as F
from pyspark.sql import Row
from pyspark.sql.types import ArrayType, DoubleType

# ML library
# documentation: https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml import Pipeline
from pyspark.ml.linalg import Vectors

# [...]
# Skipping my data cleaning process for sake of simplicity
# [...]

lr = LinearRegression(featuresCol='features', labelCol='target')
lr_model = lr.fit(training_data)

I get the following error message:

Error
An error occurred while calling o605.fit.
: java.util.NoSuchElementException: next on empty iterator
Py4JJavaError: An error occurred while calling o605.fit.
: java.util.NoSuchElementException: next on empty iterator

This exact script works fine in the materialized view I developed it in. However, if I copy it to a new materialized view to alter it (for example, to test different modeling methods such as decision trees or time lag modeling), I receive the above error.

How can I reliably use the ml library in Incorta?

Re: PySpark Regressions using pyspark.ml Library

dylanwan — Tue, 06 May 2025 23:36:47 GMT

Incorta MV execution succeeds in the Incorta Notebook but fails to save

The "next on empty iterator" error is probably caused by a lack of data.

We add sampling logic when an MV is saved for the first time, to improve performance.

This can become an issue if your logic assumes that data exists.
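
One way to make the script tolerate an empty or sampled input is to check for at least one row before calling fit. This is only a minimal sketch: the helper name fit_if_data is hypothetical, and what the MV should do when the model is skipped (for example, what it saves instead) is an assumption that depends on your script.

```python
def fit_if_data(estimator, training_data):
    """Fit the estimator only if training_data has at least one row.

    fit_if_data is a hypothetical helper, not part of pyspark or Incorta.
    Returns the fitted model, or None when the DataFrame is empty.
    """
    # DataFrame.head(1) returns a list with at most one Row, so this
    # check avoids a full count() over the data.
    if len(training_data.head(1)) == 0:
        print("training_data is empty; skipping fit")
        return None
    return estimator.fit(training_data)
```

In the script above, the last line would become lr_model = fit_if_data(lr, training_data), and any later step that uses lr_model would need to handle the None case.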
