<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic PySpark Regressions using pyspark.ml Library in Dashboards &amp; Analytics Discussions</title>
    <link>https://community.incorta.com/t5/dashboards-analytics-discussions/pyspark-regressions-using-pyspark-ml-library/m-p/5442#M630</link>
    <description>&lt;P&gt;I am developing a pipeline for some regression modeling I am experimenting with, and I've got a working script and output that I am reasonably happy with. However, I am unable to write new scripts using the ml library. I'm not even able to copy and paste my working code into a new materialized view and run it.&lt;/P&gt;&lt;P&gt;If I copy and paste it into a new materialized view, I start hitting errors after all my data cleaning, when I try to fit my regression here:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;# Importing libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
import pyspark.sql.functions as F
from pyspark.sql import Row
from pyspark.sql.types import ArrayType, DoubleType
# ML library
# documentation: https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml import Pipeline
from pyspark.ml.linalg import Vectors
# [...]
# Skipping my data cleaning process for sake of simplicity
# [...]

lr = LinearRegression(featuresCol='features', labelCol='target')
lr_model = lr.fit(training_data)&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I get the following error message:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;Error An error occurred while calling o605.fit.
: java.util.NoSuchElementException: next on empty iterator
Py4JJavaError : An error occurred while calling o605.fit.
: java.util.NoSuchElementException: next on empty iterator&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;This exact script works fine in the materialized view I developed it in. However, if I copy it to a new materialized view to alter it (for example, to test different modeling methods such as decision trees or time-lag modeling), I receive the above error.&lt;/P&gt;&lt;P&gt;How can I reliably use the ml library in Incorta?&lt;/P&gt;</description>
    <pubDate>Sun, 28 Jan 2024 23:20:13 GMT</pubDate>
    <dc:creator>mkrieger</dc:creator>
    <dc:date>2024-01-28T23:20:13Z</dc:date>
    <item>
      <title>PySpark Regressions using pyspark.ml Library</title>
      <link>https://community.incorta.com/t5/dashboards-analytics-discussions/pyspark-regressions-using-pyspark-ml-library/m-p/5442#M630</link>
      <description>&lt;P&gt;I am developing a pipeline for some regression modeling I am experimenting with, and I've got a working script and output that I am reasonably happy with. However, I am unable to write new scripts using the ml library. I'm not even able to copy and paste my working code into a new materialized view and run it.&lt;/P&gt;&lt;P&gt;If I copy and paste it into a new materialized view, I start hitting errors after all my data cleaning, when I try to fit my regression here:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;# Importing libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
import pyspark.sql.functions as F
from pyspark.sql import Row
from pyspark.sql.types import ArrayType, DoubleType
# ML library
# documentation: https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml import Pipeline
from pyspark.ml.linalg import Vectors
# [...]
# Skipping my data cleaning process for sake of simplicity
# [...]

lr = LinearRegression(featuresCol='features', labelCol='target')
lr_model = lr.fit(training_data)&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I get the following error message:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;Error An error occurred while calling o605.fit.
: java.util.NoSuchElementException: next on empty iterator
Py4JJavaError : An error occurred while calling o605.fit.
: java.util.NoSuchElementException: next on empty iterator&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;This exact script works fine in the materialized view I developed it in. However, if I copy it to a new materialized view to alter it (for example, to test different modeling methods such as decision trees or time-lag modeling), I receive the above error.&lt;/P&gt;&lt;P&gt;How can I reliably use the ml library in Incorta?&lt;/P&gt;</description>
      <pubDate>Sun, 28 Jan 2024 23:20:13 GMT</pubDate>
      <guid>https://community.incorta.com/t5/dashboards-analytics-discussions/pyspark-regressions-using-pyspark-ml-library/m-p/5442#M630</guid>
      <dc:creator>mkrieger</dc:creator>
      <dc:date>2024-01-28T23:20:13Z</dc:date>
    </item>
    <item>
      <title>Re: PySpark Regressions using pyspark.ml Library</title>
      <link>https://community.incorta.com/t5/dashboards-analytics-discussions/pyspark-regressions-using-pyspark-ml-library/m-p/6476#M800</link>
      <description>&lt;P&gt;&lt;A id="link_6" class="page-link lia-link-navigation lia-custom-event" href="https://community.incorta.com/t5/data-schemas-knowledgebase/incorta-mv-execution-succeeds-in-the-incorta-notebook-but-fails/ta-p/2894" target="_blank"&gt;Incorta MV execution succeeds in the Incorta Notebook but fails to save&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;The "next on empty iterator" error is probably caused by a lack of data.&lt;/P&gt;
&lt;P&gt;To improve performance, we apply sampling logic the first time an MV is saved.&lt;/P&gt;
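&lt;P&gt;One way to guard against an empty sampled DataFrame, as a minimal sketch rather than an official Incorta pattern, is to check for at least one row before fitting. The helper names has_rows and fit_if_data are hypothetical; head(1) is the standard PySpark DataFrame call and returns an empty list when there is no data.&lt;/P&gt;

```python
def has_rows(df):
    """Return True if the DataFrame has at least one row.

    head(1) fetches at most one row, so it avoids a full count().
    """
    return bool(df.head(1))


def fit_if_data(estimator, df):
    """Fit the estimator only when df is non-empty; otherwise return None.

    Both names are hypothetical helpers, not part of pyspark.ml.
    """
    if not has_rows(df):
        # Nothing to train on (for example, an empty sample): skip the fit
        return None
    return estimator.fit(df)
```

&lt;P&gt;In the script from the question, the call would become lr_model = fit_if_data(lr, training_data). What the MV should do when no model comes back is left to the script.&lt;/P&gt;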
&lt;P&gt;The sampling can become an issue if the script's logic assumes that data exists.&lt;/P&gt;</description>
      <pubDate>Tue, 06 May 2025 23:36:47 GMT</pubDate>
      <guid>https://community.incorta.com/t5/dashboards-analytics-discussions/pyspark-regressions-using-pyspark-ml-library/m-p/6476#M800</guid>
      <dc:creator>dylanwan</dc:creator>
      <dc:date>2025-05-06T23:36:47Z</dc:date>
    </item>
  </channel>
</rss>

