Py4JJavaError: An error occurred while calling o903.fit.

Below is the pySpark code I am trying to run:-

from   pyspark.sql   import   SparkSession
from pyspark.ml import Pipeline
from   pyspark.ml.linalg   import   Vectors
from   pyspark.ml.feature   import   IndexToString,   StringIndexer,   VectorAssembler
from   pyspark.ml.classification   import   RandomForestClassifier
from   pyspark.ml.evaluation   import   MulticlassClassificationEvaluator
df   =   read("sch_media.subscription_data")
indexerState   =   StringIndexer(inputCol="STATE",   outputCol="label")
indexerCity   =   StringIndexer(inputCol="CITY",   outputCol="CityIndex")
indexerBoqName   =   StringIndexer(inputCol="BOUQUET_NAME",   outputCol="BoqNameIndex")
indexerChannel   =   StringIndexer(inputCol="CHANNEL_NAME",   outputCol="ChannelIndex")
indexerStb  = StringIndexer(inputCol="STB_TYPE", outputCol="StbIndex")
indexerBoqAla = StringIndexer(inputCol="BQ_ALA", outputCol="BoqAlaIndex")
indexerRate = StringIndexer(inputCol="RATE", outputCol="RateIndex")
assembler   =   VectorAssembler(inputCols=["CityIndex","BoqNameIndex","ChannelIndex","StbIndex","BoqAlaIndex","RateIndex"],outputCol="features")
rf   =   RandomForestClassifier(labelCol="label",   featuresCol="features",   numTrees=10)
pipeline   =   Pipeline(stages=[indexerState,indexerCity,indexerBoqName,indexerChannel,indexerStb,indexerBoqAla,indexerRate, assembler,   rf])


I am getting this error:- Py4JJavaError: An error occurred while calling o903.fit.

The data is stored is AzureSql Database.

Kindly help!

2replies Oldest first
  • Oldest first
  • Newest first
  • Active threads
  • Popular
  • Hi Alok,
    Before you fit the trainingData, could you please execute the following commands for me :


    I suspect that the number of processes of your trainingData is too large. 
    As a result, the DAG size (Directed Acyclical Graph - a logical flow of operations constructed by spark) for your data becomes too large to handle and you may end up getting the following error -
    Py4JJavaError: An error occurred while calling o903.fit.

    To fix this issue, I would recommend you to either use function checkpoint() or cache() before you fit your trainingData. 

    Here is how you can do it using Incorta notebook or MV:

    //Define SparkContext and create the checkpoint
    sc = spark.sparkContext

    Before, fitting your trainingData, apply checkpoint():

    # Now, check the size of your DAG
    # Displays the  length of physical plan

    In my case, I see the following:

    == Parsed Logical Plan ==
    LogicalRDD [id#339L, text#340, label#341], false
    == Analyzed Logical Plan ==
    id: bigint, text: string, label: double
    LogicalRDD [id#339L, text#340, label#341], false
    == Optimized Logical Plan ==
    LogicalRDD [id#339L, text#340, label#341], false
    == Physical Plan ==
    Scan ExistingRDD[id#339L,text#340,label#341]

    By using the function checkpoint() before fitting your data, you would shrink the DAG and hence you should be getting rid of Py4JJavaError.

    Also, thanks to medium post by @kittipatkampa - By converting your trainingData dataFrame into RDD (Resilient Distributed Dataset) and then back to DataFrame again will also shrink the DAG considerably.

    Again, I have tried this on our Incorta notebook and this works like a charm:

    trainingData = spark.createDataFrame(trainingData.rdd, schema=trainingData.schema)
    # Fit the pipeline to training documents.
    model = pipeline.fit(trainingData)

    Hope one of the above techniques should help in resolving your issue. 

    Like 1
  • Hi   Nithish , they ask me no module spark. where should I import it?

Like Follow
  • 3 wk agoLast active
  • 2Replies
  • 259Views
  • 3 Following

Product Announcement

Incorta 5 is now Generally Available