04-12-2025 10:32 PM - edited 04-12-2025 10:33 PM
Hi team,
I am training a Random Forest (RF) model in Incorta. Even after splitting the data as suggested below, I am getting the following error:
Transformation error 25/04/13 05:28:44 ERROR Instrumentation: org.apache.spark.SparkException: ML algorithm was given empty dataset. Error An error occurred while calling o341.fit.
%pyspark
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.sql.functions import col
# Load dataset
df_train = read("machine_learning.train_data")
df_test = read("machine_learning.Test_data")
# Define features and label
feature_cols = ['Temperature', 'Pressure', 'Vibration_Level', 'Humidity', 'Power_Consumption']
label_col = 'Failure_Status'
# Assemble features into a single vector (one assembler reused for both frames)
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
# A random split is not needed here, since separate train and test tables
# are already loaded above.
# train_data, test_data = df_vector.randomSplit([0.8, 0.2], seed=42)
df_train_vector = assembler.transform(df_train).select("features", col(label_col).cast("int").alias(label_col))
df_test_vector = assembler.transform(df_test).select("features", col(label_col).cast("int").alias(label_col))
# Train Random Forest model
rf = RandomForestClassifier(labelCol=label_col, featuresCol="features", numTrees=100, seed=42)
rf_model = rf.fit(df_train_vector)
# Predict on test set
predictions = rf_model.transform(df_test_vector)
# Evaluate accuracy
# evaluator = MulticlassClassificationEvaluator(labelCol=label_col, predictionCol="prediction", metricName="accuracy")
# accuracy = evaluator.evaluate(predictions)
# print("Accuracy:", accuracy)
# Extract predictions and actual labels
# y_pred = predictions.select("prediction").rdd.flatMap(lambda x: x).collect()
# y_true = predictions.select(label_col).rdd.flatMap(lambda x: x).collect()
# Print classification report
# print(classification_report(y_true, y_pred))
# Combine results for inspection
# results = predictions.select(*feature_cols, label_col, col("prediction").alias("Predicted_Failure_Status"))
# Show first 10 rows
predictions.show(10)
save(predictions)
Can anyone help me figure out what the issue is here?
04-14-2025 11:05 AM
This issue has been resolved by setting the property spark.dataframe.sampling.enabled to false.
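For anyone hitting the same error: the property name above comes from this answer, and where it is applied can vary by Incorta version (it may belong in the server/notebook Spark configuration). If it can be set from the notebook session, a sketch would look like the following, assuming a live SparkSession named spark as in Incorta %pyspark notebooks:

```python
# %pyspark
# Disable dataframe sampling so read() returns the full dataset instead of
# a (possibly empty) sample. Property name taken from the accepted answer;
# whether it is honored at the session level is version-dependent.
spark.conf.set("spark.dataframe.sampling.enabled", "false")
```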
04-16-2025 01:55 PM
Glad you were able to figure it out! This would have been my suggestion.
