ML algorithm was given empty dataset.

shaikshoaib
Partner

Hi team,

I am trying to build a Random Forest (RF) model in Incorta.

Even after splitting the data into training and test sets as suggested in the article below, I am getting the following error:

https://community.incorta.com/t5/data-schemas-knowledgebase/split-a-dataset-into-training-and-testin...

Transformation error 25/04/13 05:28:44 ERROR Instrumentation: org.apache.spark.SparkException: ML algorithm was given empty dataset.
Error An error occurred while calling o341.fit.

%pyspark
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.sql.functions import col


# Load dataset
df_train = read("machine_learning.train_data")
df_test = read("machine_learning.Test_data")

# Define features and label
feature_cols = ['Temperature', 'Pressure', 'Vibration_Level', 'Humidity', 'Power_Consumption']
label_col = 'Failure_Status'

# Assemble features into a single vector.
# (The data is loaded pre-split, so the earlier in-notebook split is no
# longer needed:
#   train_data, test_data = df_vector.randomSplit([0.8, 0.2], seed=42))
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
df_train_vector = assembler.transform(df_train).select("features", col(label_col).cast("int").alias(label_col))
df_test_vector = assembler.transform(df_test).select("features", col(label_col).cast("int").alias(label_col))
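# Note: VectorAssembler's default handleInvalid="error" raises on null
# feature values; handleInvalid="skip" would silently drop those rows
# instead, which can shrink a small dataset to zero rows.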

# Train Random Forest model
rf = RandomForestClassifier(labelCol=label_col, featuresCol="features", numTrees=100, seed=42)
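# fit() is the call that raises "ML algorithm was given empty dataset"
# when the training DataFrame has no rows.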
rf_model = rf.fit(df_train_vector)

# Predict on test set
predictions = rf_model.transform(df_test_vector)

# Evaluate accuracy
# evaluator = MulticlassClassificationEvaluator(labelCol=label_col, predictionCol="prediction", metricName="accuracy")
# accuracy = evaluator.evaluate(predictions)
# print("Accuracy:", accuracy)

# Extract predictions and actual labels
# y_pred = predictions.select("prediction").rdd.flatMap(lambda x: x).collect()
# y_true = predictions.select(label_col).rdd.flatMap(lambda x: x).collect()

# Print classification report (requires: from sklearn.metrics import classification_report)
# print(classification_report(y_true, y_pred))

# Combine results for inspection
# results = predictions.select(*feature_cols, label_col, col("prediction").alias("Predicted_Failure_Status"))


# Show first 10 rows
predictions.show(10)
save(predictions)
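
For reference, one way to narrow this down is to count the rows that actually reach fit(). This is a minimal diagnostic sketch reusing the df_train / df_test / df_train_vector variables from the script above; a 0 in any of these counts means the model is handed an empty dataset before training even starts:

%pyspark
# Diagnostic sketch: verify that rows survive each stage before fit().
print("train rows:", df_train.count())
print("test rows:", df_test.count())

# Rows can also be lost between read() and fit(): the cast to int turns
# unparseable labels into nulls, and VectorAssembler raises on null
# feature values by default.
print("assembled train rows:", df_train_vector.count())
print("null labels:", df_train_vector.filter(col(label_col).isNull()).count())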

Can anyone help me figure out what the issue is here?


shaikshoaib
Partner

This issue has been resolved by setting the property spark.dataframe.sampling.enabled to false.
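
For context on why this helps: while a materialized view or notebook is being edited, Incorta can feed read() a sampled subset of the source table, and a sample drawn from a small table can come back with zero rows, which is exactly what fit() then complains about. Where the property is set (for example, among the materialized view's Spark properties) may vary by Incorta version. A minimal sanity check, assuming the same read() calls as the original script, that the full data is arriving once sampling is disabled:

%pyspark
# With sampling disabled, both reads should return full row counts
# rather than a possibly-empty sample.
print("train rows:", read("machine_learning.train_data").count())
print("test rows:", read("machine_learning.Test_data").count())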

JoeM
Community Manager

Glad you were able to figure it out! This would have been my suggestion.