Issue with converting a Pandas DataFrame to a Spar...

dylanwan · ‎11-15-2023

Symptoms

You receive the following error when converting a Pandas DataFrame to a Spark DataFrame in a PySpark MV (materialized view):

- INC_03070101: Transformation error Error 'DataFrame' object has no attribute 'iteritems' AttributeError : 'DataFrame' object has no attribute 'iteritems'

Diagnosis

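The iteritems() method was deprecated in Pandas 1.5.0 and removed in Pandas 2.0.0, which was released in April 2023. Its replacement is the items() method. However, older Spark versions still call iteritems() inside spark.createDataFrame(), so the conversion fails with the AttributeError shown above.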
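To check whether an environment falls into the affected combination, a version check along the following lines may help. This is only a sketch: the helper names (version_tuple, likely_affected, installed_version) are hypothetical, the Spark 3.4.1 threshold is taken from the Solutions section of this article, and it assumes both packages were installed through pip so that their metadata is visible. If PySpark is bundled with the Spark installation instead, the lookup returns None.

```python
import re
from importlib import metadata


def version_tuple(version):
    """Turn a version string such as '3.4.1' into a tuple of ints, (3, 4, 1)."""
    return tuple(int(part) for part in re.findall(r"\d+", version)[:3])


def likely_affected(pandas_version, spark_version):
    """True when Pandas no longer has iteritems() and Spark predates 3.4.1.

    The 3.4.1 threshold comes from this article, not from a Spark changelog.
    """
    return (version_tuple(pandas_version) >= (2, 0, 0)
            and version_tuple(spark_version) < (3, 4, 1))


def installed_version(package):
    """Return the installed version of `package`, or None if it is not found."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


pandas_v = installed_version("pandas")
spark_v = installed_version("pyspark")
if pandas_v and spark_v:
    print("pandas", pandas_v, "pyspark", spark_v,
          "affected:", likely_affected(pandas_v, spark_v))
else:
    print("pandas or pyspark metadata not found; check the versions manually")
```

Pre-release suffixes such as "rc1" are reduced to the numeric release, so the result is a rough indication, not a guarantee.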

Solutions

- Downgrade Pandas to an earlier version, such as 1.5.3.
- Upgrade Spark to version 3.4.1 or later.
- Rewrite your code according to the following example:

import pandas as pd 

month_data = {
    'January': 1,
    'February': 2,
    'March': 3,
    'April': 4,
    'May': 5,
    'June': 6,
    'July': 7,
    'August': 8,
    'September': 9,
    'October': 10,
    'November': 11,
    'December': 12
}
pdf = pd.DataFrame(month_data.items(), columns=['Month', 'Month_Number'])
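# Month_Number becomes the Pandas index. Note that spark.createDataFrame()
# copies columns only, so call pdf.reset_index() first if Month_Number
# should appear as a column in the MV.
pdf.set_index('Month_Number', inplace=True)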

# Add this line, before you call createDataFrame()
pdf.iteritems = pdf.items

df = spark.createDataFrame(pdf)

save(df)

The sample code above creates an MV that lists the months, with Month_Number set as the index so that it can be used for sorting. To reproduce the problem, it first builds a Python dictionary and turns it into a Pandas DataFrame.

After we added the line marked by the comment in the code (pdf.iteritems = pdf.items), the issue went away.
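A slightly more defensive variant of the same workaround is sketched below. It adds the alias only when iteritems is missing, so it does nothing on older Pandas versions. The helper name add_iteritems_alias is hypothetical. Applying it to the pd.DataFrame class instead of a single instance is also an assumption about what is convenient, not something the original example does.

```python
def add_iteritems_alias(cls):
    """Give `cls` an `iteritems` attribute that points at `items`, if missing.

    Returns True when the alias was added and False when `iteritems`
    already existed (for example on Pandas 1.x).
    """
    if hasattr(cls, "iteritems"):
        return False
    cls.iteritems = cls.items
    return True


# Intended use in a PySpark MV (assumes pandas is importable as pd and
# that `spark` and `pdf` exist as in the example above):
#   import pandas as pd
#   add_iteritems_alias(pd.DataFrame)
#   df = spark.createDataFrame(pdf)
```

Because the alias is set on the class, every Pandas DataFrame created in the same session picks it up, so the line does not have to be repeated before each createDataFrame() call.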

The issue can be seen in the Incorta Data Profiler data application. The workaround is to downgrade the Pandas version for now.