Generate profile reports using Pandas Profiling pa...

suxinji · ‎05-17-2022

Overview

Pandas Profiling is an open-source Python package that produces an exploratory data analysis (EDA) report for a pandas DataFrame in just a few lines of code.

In this article, you'll learn how to generate profile reports using the Pandas Profiling package in an Incorta Notebook.

Solution

1. Install the pandas profiling package

# install with pip
pip install pandas-profiling
# or install the latest version directly from GitHub
pip install https://github.com/ydataai/pandas-profiling/archive/master.zip

For more information on how to install the package in Incorta Cloud, please review this article.
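
Before moving on, you may want to confirm that the package is visible to the notebook's Python interpreter. The following is a minimal sketch that uses only the standard library; the helper name `is_installed` is made up for this example and is not part of Incorta or pandas-profiling.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found by the current interpreter."""
    return importlib.util.find_spec(module_name) is not None


# The pip package is "pandas-profiling", but the import name uses an underscore.
if is_installed("pandas_profiling"):
    print("pandas_profiling is available")
else:
    print("pandas_profiling was not found; install it before continuing")
```

If the check reports that the module is missing, repeat the installation step in the same environment that runs the notebook.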

2. Generate the report

In an Incorta Notebook, you can use the Zeppelin HTML display system to render HTML content as the notebook's output. When the output begins with the '%html' directive, Zeppelin treats it as HTML.

from pandas_profiling import ProfileReport
# read the data
df = read("Data_Profiler.Titanic_train")
# convert the Spark DataFrame to a pandas DataFrame
data = df.toPandas()
# generate the report
profile = ProfileReport(data, title="Pandas Profiling Report", explorative=True)
# the '%html' prefix makes Zeppelin treat the output as HTML
# comment out the next line before saving the notebook
print("%html " + profile.to_html())
save(df)
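
To see the '%html' directive in isolation, here is a small sketch that does not need pandas-profiling at all. The helper `as_zeppelin_html` is a hypothetical name introduced for illustration; all it does is add the prefix that the report code above adds by hand.

```python
import html


def as_zeppelin_html(fragment: str) -> str:
    """Prefix an HTML fragment with the '%html' directive."""
    return "%html " + fragment


# Escape any text that is not meant to be interpreted as markup.
title = html.escape("Titanic <train> sample")

# Printed output that starts with '%html' is rendered as HTML instead of plain text.
print(as_zeppelin_html(f"<h3>{title}</h3><p>Rendered as HTML.</p>"))
```

The same pattern applies to the profile report: `profile.to_html()` returns the report as one HTML string, and the prefix tells the notebook how to display it.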


Tips:

1. Make sure the output is treated as HTML before you run the paragraph. You can then link to the paragraph and view the profile report as HTML.

2. To save memory, comment out the line that prints the HTML output before committing:

# comment out this line before saving
print("%html " + profile.to_html())

Note that the output is saved together with the notebook. If the output is too large, you may get an error message when you save the notebook. In that case, remove the output from the notebook before you save it.
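
As an alternative to commenting the print line in and out by hand, you could route the output through a small guard. This is only a sketch under stated assumptions: `render_report`, `SHOW_REPORT`, and `MAX_HTML_CHARS` are hypothetical names, and the size limit is an arbitrary illustrative value, not a documented Incorta threshold. `FakeReport` stands in for a real `ProfileReport` so the example runs on its own.

```python
SHOW_REPORT = True          # set to False before saving the notebook
MAX_HTML_CHARS = 5_000_000  # illustrative limit only, not an Incorta setting


def render_report(report, show=True, max_chars=MAX_HTML_CHARS):
    """Return the text to print for any object that exposes to_html()."""
    if not show:
        return "Report output is disabled."
    body = report.to_html()
    if len(body) > max_chars:
        return f"Report is {len(body):,} characters long; too large to embed."
    return "%html " + body


class FakeReport:
    """Stand-in for a ProfileReport, used only to demonstrate the guard."""

    def to_html(self):
        return "<h1>demo</h1>"


print(render_report(FakeReport(), show=SHOW_REPORT))
```

With a real report you would call `print(render_report(profile, show=SHOW_REPORT))` and flip the flag before saving. If the report itself is too large, pandas-profiling also accepts `minimal=True` in `ProfileReport(...)`, which turns off the more expensive computations and generally yields a smaller report.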