on 08-16-2022 03:17 PM - edited on 12-12-2023 03:57 PM by Tristan
This article introduces you to Incorta Data Profiler. You will learn what an Incorta Profile is, why you should use Incorta Data Profiler, and what the steps are for analyzing your data with the Data Profiler dashboard.
Incorta Data Profiler is an Incorta data app that you can use to examine your data before you use it within Incorta. It describes the structure of the data and summarizes its content, which helps you gain insight into data quality. It generates descriptive statistics for columns and shows their distributions and interrelationships.
First, run the schema refresh job to gather summary data. Then view the data profiling results in an Incorta dashboard.
Each run of the data profiler summarizes the data profile of a single table. You can run the data profiler multiple times for different tables, but you can only look at the profile data of one table at a time.
You can examine profile data at the table level, which breaks down the number of columns in each category: numeric, categorical, and date. The profile also provides standard per-column metrics such as missing values, zero counts, and unique values.
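To make the table-level metrics concrete, here is a minimal sketch in plain Python. It is not Incorta's implementation; the function name `profile_table`, the sample rows, and the rule used to classify a column are assumptions made for illustration.

```python
from datetime import date


def profile_table(rows):
    """Summarize a table given as a list of dicts (column name -> value)."""
    columns = list(rows[0].keys()) if rows else []
    summary = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None]
        # Classify the column by the type of its non-missing values
        # (an assumed rule; the real profiler may classify differently).
        if present and all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
        ):
            kind = "numeric"
        elif present and all(isinstance(v, date) for v in present):
            kind = "date"
        else:
            kind = "categorical"
        summary[col] = {
            "kind": kind,
            "missing": len(values) - len(present),
            "zeros": sum(1 for v in present if kind == "numeric" and v == 0),
            "unique": len(set(present)),
        }
    return summary


# Hypothetical sample data.
rows = [
    {"order_id": 1, "amount": 0.0, "region": "East", "order_date": date(2023, 1, 5)},
    {"order_id": 2, "amount": 25.5, "region": None, "order_date": date(2023, 1, 6)},
    {"order_id": 3, "amount": None, "region": "East", "order_date": date(2023, 1, 6)},
]
summary = profile_table(rows)

# Count how many columns fall into each category.
kind_counts = {}
for info in summary.values():
    kind_counts[info["kind"]] = kind_counts.get(info["kind"], 0) + 1
print(kind_counts)
```

The per-column dictionary corresponds to the metrics listed above, and `kind_counts` corresponds to the breakdown of columns by category.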
For numeric data, you can view descriptive statistics. These include average, maximum, minimum, standard deviation, and variance. You can see the histogram in the detail tab.
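As a rough illustration of these numeric statistics, the sketch below computes them with Python's standard `statistics` module and builds equal-width histogram counts. The helper names are hypothetical, and the choice of sample (rather than population) standard deviation and variance is an assumption, since the article does not say which one the profiler reports.

```python
import statistics


def describe_numeric(values):
    """Return descriptive statistics for a numeric column, ignoring missing values."""
    clean = [v for v in values if v is not None]
    return {
        "count": len(clean),
        "mean": statistics.fmean(clean),
        "min": min(clean),
        "max": max(clean),
        # Sample statistics (n - 1 denominator); an assumption for this sketch.
        "stddev": statistics.stdev(clean),
        "variance": statistics.variance(clean),
    }


def histogram(values, bins=5):
    """Count values in equal-width bins between the column minimum and maximum."""
    clean = [v for v in values if v is not None]
    lo, hi = min(clean), max(clean)
    width = (hi - lo) / bins or 1  # avoid a zero width when all values are equal
    counts = [0] * bins
    for v in clean:
        # The maximum value is placed in the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts


# Hypothetical sample column.
amounts = [2, 4, 4, 4, 5, 5, 7, 9, None]
stats = describe_numeric(amounts)
bin_counts = histogram(amounts, bins=4)
print(stats)
print(bin_counts)
```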
For categorical data, you can view data items by frequency.
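A frequency view of a categorical column can be sketched with `collections.Counter`. The function name `value_frequencies` and the sample values are hypothetical, and excluding missing values from the counts is an assumption of this sketch.

```python
from collections import Counter


def value_frequencies(values, top_n=10):
    """Return (value, count, share) tuples, most frequent first, ignoring missing values."""
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    return [(value, n, n / total) for value, n in counts.most_common(top_n)]


# Hypothetical sample column.
regions = ["East", "West", "East", None, "North", "East", "West"]
freq = value_frequencies(regions)
for value, n, share in freq:
    print(f"{value}: {n} ({share:.0%})")
```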
The Data Profiler also provides information about column relationships. It calculates correlation, which shows how numerical columns are related to each other. In data science and machine learning projects, this can help you determine whether features can replace each other, or how strongly a feature is related to a label and therefore how useful it may be for prediction.
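The article does not state which correlation measure the profiler uses. The sketch below assumes the Pearson coefficient, the most common choice for numerical columns, and computes a pairwise matrix in plain Python. The function names and sample columns are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two columns, using only rows where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return float("nan")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return float("nan")  # a constant column has no defined correlation
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return a nested dict of pairwise correlations for a dict of numeric columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical sample columns.
columns = {
    "quantity": [1, 2, 3, 4],
    "revenue": [10, 20, 30, 40],
    "discount": [4, 3, 2, 1],
}
matrix = correlation_matrix(columns)
print(matrix["quantity"])
```

A value near 1 or -1, as between `quantity` and `revenue` here, suggests that one of the two features carries little information beyond the other.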
Incorta Data Profiler is not the only data profiling tool you can use with Incorta. You can see similar content with the Pandas-Profiling tool or the Sweetviz profiler. The following features of Incorta Data Profiler will help you understand its value.
Incorta Data Profiler is a scalable data profiler. All the calculations and data summarization are processed in Apache Spark. It leverages Incorta's Materialized View and Data enrichment transformation framework.
Incorta Data Profiler helps you focus by highlighting what is significant: columns with high cardinality, which are potential primary key columns of the table; columns with many missing or zero values; and columns with outliers.
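The kind of highlighting described here can be sketched as a set of per-column flags. The thresholds, the flag names, and the use of the 1.5 × IQR rule for outliers are all assumptions made for illustration; the article does not say how the profiler decides what counts as significant.

```python
import statistics


def flag_column(values, unique_ratio=0.95, missing_ratio=0.3, zero_ratio=0.3):
    """Return a list of flags for one column. All thresholds are hypothetical."""
    total = len(values)
    if total == 0:
        return []
    present = [v for v in values if v is not None]
    numeric = [
        v for v in present if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    flags = []
    # Nearly every row has a distinct value: a possible primary key.
    if len(set(present)) / total >= unique_ratio:
        flags.append("high_cardinality")
    if (total - len(present)) / total >= missing_ratio:
        flags.append("many_missing")
    if sum(1 for v in numeric if v == 0) / total >= zero_ratio:
        flags.append("many_zeros")
    # Outlier check with the 1.5 * IQR rule, only for fully numeric columns.
    if len(numeric) >= 4 and len(numeric) == len(present):
        q1, _, q3 = statistics.quantiles(numeric, n=4)
        iqr = q3 - q1
        if any(v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr for v in numeric):
            flags.append("has_outliers")
    return flags


# Hypothetical sample columns.
print(flag_column([1, 2, 3, 4, 5, 6, 7, 8]))
print(flag_column([10, 12, 11, 13, 12, 11, 100]))
print(flag_column([None, None, "a", "a"]))
```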
Incorta Data Profiler does not just calculate correlation numbers. When you view correlation information, you will see that the columns with high correlation are highlighted in a heat map.
The Incorta Profiler Dashboard shows the table summary in one tab and the details in the other tab.
The Incorta Data Profiler Dashboard provides the following features:
First, select the table whose data profile you want to view.