dylanwan
Employee

Overview

This article introduces you to Incorta Data Profiler.  You will learn what an Incorta Profile is, why you should use Incorta Data Profiler, and what the steps are for analyzing your data with the Data Profiler dashboard.

What is Incorta Data Profiler? 

Incorta Data Profiler is an Incorta data app that you can use to examine your data before you use it within Incorta. It reports the structure of the data and a content summary that helps you gain insight into data quality. It generates descriptive statistics for columns and shows their distributions and interrelationships.

First, run the schema refresh job to gather summary data. Then view the data profiling results in an Incorta dashboard.

Table Based Profiling

Each run of the data profiler summarizes the data profile of a single table. You can run the data profiler multiple times for different tables, but you can view the profile data of only one table at a time.

We do not yet handle relationships between two tables. Comparing the data between two tables is on our roadmap, and this feature will be available at a later time.

From Table Level to Column Level

You can examine profile data at the table level, which provides a breakdown of the number of columns in each category: numeric, categorical, and date. The profile also provides standard metrics such as missing values, zero counts, and unique values.
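To make the table-level breakdown concrete, here is a minimal sketch in plain Python. It is not the Data Profiler's implementation (which runs in Apache Spark); the function names, the sample table, and the type-based classification rule are all assumptions made for illustration.

```python
from collections import Counter
from datetime import date
from typing import Any, Dict, List, Optional

Column = List[Optional[Any]]  # None stands in for a missing value


def is_number(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def column_category(values: Column) -> str:
    """Classify a column as 'date', 'numeric', or 'categorical' (assumed rule)."""
    present = [v for v in values if v is not None]
    if present and all(isinstance(v, date) for v in present):
        return "date"  # datetime is a subclass of date, so it is covered too
    if present and all(is_number(v) for v in present):
        return "numeric"
    return "categorical"  # mixed, text, or all-missing columns fall back here


def column_counts(values: Column) -> Dict[str, int]:
    """Count missing values, zeros, and distinct non-missing values."""
    present = [v for v in values if v is not None]
    return {
        "missing": len(values) - len(present),
        "zeros": sum(1 for v in present if is_number(v) and v == 0),
        "unique": len(set(present)),
    }


# Hypothetical sample table: column name -> list of values
table = {
    "amount": [10, 0, None, 25, 0],
    "region": ["east", "west", "east", None, "east"],
    "order_date": [date(2023, 1, 5), date(2023, 1, 6), None,
                   date(2023, 1, 6), date(2023, 1, 9)],
}

category_breakdown = Counter(column_category(col) for col in table.values())
amount_counts = column_counts(table["amount"])
print(dict(category_breakdown))
print(amount_counts)
```

The first result corresponds to the per-category column counts shown at the table level, and the second to the per-column metrics.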

For numeric data, you can view descriptive statistics: average, maximum, minimum, standard deviation, and variance. You can see the histogram in the detail tab.
For categorical data, you can view data items by frequency.
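As an illustration of these column-level statistics, the following sketch uses only the Python standard library. The function names and sample data are hypothetical, and the use of sample (n - 1) standard deviation and variance is an assumption; the article does not say which definition the Data Profiler uses.

```python
import statistics
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def describe_numeric(values: Sequence[float]) -> Dict[str, float]:
    """Return average, maximum, minimum, standard deviation, and variance."""
    return {
        "average": statistics.mean(values),
        "maximum": max(values),
        "minimum": min(values),
        # Sample (n - 1) statistics; this choice is an assumption.
        "std_dev": statistics.stdev(values),
        "variance": statistics.variance(values),
    }


def frequent_items(values: Sequence[str], top: int = 3) -> List[Tuple[str, int]]:
    """Return the most frequent items of a categorical column with their counts."""
    return Counter(values).most_common(top)


# Hypothetical sample columns
prices = [2, 4, 4, 4, 5, 5, 7, 9]
colors = ["red", "blue", "red", "green", "red", "blue"]

numeric_stats = describe_numeric(prices)
top_colors = frequent_items(colors)
print(numeric_stats)
print(top_colors)
```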

The profile also provides information about column relationships. It calculates the correlation, which shows how numeric columns are related to each other. In data science and machine learning projects, this can help you decide whether one feature can replace another, or how strongly a feature is related to a label and therefore how useful it may be for prediction.
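The article does not state which correlation measure is used. Assuming the common Pearson coefficient, a minimal sketch of the calculation for two numeric columns looks like this (the function name and sample columns are hypothetical):

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long numeric columns."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("columns must have the same length and at least 2 rows")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return covariance / math.sqrt(spread_x * spread_y)


# Hypothetical columns: 'total' is exactly twice 'quantity',
# while 'discount' falls as 'quantity' rises.
quantity = [1, 2, 3, 4]
total = [2, 4, 6, 8]
discount = [9, 7, 5, 3]

r_total = pearson(quantity, total)
r_discount = pearson(quantity, discount)
print(r_total, r_discount)
```

A coefficient close to 1 or -1 suggests that one column carries nearly the same information as the other, which is the situation in which one feature might replace another.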

Why Incorta Data Profiler? 

Incorta Data Profiler is not the only data profiling tool you can use with Incorta. You can see similar content in tools such as Pandas-Profiling or the Sweetviz profiler. The features below will help you understand the value of Incorta Data Profiler.

Scalable via Apache Spark

Incorta Data Profiler is a scalable data profiler. All calculations and data summarization are processed in Apache Spark. It leverages Incorta's Materialized View and data enrichment transformation framework.

Highlight with Conditional Formatting

Incorta Data Profiler helps you focus by highlighting what is significant. It shows which columns have high cardinality and are therefore potential primary key columns of the table, which columns have many missing or zero values, and which columns have outliers.
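The article does not describe how outliers are identified. One widely used rule, shown here purely as an assumption, flags values that lie more than 1.5 times the interquartile range (IQR) outside the first and third quartiles. The function name and sample data are hypothetical.

```python
import statistics
from typing import List, Sequence


def iqr_outliers(values: Sequence[float], k: float = 1.5) -> List[float]:
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed outlier rule)."""
    # quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical numeric column with one extreme value
response_times = [10, 12, 11, 13, 12, 11, 100]
outliers = iqr_outliers(response_times)
print(outliers)
```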

Incorta Data Profiler does not just calculate correlation numbers. When you view correlation information, you will see that the columns with high correlation are highlighted in a heat map.

Drill Down to the Column Details

You can see the table summary in one tab of the Incorta Data Profiler Dashboard and drill down to the column details in the other tab.

Incorta Data Profiler Dashboard

Features

The Incorta Data Profiler Dashboard provides the following features:

  • Table information (number of columns, number of rows, number of duplicates, and so on)
  • Columns that are unique
  • Columns with missing values
  • Columns with zero values
  • Columns with high cardinality (distinct count greater than 100)
  • String length
  • Columns with high correlation, that is, two columns that are closely correlated
  • Details for each column (including column detail, box plot, frequent items, and histogram)
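The column flags in this list can be sketched as simple rules. In the example below, only the high-cardinality threshold (distinct count greater than 100) comes from this article; the function name, the sample columns, and the exact definition of a unique column (no missing values and no repeated values) are assumptions.

```python
from typing import Any, Dict, List, Optional

HIGH_CARDINALITY_THRESHOLD = 100  # "distinct count greater than 100", per the list above


def flag_column(values: List[Optional[Any]]) -> Dict[str, bool]:
    """Compute highlight flags for one column; None marks a missing value."""
    present = [v for v in values if v is not None]
    distinct = len(set(present))
    return {
        # Assumed definition: every row has a value and no value repeats.
        "unique": len(values) > 0 and distinct == len(values),
        "has_missing": len(present) < len(values),
        "high_cardinality": distinct > HIGH_CARDINALITY_THRESHOLD,
    }


# Hypothetical columns
order_id = list(range(150))        # candidate key: all values distinct
status = ["open", "closed", None, "open"]

order_id_flags = flag_column(order_id)
status_flags = flag_column(status)
print(order_id_flags)
print(status_flags)
```

A column flagged as unique is a candidate key for the table, while a column flagged for missing values may need cleanup before you import it.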

Filter By Table Name

First, select the table whose data profile you want to view.

[Image: 2.gif]

Review the Highlighted Data

Unique columns are highlighted. You can use this information to define the key for the table.
Columns with missing or zero values are also highlighted, which indicates that you may need to handle the missing data before importing it into Incorta.

Drill from Summary Tab to Column Details

 
[Image: 3.gif]

 

Last update:
‎12-12-2023 03:57 PM