cancel
Showing results for 
Search instead for 
Did you mean: 
suxinji
Employee Alumni
Employee Alumni

Introduction

This article will show you how to adjust the output of the Incorta data profiler with different parameters. 

Incorta Profiler includes five materialized views. Each of them calls a different API that extracts or summarizes the data. You can call the APIs with different options. We will go over them one by one.  

  • table_info
  • summary_table
  • correlation_table
  • freq_item
  • histogram

Calling DataProfiler API

Basic Table Info

This MV shows the basic information about the table structure.

The output columns: 

['table_name', 'NUM_OF_ROWS', 'NUM_OF_COLUMNS', 'NUM_OF_DUPLICATES', 'NUM_OF_NUMERICAL_COLUMNS', 'NUM_OF_CATEGORICAL_COLUMNS', 'NUM_OF_DATETIME_COLUMNS']

In table_info MV, we called the table_info() function

Require parameters: 

  • input_df: data frame
  • table_name: the table name will be shown in the Data Profiler dashboard as a mandatory filter

Optional parameters: 

  • column_list: provide a list of columns name to filter the output. 
Spoiler
Reducing the columns can speed up the process of running the schema refresh for large data set.

suxinji_3-1659658133469.png

Summary Table

This MV calculates descriptive statistics and summary statistics of an entire data frame. 

The output columns: 

[‘table_name’, ‘column_name’, ‘count’, ‘mean’, ‘stddev’, ‘min’, ‘p25’, ‘p50’, ‘p75’, ‘max’, ‘missing_count’, ‘missing_pct’, ‘unique_count’, ‘unique_pct’, ‘dtype’, ‘variance’, ‘zero_count’, ‘zero_pct’, ‘kurt’, ‘skew’, ‘min_length’, ‘max_length’, ‘IQR’, ‘range’]

  • count: count the number of rows with not null value
  • stddev: standard deviation, a measure of the amount of variation or dispersion of a set of values
  • p25, p50, p75: the 25%, 50%, and 75% quartiles.  50% quartile is the same as median.
  • missing_count: number of rows with null value for the column
  • unique_count: number of distinct values 
  • zero_count: number of rows with the value zero in the column
  • min_length: the minimum length as the number of digits
  • max_length: the maximum length as number of digits
  • variance: a measure of how data points differ from the mean
  • zero_count: the number of zero values
  • kurt: kurtosis, a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution.
  • skew: skewness, a measure of the asymmetry of a distribution
  • IQR: interquartile range, IQR = 75 percentile - 25 percentile
  • range: maximum-minimum

In summary_table table, we called summary_table function

Require parameters: 

  • input_df: data frame 
  • table_name: the table name will be shown in the Data Profiler dashboard as a mandatory filter

Optional parameters: 

  • percentiles: provide a list of percentiles specified as a percentage (eg, 5%, 10%, etc)
  • column_list: provide a list of columns name to filter the output 

suxinji_0-1659656655787.png

Correlation Table

This MV shows each numeric column with the other numeric column's correlation. 

The output columns: 

['table_name', 'Feature1', 'Feature2', 'Correlation']

  • Feature1: the column name 1
  • Feature2: the column name 2
  • Correlation: the correlation between column name 1 and column name 2

In correlation_table table, we called numerical_cols and correlation_table functions.  Only the columns with the numerical data types will be used.

numerical_cols function

Parameter:

  • input_df: data frame

You can review the columns that will be used and further remove the unwanted columns from the list.

correlation_table function 

Require parameters: 

  • input_df: data frame
  • table_name: the table name will be shown in the Data Profiler dashboard as a mandatory filter

Optional parameters: 

  • pool_size: a thread pool maintains multiple threads for concurrent execution, pool size default is 10, can only between 1-40.  If you have a large data set and have multiple cores, you can adjust this parameter to a larger value.
  • column_list: provide a list of columns name to filter the output 

suxinji_1-1659657318751.png

Frequency Table

This MV shows the frequency of each value in a column. 

The output columns: 

['table_name', 'column_name', 'value', 'row_count', 'freq_pct']

  • value: a column value
  • row_count: the number of rows with the value
  • freq_pct: percentage of the rows with the the value over the total number of rows

In freq_item table, we called freq_item_multi_cols function

Require parameters: 

  • input_df: data frame
  • table_name: the table name will be shown in the Data Profiler dashboard as a mandatory filter 

Optional parameters: 

  • limit: Limit the number of the values shown in the output. The values are sorted by the frequency and the values with low frequency are dropped. The limit default is 50, and can be between 1-500
  • column_list: provide a list of columns name to filter the output 

suxinji_0-1659656167348.png

Histogram Table

This MV shows the value distribution of columns. You can go through the histogram table to see the row count for each range of values. 

The output columns: 

['table_name', 'column_name', 'bucket', 'row_count']

  • bucket: bucket value(10 buckets total) is for defining the range of values
  • row_count:  the number of rows that is within the range

In the histogram table, we called the histogram_table function

Require parameters: 

  • input_df: data frame
  • table_name: the table name will be shown in the Data Profiler dashboard as a mandatory filter

Optional parameters: 

  • column_list: provide a list of columns name to filter the output 

suxinji_2-1659658030139.png

Incremental Load

Multiple Datasets

Incorta Profiler let you to add multiple tables and switch between them if you have multiple tables.

In your schema, enable Incremental, and add your tables to your scripts. Call the DataPrepAPI for each table and use unionAll to combine the results into the same MV.

w.png

For example, replace the highlight sample codes following by format [SCHEMANAME].[TABLENAME], using the union all to save the data frame. 

eeee.png

Go to your Dashboard, and you can choose your the table you would like to explore.

ww.png

Related Materials

Comments
JoeM
Community Manager
Community Manager

This is great @suxinji !! Thanks for sharing the knowledge!

Best Practices Index
Best Practices

Just here to browse knowledge? This might help!

Version history
Last update:
‎08-19-2022 06:49 AM
Updated by: