This article will show you how to adjust the output of the Incorta data profiler with different parameters.
Incorta Profiler includes five materialized views (MVs), listed below. Each one calls a different API that extracts or summarizes the data, and you can call each API with different options. This article goes over them one by one.
table_info
summary_table
correlation_table
freq_item
histogram
Calling DataProfiler API
Basic Table Info
This MV shows the basic information about the table structure.
Correlation: the correlation between column name 1 and column name 2
In the correlation_table table, we call the numerical_cols and correlation_table functions. Only columns with numerical data types are used.
numerical_cols function
Parameter:
input_df: data frame
You can review the columns that will be used and further remove the unwanted columns from the list.
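The review step can be pictured with a small plain-Python sketch. It does not call the Incorta API: the schema, the set of type names treated as numerical, and the helper select_numeric_columns are all assumptions made for illustration.

```python
# Hypothetical schema: (column name, data type) pairs, as a data frame might report them.
schema = [
    ("order_id", "int"),
    ("customer_name", "string"),
    ("amount", "double"),
    ("discount", "double"),
    ("order_date", "date"),
]

# Assumed set of type names treated as numerical; the real profiler may differ.
NUMERIC_TYPES = {"int", "bigint", "smallint", "float", "double", "decimal"}


def select_numeric_columns(schema):
    """Return the names of columns whose data type is numerical."""
    return [name for name, dtype in schema if dtype in NUMERIC_TYPES]


numeric = select_numeric_columns(schema)

# Drop columns that are numerical but not meaningful to correlate, such as IDs.
unwanted = {"order_id"}
selected = [name for name in numeric if name not in unwanted]
print(selected)
```

Removing identifier-like columns before computing correlations keeps the output focused on measures that are worth comparing.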
correlation_table function
Required parameters:
input_df: data frame
table_name: the table name will be shown in the Data Profiler dashboard as a mandatory filter
Optional parameters:
pool_size: the size of the thread pool, which maintains multiple threads for concurrent execution. The default is 10, and the value must be between 1 and 40. If you have a large data set and multiple cores, you can set this parameter to a larger value.
column_list: a list of column names used to filter the output
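To make the roles of pool_size and column_list concrete, here is a minimal plain-Python sketch of pairwise correlation computed on a thread pool. It is not the Incorta implementation: correlation_rows, the output keys, the use of Pearson correlation, and the choice of one row per unordered column pair are assumptions. Only the parameter names, the default of 10, and the 1 to 40 range come from the description above.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # undefined when a column is constant
    return cov / sqrt(var_x * var_y)


def correlation_rows(columns, table_name, pool_size=10, column_list=None):
    """Hypothetical stand-in: one output row per pair of numerical columns."""
    if not 1 <= pool_size <= 40:
        raise ValueError("pool_size must be between 1 and 40")
    # column_list, when given, restricts which columns appear in the output.
    names = [c for c in columns if column_list is None or c in column_list]
    pairs = list(combinations(names, 2))
    # Each pair is independent, so the pairs can be computed concurrently.
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        values = list(pool.map(lambda p: pearson(columns[p[0]], columns[p[1]]), pairs))
    return [
        {"table_name": table_name, "column_1": a, "column_2": b, "correlation": v}
        for (a, b), v in zip(pairs, values)
    ]


data = {
    "amount": [10.0, 20.0, 30.0, 40.0],
    "tax": [1.0, 2.0, 3.0, 4.0],
    "discount": [4.0, 3.0, 2.0, 1.0],
}
rows = correlation_rows(data, "SALES.ORDERS", pool_size=4)
filtered = correlation_rows(data, "SALES.ORDERS", column_list=["amount", "tax"])
```

The number of pairs grows quadratically with the number of columns, which is why a larger pool can help on wide tables when enough cores are available.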
Frequency Table
This MV shows the frequency of each value in a column.
freq_pct: the percentage of rows that contain the value, out of the total number of rows
In the freq_item table, we call the freq_item_multi_cols function.
Required parameters:
input_df: data frame
table_name: the table name will be shown in the Data Profiler dashboard as a mandatory filter
Optional parameters:
limit: limits the number of values shown in the output. Values are sorted by frequency, and the values with the lowest frequency are dropped. The default is 50, and the value can be between 1 and 500.
column_list: a list of column names used to filter the output
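The effect of limit and the meaning of freq_pct can be sketched in plain Python. This is an illustration rather than the Incorta code: freq_items, the output keys other than freq_pct, and the 0 to 100 scale for the percentage are assumptions.

```python
from collections import Counter


def freq_items(values, table_name, column_name, limit=50):
    """Hypothetical stand-in: the most frequent values of one column."""
    if not 1 <= limit <= 500:
        raise ValueError("limit must be between 1 and 500")
    total = len(values)
    counts = Counter(values)
    # most_common sorts by descending frequency and keeps only the top `limit` values.
    return [
        {
            "table_name": table_name,
            "column_name": column_name,
            "value": value,
            "freq": count,
            "freq_pct": 100.0 * count / total,  # assumed 0-100 scale
        }
        for value, count in counts.most_common(limit)
    ]


countries = ["US"] * 5 + ["CA"] * 3 + ["MX"] * 2
top = freq_items(countries, "SALES.ORDERS", "country", limit=2)
for row in top:
    print(row["value"], row["freq"], row["freq_pct"])
```

With limit=2, the least frequent value ("MX") is dropped, and each freq_pct is still computed against all 10 rows rather than against the rows that remain.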
Histogram Table
This MV shows the value distribution of columns. You can go through the histogram table to see the row count for each range of values.
bucket: the bucket value (10 buckets in total), which defines the range of values
row_count: the number of rows that fall within the range
In the histogram table, we call the histogram_table function.
Required parameters:
input_df: data frame
table_name: the table name will be shown in the Data Profiler dashboard as a mandatory filter
Optional parameters:
column_list: a list of column names used to filter the output
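As a rough picture of how bucket and row_count relate, the sketch below splits a numeric column into 10 buckets and counts the rows in each. How the profiler actually draws bucket boundaries is not stated here, so the equal-width split between the minimum and maximum is an assumption, as are the function name histogram and the tuple form of bucket.

```python
def histogram(values, num_buckets=10):
    """Hypothetical stand-in: equal-width buckets between min and max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets
    counts = [0] * num_buckets
    for v in values:
        if width == 0:
            idx = 0  # every value is identical, so use a single bucket
        else:
            # The maximum value would land at index num_buckets; clamp it into the last bucket.
            idx = min(int((v - lo) / width), num_buckets - 1)
        counts[idx] += 1
    return [
        {"bucket": (lo + i * width, lo + (i + 1) * width), "row_count": counts[i]}
        for i in range(num_buckets)
    ]


amounts = list(range(0, 101))  # 101 rows with values 0..100
hist = histogram(amounts)
for row in hist:
    print(row["bucket"], row["row_count"])
```

The row counts always add up to the number of input rows, which is a quick sanity check when reading the histogram table.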
Incremental Load
Multiple Datasets
Incorta Profiler lets you add multiple tables and switch between them.
In your schema, enable Incremental, and add your tables to your scripts. Call the DataPrepAPI for each table and use unionAll to combine the results into the same MV.
For example, replace the highlighted sample code with your own tables in the format [SCHEMANAME].[TABLENAME], and use unionAll to combine the results into the data frame that is saved.
Go to your dashboard and choose the table you would like to explore.
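The pattern of profiling each table and combining the results can be sketched in plain Python. The function profile_table and the table names are hypothetical placeholders, not Incorta calls. In a PySpark MV, the concatenation step would be done with the data frame's unionAll method instead of list.extend.

```python
def profile_table(table_name, rows):
    """Hypothetical profiler call: returns summary rows tagged with table_name."""
    return [{"table_name": table_name, "row_count": len(rows)}]


# Placeholder tables keyed by [SCHEMANAME].[TABLENAME].
tables = {
    "SALES.ORDERS": [{"id": 1}, {"id": 2}, {"id": 3}],
    "SALES.CUSTOMERS": [{"id": 10}, {"id": 11}],
}

# Profile each table, then combine the results into a single output.
combined = []
for name, rows in tables.items():
    combined.extend(profile_table(name, rows))

# The dashboard filter on table_name then selects one table's results.
orders_only = [r for r in combined if r["table_name"] == "SALES.ORDERS"]
print(orders_only)
```

Because every output row carries its table_name, results from different tables can share one MV and still be separated by the mandatory dashboard filter.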