Data Load Strategy

RichC · ‎03-08-2022

Introduction

This article discusses the procedures and best practices for Data Load Strategy.

What you should know before reading this article

We recommend that you be familiar with these Incorta concepts before exploring this topic further.

Applies to

These concepts apply to all releases of Incorta.

Let's Go

Physical Schema Design

Single or Multiple Schemas?

Put related tables which are frequently shown together within the same dashboard in the same schema.

When Incorta loads or refreshes data, it loads by schema. A job can be scheduled for a schema and all tables defined in the schema will be extracted and loaded. Joins are calculated as part of the job.

Should you put joined tables in one schema?

While tables that are frequently joined together should be placed in the same schema, cross schema joins are supported by Incorta. When a parent table is not updated frequently and many tables that are not closely related to each other are joined to the same parent table, it is fine to split these child tables into different schemas.

For example, the GL account structure does not change often and many functional areas have references to GL accounts. You can place the GL Account structure into a schema that is joined from two unrelated schemas that hold the child tables. The benefit of this strategy is to make the data load finish in a shorter period with a smaller number of tables in a data load job.

Is there a limit on the size of schema?

Generally speaking, it is a good idea to distribute tables into different schemas by areas of interest (or applications) and schemas smaller as a rule. This will make it easier to find tables you are looking for, and will reduce the total number of joins overall.

The size of your schemas is physically limited based on the hardware available to the loader service. The requirement is that primary key columns for all tables in the same schema have to fit in the available loader memory. This limitation should be considered if you have multiple tables with billions of rows in the same schema. This scenario might require a larger memory allocation to the loader service than is typical.

When a schema is refreshed in the loader service and synchronized into other services, the data from the impacted schema may be evicted from memory. The bigger the schema, the more space and time is required during this process.

Should you consider data sharing and data security during schema design?

Place tables that can be shared with the same group of users in the same schema. Schemas are shared with analyzers for building dashboards. Schemas are also shared among schema managers for maintaining schema definitions and loading and monitoring data load jobs.

The schema is the unit for data sharing. When a user is granted permission to access a schema, the user can see all tables. Tables cannot be shared individually. In the case of business schemas, business views cannot be shared individually.

If the data extracted into a table is being used in a data security filter, make sure that the table is refreshed as often as is acceptable to reflect the data security permission changes. Incorta data security filters are evaluated at runtime. When a data security filter involves a permission grant from a data source, data security changes will be delayed until the related table is refreshed in Incorta.

For example, you could extract your management hierarchy from a data source and use it to secure data in Incorta. This hierarchy could be used by Incorta to allow managers to view personal information about only their subordinates. Security in Incorta will not reflect data source security changes until the management hierarchy is refreshed from the source.

Should a schema contain data from multiple data sources?

Incorta schemas do not have to match the database schema or “database” defined in the source system. You can have one table source the data from data source A and have another table sourced from data source B.

Put tables that will be joined together or are closely related into the same schema. This way you can ensure data consistency.

There is a benefit top putting materialized views and their source tables in the same schema. In this situation, Incorta will sequence the materialized views to populate after the source tables have loaded, thus guaranteeing that the MVs have the freshest data. Even when the source tables are from different data sources, it is beneficial to put them into one Incorta schema.

Incorta also allows you to put data from different data source into the same table. You can create multiple data sets with the same structure with each data set connected to its own data source. These data sets will be executed by different workers and can be executed in parallel.

Here are constraints that you should consider:

If you extract data from one data set with incremental logic, you need to define incremental logic for all data sets. You cannot extract the data from one data set with incremental logic while extracting all data from another data set assuming the old data will be removed before loading.
When you are using the Incremental strategy of using the MAX value from the target, please note that the MAX value is from all data sets. If different data sets have different ranges and thus a different MAX value, do not use this method.
When the job is scheduled to run, Incorta will extract and refresh data for all data sets. You cannot schedule the data sets to be refreshed at different intervals.
Starting with release (4.8), different data sets may have different columns, but you should carefully select columns with the right column name or column alias to be mapped to the same target column
Only columns with the same data type will be map to the same target column. If data type is different in the source system, you will need to convert the data into an appropriate data type in the extraction query.

Incorta will enforce uniqueness based on the key defined and will discard the old records with the same key and replace it with newly extracted records, regardless of which data set the data is extracted from.

Use a column with a constant in the extraction query to indicate where the data comes from. This can help with debugging if you face a data issue in loading a table with multiple sources.

Use Multiple data sets to break up a large SQL if the Incorta chunking feature is not available.

Chunking

Chunking is supported when the data source is Oracle or MySQL. You can use the Chunking feature to break a large extraction query into smaller chunks which will be executed in parallel. Here are the considerations:

You should have sufficient workers to run many extraction queries at the same time
You should have enough database connection pool to accept multiple queries running against the database source

Incremental Logic

Incremental logic helps to reduce the volume to be extracted and processed and, in general, helps the performance of refreshing data.

How incremental logic works in Incorta

When the “Last Successful Extraction Time” method is being used, Incorta stores the time when the extraction of a table starts at the table level in a file called time.log file on the server under the Incorta tenant folder, one file for each schema. The file is updated when the extraction of the table finishes successfully.

The value is stored in the epoch time format independent from the time zone. The value is used in the next extraction of the table to replace the "?" specified in the SQL statement. It becomes the bind variable to the JDBC and is converted to the timestamp value of the specific database platform.

Alternatively, depending on the connector, the method Maximum Value of [LAST_UPDATE_DATE] method may be available.

The LAST_UPDATE_DATE is a column that can be used to identify changed or newly created records. It has to be selected as part of the Extraction query. Incorta will store its maximum value, that is the last record(s) that was extracted into Incorta storage and use its value to get the data that are changed or created after that time during the subsequent extraction. Using this method can provide a more reliable result if there could be late arrival data in the source system.

Here are some best practices:

Test your incremental logic with a source system that has an incremental data feed.
The incremental logic is based on the audit columns, such as LAST_UPDATE_DATE. The typical logic is “LAST_UPDATE_DATE > ?”
When you are using the “Last Successful Extraction Time” method, there can be no delay of data arrival in the source database.
Typically, if the source database is a backup or a replicated database, some delay will exist. Please use the Maximum Value of [LAST_UPDATE_DATE] method if it is available in the data source. As of Incorta 4.7, Oracle and MySQL databases are supported with this method.
Please add prune time into the incremental logic if you have following conditions:

You are using a backup or a replicated database.
The time zone differs between the Incorta server time zone and the source database time zone.
The last update date value is not very reliable for other reasons

PostgreSQL prune time offset example :

WHERE last_update_date + INTERVAL '7 HOURS' >= ?