This article discusses the procedures and best practices for Data Load Strategy.
We recommend that you be familiar with these Incorta concepts before exploring this topic further.
These concepts apply to all releases of Incorta.
Single or Multiple Schemas?
Put related tables that are frequently shown together in the same dashboard into the same schema.
When Incorta loads or refreshes data, it loads by schema. A job can be scheduled for a schema and all tables defined in the schema will be extracted and loaded. Joins are calculated as part of the job.
Should you put joined tables in one schema?
While tables that are frequently joined together should be placed in the same schema, cross schema joins are supported by Incorta. When a parent table is not updated frequently and many tables that are not closely related to each other are joined to the same parent table, it is fine to split these child tables into different schemas.
For example, the GL account structure does not change often and many functional areas have references to GL accounts. You can place the GL Account structure into a schema that is joined from two unrelated schemas that hold the child tables. The benefit of this strategy is to make the data load finish in a shorter period with a smaller number of tables in a data load job.
Is there a limit on the size of schema?
Generally speaking, it is a good idea to distribute tables into different schemas by area of interest (or application) and to keep schemas small as a rule. This makes it easier to find the tables you are looking for, and reduces the total number of joins overall.
The size of your schemas is physically limited based on the hardware available to the loader service. The requirement is that primary key columns for all tables in the same schema have to fit in the available loader memory. This limitation should be considered if you have multiple tables with billions of rows in the same schema. This scenario might require a larger memory allocation to the loader service than is typical.
When a schema is refreshed in the loader service and synchronized into other services, the data from the impacted schema may be evicted from memory. The bigger the schema, the more space and time is required during this process.
Should you consider data sharing and data security during schema design?
Place tables that can be shared with the same group of users in the same schema. Schemas are shared with analyzers for building dashboards. Schemas are also shared among schema managers for maintaining schema definitions and loading and monitoring data load jobs.
The schema is the unit for data sharing. When a user is granted permission to access a schema, the user can see all tables. Tables cannot be shared individually. In the case of business schemas, business views cannot be shared individually.
If the data extracted into a table is being used in a data security filter, make sure that the table is refreshed as often as is acceptable to reflect the data security permission changes. Incorta data security filters are evaluated at runtime. When a data security filter involves a permission grant from a data source, data security changes will be delayed until the related table is refreshed in Incorta.
For example, you could extract your management hierarchy from a data source and use it to secure data in Incorta. This hierarchy could be used by Incorta to allow managers to view personal information about only their subordinates. Security in Incorta will not reflect data source security changes until the management hierarchy is refreshed from the source.
Should a schema contain data from multiple data sources?
Incorta schemas do not have to match the database schema or “database” defined in the source system. You can have one table sourced from data source A and another table sourced from data source B.
Put tables that will be joined together or are closely related into the same schema. This way you can ensure data consistency.
There is a benefit to putting materialized views and their source tables in the same schema. In this situation, Incorta sequences the materialized views to populate after the source tables have loaded, guaranteeing that the MVs have the freshest data. Even when the source tables come from different data sources, it is beneficial to put them into one Incorta schema.
Incorta also allows you to put data from different data sources into the same table. You can create multiple data sets with the same structure, with each data set connected to its own data source. These data sets will be executed by different workers and can be executed in parallel.
Here are constraints that you should consider:
Incorta will enforce uniqueness based on the defined key and will discard old records with the same key, replacing them with newly extracted records, regardless of which data set the data is extracted from.
Use a column with a constant in the extraction query to indicate where the data comes from. This can help with debugging if you face a data issue in loading a table with multiple sources.
Use multiple data sets to break up a large SQL query if the Incorta chunking feature is not available.
Chunking is supported when the data source is Oracle or MySQL. You can use the chunking feature to break a large extraction query into smaller chunks that are executed in parallel.
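The keyed-replacement behavior described above can be sketched as follows. This is a plain-Python illustration of the merge semantics, not Incorta internals; the table structure, the "invoice_id" key, and the "src" constant column are made-up examples:

```python
def merge_by_key(existing, extracted, key="invoice_id"):
    """Merge newly extracted records into the existing table contents.

    Records that share a key value replace the older version, no matter
    which data set produced them -- mirroring how Incorta enforces
    uniqueness on the defined key. Illustration only, not Incorta code.
    """
    merged = {row[key]: row for row in existing}
    for row in extracted:
        merged[row[key]] = row  # newly extracted record wins
    return list(merged.values())

# Two data sets feeding the same table; "src" is a constant column
# added in each extraction query to indicate where a row came from,
# which helps debugging when a table has multiple sources.
existing = [{"invoice_id": 1, "src": "ERP_A"}, {"invoice_id": 2, "src": "ERP_A"}]
extracted = [{"invoice_id": 2, "src": "ERP_B"}, {"invoice_id": 3, "src": "ERP_B"}]
print(merge_by_key(existing, extracted))
```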
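The idea behind chunking can be sketched like this. It is a hypothetical illustration that splits a numeric key range into pieces and extracts them on a thread pool; Incorta's actual chunking is internal to the loader service, and the table and column names here are assumptions:

```python
from concurrent.futures import ThreadPoolExecutor

def chunk_ranges(min_id, max_id, n):
    """Split the inclusive key range [min_id, max_id] into n chunks."""
    step = -(-(max_id - min_id + 1) // n)  # ceiling division
    return [(lo, min(lo + step - 1, max_id))
            for lo in range(min_id, max_id + 1, step)]

def extract_chunk(bounds):
    lo, hi = bounds
    # A real extractor would run something like:
    #   SELECT ... FROM big_table WHERE id BETWEEN ? AND ?
    return f"ids {lo}..{hi}"

# Four workers extract the chunks in parallel instead of one large scan.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(extract_chunk, chunk_ranges(1, 1000, 4)))
print(results)
```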
Incremental logic helps to reduce the volume to be extracted and processed and, in general, helps the performance of refreshing data.
How incremental logic works in Incorta
When the “Last Successful Extraction Time” method is used, Incorta stores the time when the extraction of a table starts, at the table level, in a file called time.log on the server under the Incorta tenant folder (one file for each schema). The file is updated when the extraction of the table finishes successfully.
The value is stored in epoch time format, independent of time zone. In the next extraction of the table, the value replaces the "?" specified in the SQL statement: it is passed as a JDBC bind variable and converted to the timestamp type of the specific database platform.
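The conversion and binding can be sketched as follows. This is illustrative Python using sqlite3 in place of a JDBC source; the table name, column name, and checkpoint value are assumptions:

```python
import sqlite3
import time

SQL = "SELECT * FROM orders WHERE last_update_date >= ?"

# Simulated stored checkpoint: epoch milliseconds, time-zone independent.
last_success_ms = 1_700_000_000_000

# Convert the epoch value to the target database's timestamp representation
# and bind it in place of the "?" -- conceptually what the JDBC layer does.
bound_value = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(last_success_ms / 1000))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, last_update_date TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, "2023-01-01 00:00:00"), (2, "2024-06-01 00:00:00")])
rows = conn.execute(SQL, (bound_value,)).fetchall()
print(rows)  # only the row updated since the checkpoint
```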
Alternatively, depending on the connector, the Maximum Value of [LAST_UPDATE_DATE] method may be available.
LAST_UPDATE_DATE is a column that identifies changed or newly created records, and it must be selected as part of the extraction query. Incorta stores the maximum value extracted into Incorta storage and uses it in the subsequent extraction to fetch only the data changed or created after that time. This method can provide a more reliable result when late-arriving data is possible in the source system.
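The Maximum Value method can be sketched as follows. This is illustrative only; the rows and column names are made up, and real connectors may combine the watermark comparison with key-based replacement:

```python
def extract_incremental(rows, watermark):
    """Return rows changed after the watermark, plus the new watermark.

    The new watermark is the maximum LAST_UPDATE_DATE actually extracted,
    not the extraction start time -- so rows that arrive late in the
    source are not silently skipped by the next run. Sketch only.
    """
    changed = [r for r in rows if r["last_update_date"] > watermark]
    new_watermark = max((r["last_update_date"] for r in changed),
                        default=watermark)
    return changed, new_watermark

rows = [
    {"id": 1, "last_update_date": "2024-01-05"},
    {"id": 2, "last_update_date": "2024-01-10"},
]
changed, wm = extract_incremental(rows, "2024-01-01")
print(wm)  # "2024-01-10" becomes the watermark for the next extraction
```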
One best practice is to apply a look-back window to the incremental condition, so that late-arriving data and clock or time zone differences do not cause missed records. For example:

WHERE last_update_date + INTERVAL '7 HOURS' >= ?
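The arithmetic behind the seven-hour look-back can be checked directly (illustrative; the checkpoint value is made up):

```python
from datetime import datetime, timedelta

checkpoint = datetime(2024, 6, 1, 12, 0, 0)  # stored last-extraction time
lookback = timedelta(hours=7)

# last_update_date + 7 hours >= checkpoint
# is equivalent to
# last_update_date >= checkpoint - 7 hours,
# so records updated up to 7 hours before the checkpoint are re-extracted.
effective_cutoff = checkpoint - lookback
print(effective_cutoff)  # 2024-06-01 05:00:00
```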