When designing your physical schemas, the easiest construction to follow, especially when connecting to a common data source such as a database, is to create a schema that contains all of the tables loaded from that data source. That way, all of the data from a common source is grouped together for loading into Incorta. Many users end up with one schema per data source, and this is quite common. However, with a bit of forethought, and keeping in mind the type of reporting that is the goal of your project, schemas that merely follow your data sources may NOT be the optimal approach for achieving the desired results. You may need to add to or alter your schema designs to fit the desired reporting goals.
Users are often bringing in data from multiple sources and want to join, consolidate, and aggregate that disparate data into meaningful, related reporting structures that truly represent the desired "reporting application". Therefore, schema design needs to take into account many factors, including loading frequency, load types (full or incremental), cross-schema dependencies (from both join and load perspectives), and the data aggregation and transformation techniques required to shape the data into the formats needed for reporting.
We recommend that you be familiar with these Incorta concepts before exploring this topic further.
These concepts apply to all releases of Incorta.
The power of Incorta is the ease with which data from disparate sources can be brought into Incorta and be immediately available for reporting. However, just because the data has been loaded into Incorta does not mean that it has the structures and relationships needed to allow data from one source to be reported alongside data from another. For example, perhaps a user wants to combine financial data from Oracle EBS with sales planning and opportunity information from Salesforce, plus shipping and warehouse data that exists within a company's home-grown warehousing application. Time and effort are needed to understand how these data sources are in fact related. Each of these systems may have constructs for Customer / Account and Product that allow those systems to connect to their respective transactions internally. Master tables may need to be constructed that maintain the mapping between systems so the different systems can be joined together for reporting purposes.
Therefore, after importing all of the base tables from each individual data source, and creating the joins between the tables of each source to build the data models that support those table structures, it may be necessary to create new schemas that hold the master tables used as common dimensions for consolidated reporting across multiple data sources (Common_Customer, Common_Product, Common_User, etc.). These common dimension tables can then be joined to each data source schema, creating the "bridges" that allow information from one data source to be logically and reliably joined to data from another. Subsequently, if attributes from these master dimensions are used when constructing reports, or as fields in business views that are used to construct reports, Incorta will know how to construct the join paths between the systems to retrieve the correct transactional data from each data source.
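As a purely illustrative sketch (not a prescription), a Common_Customer bridge might be built as a derived table over data already loaded into Incorta. All schema, table, and column names below are hypothetical, including the assumption that the Salesforce Account record carries an EBS account number for matching:
-- Hypothetical SQL for a Common_Customer master dimension built as a
-- derived table (for example, a Materialized View) in a COMMON schema.
SELECT ebs.cust_account_id      AS ebs_customer_id,
       sf.id                    AS sfdc_account_id,
       ebs.account_name         AS common_customer_name
FROM   EBS.HZ_CUST_ACCOUNTS     ebs
JOIN   SFDC.ACCOUNT             sf
       ON sf.ebs_account_number = ebs.account_number
Once such a bridge table exists, it can be joined to each source schema so that attributes from the common dimension resolve to the correct transactions in both systems.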
There are many considerations to take into account when loading data. The first consideration is how tables within a single schema will load. Key factors include:
In Incorta, tables are loaded by schema, and load schedules are created for each schema. Therefore, it is important to remember that ALL tables within a schema will be loaded for every schema load performed. Understanding how frequently the data must be loaded is a factor in schema design. For example, if a particular source system is relatively small, loads relatively quickly, and only needs to be updated daily, then scheduling a FULL load of that schema once per day is likely a perfectly acceptable solution. At the other end of the spectrum, if reporting requires transactions from the source system to be available in near real time, then incremental loading of key transaction tables will be required so that the schema can be loaded as often as every few minutes. In this case, other factors, including key dimension tables, the size of the underlying tables, and the size and desired speed of incremental loads, may determine whether a schema should be split into groups. Perhaps some of the tables in the application are support tables that change rarely and can be reloaded less frequently, while other tables change with each transaction processed in the application. In that case, one schema could run once per day for support or dimensional tables, while another schema containing the incremental tables could run every few minutes to pull the most recent transactions for reporting.
There is no exact science to this process. All the tables for both Full and Incremental load types can remain in the same schema and can be scheduled to run every hour, for example, if the total time to load all of those tables is acceptable to the users. And many times there are advantages to allowing all “related” tables to remain in the same schema.
Each table within a schema, by default, has Full Load behavior: whenever the schema is loaded, ALL of the records from the data source for that table are loaded per the SQL that is defined, minus any rows excluded by a static load filter. For example, perhaps the source system has 100 million sales transactions going back 10 years. If the reporting requirement is that only the last 5 years of transactions are needed, then a load filter might be used so that each time the table is loaded, only 5 years of data is pulled. But since this is a Full Load, each time the table is loaded it is fully replaced and a new parquet file is written.
Incremental loads avoid reloading the full table with each schema load. To accomplish this, the system needs to know which attribute within the table identifies rows that have changed or are new. This can often be accomplished using "created on" or "updated on" date-time columns that exist within the table, or in some cases by using Max() values for an ID column. Operationally, when an incremental load for a table runs, it identifies which rows have changed or been inserted since the last time the schema was loaded. Incorta maintains the record of when a schema was last loaded so that a condition can be created to load the data "WHERE created on > last load date OR updated on > last load date". The mandatory conditions for accomplishing incremental loads include:
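For illustration only, a minimal incremental extract query for a hypothetical orders table might look like the sketch below. In an incremental query, Incorta can substitute the timestamp of the last successful extraction for the ? placeholder; the table and column names are assumptions:
-- Hypothetical incremental extract query; ? is replaced with the
-- timestamp of the last successful load of this table.
SELECT order_id,
       customer_id,
       order_total,
       created_on,
       updated_on
FROM   sales_orders
WHERE  created_on > ?
    OR updated_on > ?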
Notes:
Within a schema, Incorta will extract and load many tables in parallel. This is partly controlled by the connection pooling attribute set in the data source connection. Incorta is also intelligent enough to understand the table dependencies evident in the data model through the parent and child relationships created by joins between tables. However, there are other ways to control the order in which tables within a schema are loaded:
Note: The discussion above refers only to load order within a single schema. Sometimes, however, it is necessary to understand and manage load dependencies among multiple schemas. Currently, the only way to manage the dependency between Schema-A and Schema-B, where it is known ahead of time that Schema-A must run before Schema-B, is to schedule Schema-B to start its load at a later time than Schema-A. Incorta will still understand if dependencies exist and will place Schema-B in "Queued" status if it attempts to load while Schema-A is still loading. In multi-source applications, the load schedules of the various schemas must take these known dependencies into account before the load jobs are scheduled.
Many times it is necessary to apply business logic to create different views of data for reporting. Perhaps the need is to create views that summarize, divide, or group data differently than the base application tables, or to use specific attribute logic to define the contents of a derived table. Incorta provides the developer with many options for creating derived tables, including:
Incorta-Over-Incorta Tables: SQL written against tables already residing in Incorta, performing whatever logic is required for the new table. This method uses Incorta's internal PostgreSQL connection to run the SQL.
Materialized Views: Tables built using Spark to create the derived table, using one of the following techniques:
Incorta Analyzer Tables: Tables built using the same logic and methods provided when building insights for dashboards, with the results being saved as a table.
Incorta SQL Tables: Coming in Version 5 and later, the ability to create a native Incorta SQL table that is executed by the Incorta engine against native Incorta tables, using SQL to define the table while retaining the benefits and speed of Incorta's Direct Data Mapping (DDM) and engine. Like Incorta Analyzer tables, these tables reside only in memory.
Note: With the recent additions of PostgreSQL support and the upcoming support for Incorta SQL tables, it will soon be unnecessary to use the Incorta-Over-Incorta method of creating tables, as it is the slowest method for loading tables via SQL.
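As a simple, hypothetical illustration of a derived table, the Spark SQL for a Materialized View that reshapes base tables already loaded into Incorta into an aggregated reporting table might look like the following; an Incorta-Over-Incorta table would use essentially the same SQL, executed through the internal PostgreSQL connection instead of Spark. All schema, table, and column names are assumed:
-- Hypothetical Materialized View (Spark SQL): order totals by customer and month.
SELECT o.customer_id,
       date_format(o.order_date, 'yyyy-MM')   AS order_month,
       SUM(l.line_amount)                     AS total_amount,
       COUNT(DISTINCT o.order_id)             AS order_count
FROM   SALES.ORDERS       o
JOIN   SALES.ORDER_LINES  l
       ON l.order_id = o.order_id
GROUP  BY o.customer_id,
          date_format(o.order_date, 'yyyy-MM')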
Many times, there is some confusion regarding where aggregation tables should be placed in schemas as they relate to the transactional tables on which they depend.
First, a discussion of transactional tables. Transaction tables are typically the "child" of various "parent" tables. For example, an Orders table may be the child of Customer, Order_Date, Product, and/or Region dimension tables. In traditional data warehouse ETL modeling and loading, it is necessary to load all of the dimension tables first so that every child record in the Orders table is guaranteed a positive key match to each of its parent tables. In Incorta, there may be some small benefit to separating master (parent) and transaction tables into separate schemas, depending on how frequently master and transaction tables are updated in the source application. For example, if it is known that the source application only updates parent tables once per day, then a schema could be created to load those tables once per day. If it is known that Orders are updated hourly, there is no need to reload master tables hourly, so Order transactions could be placed in a schema that runs on the appropriate update frequency. That said, there is no harm in reloading parent tables more frequently, especially if there is no significant impact on load performance. Since Incorta is "aware" of the parent-to-child relationships that exist in the DDM model, it will automatically manage whether join or PK index objects need to be rebuilt or updated based on the relationships established between the tables in the schema.
Aggregation tables, on the other hand, may run SQL or code (Spark Python, etc.) to construct a new view of transactions based on the desired logic and the relationships that exist between parents and children in one or more schemas. It is important to know the requirements of each aggregation table, the kind of aggregation table chosen, and the potential intra- and inter-schema dependencies that exist for it. A quick recap of aggregation table types:
When considering how each of these aggregation tables load in the context of a schema load, the following rules exist:
Notes:
Incorta provides a data type inference process. For SQL-based data sources, the data type is defined based on the data types used in the JDBC implementation. For CSV files, the data type is inferred from a sample of the data itself. Review the defaulted types after the schema table definition is complete and adjust them if necessary. You can change the data type defined for a column in the Incorta schema definition, but a full load is then required to update the table (and parquet file) if the data is already used in a formula column or other downstream data process. Verify and confirm the data types as early in the design process as possible to avoid changing them later. Once a data type is manually overridden, Incorta will not use the data type inference process to change it again for that existing column. However, if you change the column name by changing the column alias in the SQL statement of the extract query, the data type will again be defaulted by Incorta.
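For example, in the hypothetical extract query below, a manual override of the aliased column's data type will persist across loads, but renaming the alias later causes Incorta to treat it as a new column and default the data type again:
-- Hypothetical extract query; the alias "amount" becomes the Incorta column name.
-- Renaming the alias later (e.g., to inv_amount) re-runs type inference for it.
SELECT invoice_id,
       invoice_amount AS amount
FROM   ap_invoices_all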
Int vs Long
Double vs Long (or Int)
Timestamp vs Date
Converting the data into a specific data type in extraction query vs in Incorta
Text data type
Note for Oracle Applications and Oracle databases regarding the DATE data type:
Oracle databases support both the DATE and TIMESTAMP data types. When an Oracle database stores data using the DATE type, it stores the time component as well. However, the Incorta Schema Wizard and the SQL data type inference process, by default, create columns with the Oracle DATE type as Timestamp columns in Incorta. Oracle Apps uses the DATE data type when the time is not significant, so use the implicit casting by changing the data type after the extract SQL is defined or the table is generated from the Schema Wizard.
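A hypothetical example against an Oracle Apps table (the table and column names are for illustration only):
-- ORDERED_DATE is an Oracle DATE column, so the Schema Wizard defaults it
-- to a Timestamp column in Incorta.
SELECT header_id,
       order_number,
       ordered_date
FROM   oe_order_headers_all
-- Rather than converting in the SQL itself, change the column's data type
-- from Timestamp to Date in the Incorta table definition so that Incorta
-- performs the implicit cast.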
Data type conversion functions frequently used in Incorta
Just as in any database system, a table in Incorta should have a Key defined. The Key can be a single column or multiple columns, but the columns chosen should uniquely identify a row of data in the table. Incorta will use this information in the following ways:
Note regarding Joins on Key: Oftentimes it is necessary to define an Alternate Key for a table in order to accomplish a join to a different table for a legitimate business reason. This is the perfect opportunity to create an Alias of the base table, so that an alternate key and join can be constructed against the alias.
Column width in Incorta is not explicitly defined. Only the data type (long, double, int, string) is defined as part of the table definition process described above. If there is a business reason to test, check, or validate the length of a string for some business process or transformation rule, it is possible to use the built-in string functions, such as length() and other related string functions, and act on the result of the test. For example:
if(
    length(batch_number) < 10,
    concat(
        repeat(
            "0",
            10 - length(batch_number)
        ),
        batch_number
    ),
    batch_number
)
In the example above, if the length of Batch Number is less than 10, the formula concatenates enough "0" characters to the front of the Batch Number so that the length of the resulting string is 10 (e.g., 123456 results in 0000123456).
Situations like the one described above are often used to create the ability to join between disparate data sources as well. Using the example, if System-A stored batch number as the integer 123456, and System-B stored batch number as the string "0000123456", then a formula column could be constructed as above so that the two systems can be related by Batch Number (assuming the users are accustomed to seeing batch number in the 10-digit format in reports).
Sometimes when creating a join to a supporting table, it is necessary to use conditions other than just the required join columns designating the key or business key. In such cases, it is possible to add additional conditions to the join. Here are a couple of common examples:
Example 1: Choose from among multiple record types.
Country | Description | Language |
USA | United States | E |
USA | Etats Unis | F |
USA | Vereinigte Staaten | G |
A table of Country Codes to Country Names exists containing Country Names for each Country Code in many languages. If the desire is to pull only Country Names with English spelling for reporting, the join can be constructed as follows:
Child Table: Schema1.Transactions
Parent Table: Schema1.Country_names
Join condition: Schema1.Transactions.country_code = Schema1.Country_names.Country
Additional condition: Schema1.Country_names.Language = 'E'
Note: The column used to create the additional condition must be a column existing in the Parent table.
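Conceptually, this join definition behaves like the following SQL; the sketch is shown only to illustrate the effect of the additional condition, since Incorta manages the join itself through the schema definition:
-- Illustrative SQL equivalent of the join plus the additional condition.
SELECT t.*,
       c.Description AS country_name
FROM   Schema1.Transactions     t
LEFT JOIN Schema1.Country_names c
       ON  c.Country  = t.country_code
       AND c.Language = 'E'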
Example 2: Choose from among multiple effective records.
Country | Effective_Dt | Expiration_Dt | Exchange_Rate |
CAN | 1/1/1990 | 12/31/1998 | 1.2 |
CAN | 1/1/1999 | 12/31/2010 | 1.22 |
CAN | 1/1/2011 | 12/31/2099 | 1.25 |
A table of Exchange Rates exists containing the conversion rate to the US Dollar for each Country Code over different periods of time. If the desire is to pull only the exchange rate that applies to the date associated with the transaction, the join can be constructed as follows:
Child Table: Schema1.Transactions
Parent Table: Schema1.Exchange_Rates
Join condition: Schema1.Transactions.country_code = Schema1.Exchange_Rates.Country
Additional condition: Schema1.Transactions.Transaction_date >= Schema1.Exchange_Rates.Effective_Dt
Additional condition: Schema1.Transactions.Transaction_date <= Schema1.Exchange_Rates.Expiration_Dt
Note: Alternatively, in the example above, the join could use only Effective_Dt for the additional condition if that still produces a single matching result for each transaction date. In some situations, however, it is necessary to use a range, especially if the transaction itself identifies a range of time.
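Conceptually, this join behaves like the following SQL sketch, with the effective-date range applied as additional join conditions:
-- Illustrative SQL equivalent of the join with a date-range condition.
SELECT t.*,
       r.Exchange_Rate
FROM   Schema1.Transactions      t
LEFT JOIN Schema1.Exchange_Rates r
       ON  r.Country = t.country_code
       AND t.Transaction_date >= r.Effective_Dt
       AND t.Transaction_date <= r.Expiration_Dt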