Designing your Physical Schema
When designing your physical schemas the easiest construction to follow, especially if you are connecting to a common data source such as a database, is to create a schema that contains all of the tables that are loaded from that data source. In that way, all of the data from a common source is grouped together for loading into Incorta. Many times users will end up having one schema for each of their data sources and this is quite common. However, with a bit of forethought and keeping in mind the type of reporting that is the goal for your project, having schemas that merely follow your data sources may NOT be the optimal approach for achieving the desired results. You may need to add or alter your schema designs to optimally fit the desired reporting goals. Many times users are bringing in data from multiple sources, wishing to join, consolidate, and aggregate that disparate data into meaningful, related reporting structures that really represent the desired “reporting application”. Therefore, schemas need to take into account many factors including loading frequency, loading types (full or incremental), cross schema dependencies (both from join and load perspectives), and data aggregation and transformation techniques that are required to shape the data into the formats needed for reporting.
We recommend that you be familiar with these Incorta concepts before exploring this topic further.
These concepts apply to all releases of Incorta
The power of Incorta is the ease with which data from disparate sources can be brought into incorta and immediately be available for reporting. However, just because the data has been loaded into incorta does not mean the data will represent both the structures and relationships needed to allow data from one data source to be reported on along side another data source For example, perhaps a user wants to combine financial data from Oracle EBS with sales planning and opportunity information from Salesforce and shipping and warehouse data that exists within a company’s home grown warehousing application. Time and effort is needed to understand how these data sources are in fact related. Each of these systems may have constructs for Customer / Account and Product that allow those systems to connect to their respective transactions internally. Master tables may need to be constructed that relate the mapping between systems to allow the different systems to be joined together for reporting purposes.
Therefore, after importing all of the base tables from each individual data source, and then creating all of the joins between the tables of each data source to create the data models supporting the table structures within each data source, it may very well be necessary to create new schemas to begin constructing those master tables that can be used as common dimensions for consolidated reporting across multiple data sources (Common_Customer, Common_Product, Common_User, etc.). These common dimension tables can then be joined to each data source schema creating the “bridges” that will allow information from one data source to be logically and reliably joined to data from another previously unavailable set of data. Subsequently, if attributes from these master dimensions are then used when constructing reports or used as fields in business views that are used to construct reports, then Incorta will know how to construct the join paths between the systems to get the correct transactional data from each data source.
There are many considerations to take into account when loading data. The first consideration is for how tables within a single schema will load. Key factors include:
- Full versus Incremental considerations
- Table Dependencies and Load Order
- Derived Tables (Incorta Tables, Materialized Views, etc.)
In Incorta, tables are loaded by schema and load schedules are created for each schema. Therefore it is important to remember that ALL tables within a schema will be loaded for every schema load performed. Understanding the frequency for which it is necessary to load the data is a factor in schema design. For example, if a particular source system is relatively small, loads relatively quickly, and perhaps updates are only needed daily, then scheduling a FULL load of that schema once per day is likely a perfectly acceptable solution for loading that schema. On the other side of the spectrum, if reporting requires transactions from the source system to be made available for reporting in near real time, then implementing an approach for incremental loading of key transaction tables will be required to load the schema as often as every few minutes. In this case, other factors including key dimension tables, size of underlying tables, and size and desired speed of incremental loads, etc. may determine if a schema should be split into groups. Perhaps some of the tables in the application are support tables, change rarely, and can be reloaded less frequently, while some tables change with each transaction processed in the application. In this case, one schema could be run 1 time per day for support or dimensional tables, while another schema containing incremental tables could be running every few minutes to pull the most recent transactions for reporting.
There is no exact science to this process. All the tables for both Full and Incremental load types can remain in the same schema and can be scheduled to run every hour, for example, if the total time to load all of those tables is acceptable to the users. And many times there are advantages to allowing all “related” tables to remain in the same schema.
Each table within a schema, by default, will have a Full Load behavior. And if a schema is loaded then ALL the records from the data source for that table will be loaded as per the SQL that is defined, minus any rows excluded due to the application of a static load filter. For example, perhaps the source system has 100 million sales transactions going back 10 years. However, if the reporting requirement were to be that only transactions inclusive of the last 5 years are needed for reporting, then a load filter might be used such that each time the table is loaded only 5 years of data is pulled. But, since this is a Full Load, it means that each time the table is loaded, it is fully replaced, and a new parquet file is written. Incremental loads function in a way to avoid reloading the full table with each schema load. However, in order to accomplish this task, it is necessary for the system to know which attribute within the table will define which rows have changed or which rows are new. This can often be accomplished using “created on” or “updated on” date-time columns that exist within the table, or in some cases by using Max() values for an ID column, etc. Operationally, when an incremental load for a table is run, it can identify which rows have changed or been inserted in the table since the last time the schema was loaded. Incorta maintains this record of “when” a schema was last loaded such that the condition can be created to load the data “WHERE created on > last load date or updated on > last load date”. The mandatory conditions for accomplishing incremental loads include:
- Defining the columns in the Incorta table that define the row level uniqueness as “Keys” in the Incorta table. This can be a single unique identifier such as an ID or a combination of columns within the table that act as the attributes that define the uniqueness of each row.
- Defining which dates or date timestamps, or in some cases a key or ID column, that can be used to determine which records have been updated or inserted.
- #1 above is important because this is how Incorta determines which records are to be inserted versus which records are to be updated in place from records that have already been loaded previously.
- There is another application of Full and Incremental Loading called Snapshotting. There is the concept of Dense and Sparse Snapshotting in Incorta using slightly different full & incremental load techniques using keys that include dates such that different "effective versions" of a record can be stored in a table as they change over time. Please see the Snapshotting Article for more information on snapshotting techniques and when best to apply them.
Within a schema Incorta will begin to extract and load many tables in parallel. This is partly controlled by the connection pooling attribute set in the data source connection. However, that being said Incorta is intelligent enough to know table dependencies that are evident in the data model due to parent and child relationships that exist due to joins created between tables, etc. However, there are other things to know about controlling inside a schema the order in which various tables are loaded:
- Default: Do nothing and allow Incorta to load tables in the order it deems based on joins, etc.
- Create 2 or more Load Groups within the Schema to create groups of tables to load. Each group is loaded serially, but within each group, Incorta will choose the load order based on known dependencies.
- Materialized Views, which are run by Spark, begin to load only AFTER regular table loads are completed. So, if Group1 and Group2 were defined as 2 groups of tables to load in that order, Materialized Views would not begin to load until after Group2 completes. It should be noted that it is also possible to create groups to control the load order of Materialized Views.
- Finally, Incorta Analyzer Tables are a kind of table that exists in memory only and are created using the Analyzer UI. Any Incorta Analyzer Tables are created after Materialized Views are finished loading.
Note: The discussion above refers only to load order within a single schema. However, sometimes it is necessary to understand and manage load dependencies among multiple schemas. Currently, the only way to manage the dependency between Schema-A and Schema-B, where it is known ahead of time that Schema-A must run before Schema-B is to schedule Schema-B to start its load at a later time than Schema-A. Incorta will still understand if dependencies exist, and will place Schema-B in “Queued” status if it attempts to load while Schema-A is still loading. In multi source applications, the construction of the load orders of multiple Schemas must take into account these known dependencies prior to scheduling the jobs that will run the loads.
Many times it is necessary to apply some business logic to create different views of data for reporting. Perhaps the need is to create views of data that summarize, divide, or group data differently than what is provided by the base application tables or using specific attribute logic that defines the contents of the derived table. Incorta supports Many options to the developer to create derived tables, and they include:
Incorta-Over-Incorta Tables: The use of SQL written against tables already residing in Incorta performing whatever logic is required for the new table. This method uses Incorta’s internal PostgreSQL connection to run the SQL.
Materialized Views using Spark to create the derived table using one of the following techniques;
- Python Code
- Spark SQL
Incorta Analyzer Tables: Tables built using the same logic and methods provided when building insights for dashboards, with the results being save as a table.
Coming in Version 5 and after, the ability to create an native Incorta SQL table that is executed by the Incorta Engine using native Incorta tables and the power of direct data mapping, but using SQL to create the tables but retain the benefits and speed of the Incorta DDM mapping and engine. Like Incorta Analyzer tables, these table also reside only in memory.
Note: With the recent additions of PostgreSQL support and the upcoming support for Incorta SQL tables, it will soon be unnecessary to use the Incorta-Over-Incorta method of creating tables, as it is the slowest method for loading tables via SQL.
Many times, there is some confusion regarding where aggregation tables should be placed in schemas as they relate to the transactional tables on which they depend.
First a discussion on transactional tables. Transaction tables are typically the “child” of various “parent” tables For example, an Orders table may be the child of Customer, Order_Date, Product, and/or Region dimensional tables. In traditional data warehouse ETL modeling & loading, it is necessary to load all the dimension tables first such that all child records from the Orders table will be guaranteed to have a positive key match to each of its parent tables. In Incorta there may be some small benefit to separating master (parent) and transaction tables into separate schemas depending on how often or how frequently master and transaction tables are updated in the source application. For example, if it is known that the application from which the data comes will only update parent tables once per day, then a schema could be created to load those tables once per day. While if it is known that Orders are updated hourly, then there is no need to reload master tables hourly, and so Order transactions could be placed in a schema that runs on the appropriate update frequency. That being said, there is no harm in reloading parent tables more frequently, especially if there is no significant impact on load performance. Since incorta is “aware” of the parent to child relationships that exist in the DDM model, it will automatically manage if joins or PK index objects need to be rebuilt or updated based on the relationships that have been established between the tables in the schema.
Aggregation tables, on the other hand, could be running SQL or code (Spark python, etc.) to construct a new view of transactions based on the desired logic and the relationships existing between parents and children from one or more schemas. It is important to know requirements of each aggregation table, the kind of aggregation table chosen, and the potential intra and inter schema dependencies that exist for an aggregation table. A quick recap of aggregation table types:
- Incorta-Over-Incorta SQL table: using SQL against already existing Incorta tables from one or more schemas using Incorta’s internal PostgreSQL driver to build a table.
- Materialized Views: created with Spark SQL, PostgreSQL, Python, Scala, or R code to create tables based on the desired logic from already existing Incorta tables from one or more schemas.
- Incorta Analyzer table: in memory tables, created using the Incorta Analyzer UI and supporting all the capabilities that the Analyzer allows for creating dashboard insights.
- Incorta SQL tables: Available in version 5 and forward, the ability to create in memory incorta tables against already existing incorta tables using the incorta engine and DDM mapping to achieve much faster speed for loading these tables when compared to the SQL that runs via PostgreSQL jdbc connections. Also, this table type supports the most expansive ANSI SQL coverage when compared to Spark SQL or the PostgreSQL SQL engine.
When considering how each of these aggregation tables load in the context of a schema load, the following rules exist:
- Incorta Over Incorta Tables do NOT take into account any dependencies that exist between tables that it uses in its SQL. Therefore, if Table A & Table B are used in I-O-I table C, it will attempt to load at the same moment as Table A and B are loading in the same schema. This may cause undesired results, if the expectation is that Table C contains the data from Table A and B “after” they have loaded.
- Materialized Views within a schema run after all regular tables load, and in this respect, will gain the benefit of having fresh data from the current schema load before they are built.
- Incorta Analyzer Tables are constructed after both incorta tables and materialized views are run as part of the Post-Load phase of the schema load process. Therefore, these tables may use the freshest results from both tables and materialized views that exist during a given schema load.
- Incorta SQL tables, similar to Incorta Analyzer tables, run during the Post-Load stage of the schema loading process and thus also gain the benefit of any dependencies that may exist between it and underlying tables or materialized views.
- Cross-schema dependencies for tables used within Aggregated tables need to be managed according to the relationships that exist between the schemas. It is necessary to schedule the schemas to load in an order where dependent tables are loaded first before schemas containing aggregation tables that assume those tables have been loaded. Under the right conditions, if Schema-A contains a table that is used in Schema-B, as long as Schema-A is started before Schema-B, then Schema-B may be placed On Hold until the dependency is freed from Schema-A.
- A reminder that Intra-Schema loads and table load orders, as discussed above can be managed by creating load groups. These load groups can be created for both incorta tables and materialized views such that the tables that are needed to be loaded “first” are loaded before the next set of tables are loaded. However, it is also important to note that these groups are executed serially. For example, Group-1 followed by Group-2 followed by MView Group-1 followed by MView Group-2 followed by Post-Load. The serial execution can add time to the overall load time of the schema, so testing this type of configuration is recommended to ensure the performance falls within acceptable time frames.
Incorta provides a data type inference process. For SQL based data source, the data type will be defined based on the data types used in JDBC implementation. For CSV files, the data type will be generated based on a sample of the data itself. Please review and adjust after the schema table definition is complete and adjust the defaulted types if necessary. You can change the data type defined for a column in the Incorta schema definition, but a full load is then required to update the table (and parquet file) if the data is already used in formula column or other downstream data process. Please verify and confirm the data type as early in the design process as possible to avoid changing later. Once the data type is manually overridden, Incorta will not try to use the data type from the data type inference process to change it again for those existing columns. However, if you change the column name by changing the column alias in the SQL statement of the extract query, the data type will be defaulted by Incorta.
Int vs Long
- Incorta can cast the data type of Long to Int by simply changing the data type after the schema table is created.
- Using Int is preferable since Int uses less space in storage and in memory.
Double vs Long (or Int)
- A column with numeric value with decimal places may be recognized as a Long data type.
- Incorta uses a sample (subset) of data in the data type inference process. Therefore, if the data having decimal places is not included in the sample, then during loading of the data for the table, those records will be rejected.
- After the data type defined in Incorta is corrected, a full load is required. If the column is already used in a join, full load of the child tables are required.
Timestamp vs Date
- For some selected data sources, the "date" data stored in the data source may have the time component by default, and if so, Incorta will use the Timestamp data type as the default type during schema table creation. It is also possible to use implicit casting to just change the data type from Timestamp to Date at load time.
- If the data needs to be available both as a Date and as a Timestamp, a formula column can be used with the removeTime() function to explicitly cast the data from timestamp to date.
Converting the data into a specific data type in extraction query vs in Incorta
- You can use SQL CAST function in the extraction query since it is not a very expensive operation.
- In some cases, the data source type may not support the data type conversion function, creating a formula column in Incorta can be an alternate way to convert the data into a specific data type.
- When multiple data sets are used, we need to ensure that all data sets return the same data type of a given column.
Text data type
- Please avoid using data type Text in your schema unless it is absolutely necessary. Sometimes it is used since Incorta gets the data type via the data type defined in the source or the data type determined by the JDBC driver.
- If Text is used as the data type for a column, it will not be able to be used as a grouping attribute. It can be changed from Text to String to override the data type generated from the data type inference process. However, the String column may not be able to hold the size of the data that the Text data can hold and the data may be stripped / truncated.
Note for Oracle Applications and Oracle database about date type DATE:
Oracle databases support both the date type of Date and Timestamp. When Oracle database RDBMS stores data with the data type DATE, it stores the time component as well.
However, Incorta Schema Wizard and SQL data type inference process, by default, creates those columns from the Oracle database with the date type DATE as Timestamp columns in Incorta.
Oracle Apps use the data type DATE when the time is not significant. Please use the implicit casting by changing the data type after the extract SQL is defined or the table is generated from schema wizard.
Data type conversion functions frequently used in Incorta
- formatDate - convert a date or a timestamp to a string
- parseDate - convert a string to a date
- parseTimestamp - convert a string to a timestamp
- removeTime - convert a Timestamp column to a Date column
- epoch - convert a date or a timestamp to a epoch number
- double - convert a string to a double
- int - convert a string or a double as an integer
Just as in any database system, a table in Incorta should have a Key defined. This can be a single column or multiple columns, but the fields should define what will result in a unique row of data in the table by using this Key. Incorta will use this information in the following ways:
- Joins, especially on Parent tables, should be joined on Key Columns to ensure that the "single" parent row will be returned when joined to a child transaction record. This is a very important aspect. Many times when queries are not performing as expected, one of the first steps is to inspect the joins to ensure they are correct and using Key columns for Parent tables.
- Incremental Loading will use the Key defined for a table to determine if it needs to update (replace) and existing row in the table or do an insert of a new row.
- Some Incorta internal functions, such as Lookup, actually require that the fields used to return results from the lookup include the Key columns. Since getting the data in this method is not using a join, it requires Keys to be used to ensure it will only return a single result from the lookup.
Note: Regarding Joins on Key: Often times it is necessary to define an Alternate Key for a table in order to accomplish a Join to a different Table for a legitimate business reason. This is the perfect opportunity to create an Alias on the base table, such that an alternate key and join can be constructed to the table.
Column width in Incorta is not explicitly defined. Only the data type (long, double, Int, string) are defined as part of the table definition process as described above. If there is a business reason to test, check, or validate the length of a string for some business process or transformation rule, it is possible to use various string built in functions to check length() and other related string function which may be implemented based on the result of the test. For example:
if( length(batch_number) < 10, concat( repeat( "0", 10 - length(batch_number) ), batch_number ), batch_number )
In the example above, if the length of Batch Number is < 10, then it will Concatenate some number of "0" to the front of the Batch Number such that the length of the resulting string is 10. (ie: 123456 results in 0000123456).
Situations as described in the example above are often used to create the ability to join between disparate data sources as well. Using the example, if System-A stored batch number as an integer 123456, and System-B stored batch number as a string of "0000123456", then a formula column could be constructed as above such that the 2 systems can be related by Batch Number (and assuming the users are used to "seeing" batch number in the 10-digit format in reports.
Sometimes when creating a join to a supporting table it is necessary to use conditions other than just the required join columns designating the key or business key. In such cases, it is possible to add additional conditions to the join. Here are a couple common examples:
Example 1: Choose from Among multiple record Types.
A table of Country Codes to Country Names exists containing Country Names for each Country Code in many languages. If the desire is to only pull Country Names with English spelling for reporting, the Join can be constructed as follows:
Child Table Schema1.Transactions
Parent Table Schema1.Country_names
Schema1.Transactions.country.code = Schema1.Country_names.Country
Schema1.Country_names.Language = E
Note: The column used to create the additional condition must be a column existing in the Parent table.
Example 2: Choose from Among multiple effective records.
A table of Exchange Rates exists containing the conversion rate to the Dollar for each Country Code for different periods of time. If the Desire is to only pull Exchange rate to calculate US Dollars for the date associated with the Transaction, the Join can be constructed as follows:
Child Table Schema1.Transactions
Parent Table Schema1.Exchange_Rates
Schema1.Transactions.country.code = Schema1.Exchange_Rates.Country
Schema1.Transactions.Transaction_date >= Schema1.Exchange_Rates.Effective_Dt
Schema1.Transactions.Transaction_date <= Schema1.Exchange_Rates.Expiration_Dt
Note: alternatively, in the example above, the join could just use Effective_Dt for the condition, since there would still be a single result using only the one date. However, in some situations it is necessary to use a range if the Transaction itself identifies a range of time.