Near Real-Time Reporting
Incorta is not a real-time data platform. However, given its unique approach of leaving data "in shape," paired with the speed with which it extracts and prepares data, it can support very frequent data refreshes. That said, the tighter the refresh window, the more important it is to get your design right. This article focuses on design principles that help you load data into Incorta quickly so that your reporting can be refreshed more frequently.
We recommend that you be familiar with these Incorta concepts before exploring this topic further.
- Performance Tuning Guide
- Schema Manager
- About Tables
- Importing and Loading Data
- Incorta Materialized Views
- Incorta Analyzer Table
Most of the information in this article applies to Incorta version 4.0 and above. Where a feature is version-specific, the relevant versions are noted below.
The key to near real-time data refreshes is optimizing your load performance. The following tips can help.
When optimizing schema extract and load, it is important to understand that Incorta knows the dependencies between objects and automatically loads them in a sensible order. Within a schema, Incorta loads tables in parallel where it can and loads the remaining objects in the required order, which simplifies your life a great deal.
While it is possible to spread your table objects across schemas, and you may be tempted to do so to keep your schemas "small," it is more load-efficient to keep related objects together in one schema. Each schema load carries a small but non-negligible startup and shutdown cost, so running two or more schema loads for the same set of tables, where dependencies prevent the schemas from running in parallel, is more expensive than running a single schema load.
The second complication with using multiple schemas for dependent objects is that the UI does not currently let you define dependencies between load schedules. You must instead schedule loads at specific times or intervals: schedule the first schema, then schedule the dependent schema to run when you believe the first has safely finished. This is inefficient because it is difficult, if not impossible, to time the dependent load consistently without a safety buffer to ensure the first load has completed. You can overcome this by scheduling the dependent schema to run at the same time as, or just after, the first schema starts; Incorta knows the two cannot run concurrently and will start the second schema load only after the first has finished.
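The trade-off between a fixed safety buffer and overlapping start times can be sketched with some illustrative arithmetic (all durations below are assumptions for the sake of the example, not Incorta measurements):

```python
# Schema A's load time varies, so a purely time-based schedule for the
# dependent Schema B must budget for A's worst case plus a safety margin.
worst_case_a = 40   # minutes: assumed worst-case load time for Schema A
typical_a = 25      # minutes: assumed typical load time for Schema A
buffer = 10         # minutes: extra margin on top of the worst case

# Strategy 1: schedule B at a fixed offset after A's start.
b_start_fixed = worst_case_a + buffer   # B always waits 50 minutes
idle_time = b_start_fixed - typical_a   # 25 idle minutes on a typical day

# Strategy 2: schedule B at (or just after) A's start time. Incorta knows
# the two cannot run concurrently, queues B, and starts it the moment A
# finishes, so the idle time disappears.
b_start_queued = typical_a              # B starts right after A on a typical day
```

On a typical day, the fixed-offset schedule wastes the entire gap between A's typical and worst-case durations, while the queued schedule wastes none of it.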
The other factor to keep in mind is that objects load in a specific order based on their type. The load order is:
- Tables
- Materialized Views (MVs)
- Incorta Analyzer Tables
Keeping this in mind, it is clear that insights based on Incorta Analyzer Tables have a longer path to refreshed data than insights based on standard tables. Keep the number of object layers to a minimum: it is possible to build MVs on top of other MVs, or Incorta Analyzer Tables on top of business views that are themselves built from MVs, tables, and possibly other business views, to name just a couple of scenarios. It gets complex fast, and the more layers involved, the more ordering comes into play and the longer it can take for your underlying columns to populate.
| Recommendation | Details |
|---|---|
| Put related table objects in a single schema | Tables can be related based on the data they contain or the schedule on which they need to refresh. Incorta can figure out how to load the objects within a schema optimally. |
| Minimize the layering of table objects | Incorta loads table objects in order based on object type. Object types that come later in the ordering benefit from having fewer layers. |
Use incremental loads when possible. An incremental load brings only what has changed into memory, which normally means less processing for Incorta, so the path from extraction to loading into Analyzer memory is faster than loading a full table.
Incremental loads require a way to uniquely identify each record and to determine when it was last updated in the source. The easiest way to do this is with a column that holds the timestamp of the record's last update. Depending on the object to be loaded, other logic can be used as well.
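A sketch of the idea follows. The table and column names (`last_update_date`, etc.) are illustrative assumptions; Incorta itself tracks the last successful extract time for you when you configure an incremental load, so this helper only shows the shape of the query an incremental strategy relies on.

```python
from datetime import datetime, timezone
from typing import Optional

def incremental_query(table: str, last_loaded: Optional[datetime]) -> str:
    """Build an extraction query that pulls only rows changed since the
    last successful load. Assumes the source table exposes a reliable
    last-update timestamp column (here called last_update_date)."""
    base = f"SELECT order_id, order_date, customer_id, amount FROM {table}"
    if last_loaded is None:
        # First run: no watermark yet, so do a full extract.
        return base
    ts = last_loaded.strftime("%Y-%m-%d %H:%M:%S")
    return f"{base} WHERE last_update_date > TIMESTAMP '{ts}'"

full = incremental_query("orders", None)
delta = incremental_query(
    "orders", datetime(2024, 1, 1, tzinfo=timezone.utc)
)
```

The key design point is the watermark: as long as the source reliably stamps every insert and update, each refresh only touches the rows changed since the previous one.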
| Recommendation | Details |
|---|---|
| Use incremental loads | Incremental loads are normally faster because less data is processed. |
For objects that support chunking, set it up when you define the table object. Chunking allows loads to process in parallel, which speeds up load times. You can set the number of chunks that process in parallel with the Table Maximum Parallel Chunks configuration setting in the CMC (work with your Incorta administrator). Note that not all data sources or source types support chunking.
| Recommendation | Details |
|---|---|
| Use chunking strategies | When possible, use chunking when loading your data. It can be very beneficial for large tables, though it may not be worthwhile when record counts are small. |
Data is loaded into memory only if a table is performance optimized. You do want to optimize all the columns you will use in reporting so that your dashboards perform at their best. That said, at times you may not report on certain table objects at all; typically these tables contain data that needs further transformation (via an MV) before it is ready for reporting. For these tables, it makes sense to disable the Performance Optimized toggle on the Table Details screen and save the time it takes to load the data into memory.
| Recommendation | Details |
|---|---|
| Turn Performance Optimized off for tables that do not need it | There is no reason to take up memory for tables that are not reported on, and this removes work from the load process. |
It is also possible to skip loading columns into Incorta. If you will not use a column for reporting, there is no reason to load it, and omitting it saves time in your load process.
Instead of defining your tables with "SELECT *" statements, specify only the columns you will use in Incorta in your select clause.
| Recommendation | Details |
|---|---|
| Do not load unneeded columns from the source | If you know you will not use a column, do not load it into Incorta. Instead of writing your table-defining SQL as "SELECT * FROM table_abc", specify exactly which columns to bring in with your select clause. |
If you have very important load jobs, consider a dedicated Loader node. It is possible to pin jobs to specific Loader nodes, which lets you set up a job with no resource competition. As long as the dedicated Loader node is big enough, you can guarantee consistent performance.
The downside to this strategy is that it adds complexity to your hardware topology, which means more for you to maintain if you are hosting on-premises. (Incorta Cloud is always an option to avoid that headache!) Treat this as a last-resort option, to be used after you have exhausted all other avenues for tuning.
| Recommendation | Details |
|---|---|
| Pin load jobs to a dedicated Loader node | This strategy can eliminate competition on your Loader service and, as a result, give you consistency in your loads. |