Data extraction is a key component of any BI product. The infrastructure and processes that extract data must be highly reliable and able to fail over without manual intervention. When data volumes grow very large, it also becomes important to distribute the load, either for performance or because of capacity limits on the available hardware.
Incorta's Loader Service extracts data from various data sources and needs to be highly available so that no single point of failure can bring down this key component. This article will help you configure the service for high availability and distribute the load.
What you need to know before reading this article
This article assumes knowledge of Incorta installation and administration, as well as an understanding of load balancing, high availability, and failover.
When an Incorta cluster is deployed, the default configuration consists of a single loader service (for data acquisition) and a single analytics service (for analytics on the extracted data). All data extraction is performed by that one loader service, and all analytics users are served by that one analytics service. Neither service is highly available.
The following diagram represents an Incorta architecture with a single loader service and a single analytics service. If either service goes down, the corresponding loader or analytics functionality becomes unavailable.
Simple, low-cost solution
No High Availability, so it is not suitable for applications that require it
High Availability at Analytics Level
The most common option customers choose is to provision multiple analytics services to make analytics highly available. If a particular analytics service goes down, users are still served by the other configured analytics services.
The diagram below represents one loader (no high-availability) and three analytics services (high-availability).
High Availability at Analytics level
Loader service failures will leave users with stale data
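The failover behavior described in this section can be illustrated with a minimal sketch. It is not Incorta code: the service names, the `make_picker` helper, and the health check are all hypothetical, and a real deployment would typically rely on a load balancer in front of the analytics services. The sketch only shows the idea that requests rotate across services and skip any that are down.

```python
from itertools import cycle


def make_picker(services, is_healthy):
    """Return a function that picks the next healthy service, round-robin.

    `services` and `is_healthy` are hypothetical stand-ins for whatever
    a real load balancer would use to track analytics services.
    """
    ring = cycle(services)

    def pick():
        # Try each service at most once per call; skip unhealthy ones.
        for _ in range(len(services)):
            candidate = next(ring)
            if is_healthy(candidate):
                return candidate
        return None  # every service is down

    return pick


# Hypothetical service names; analytics-2 is simulated as down.
services = ["analytics-1", "analytics-2", "analytics-3"]
down = {"analytics-2"}

pick = make_picker(services, lambda s: s not in down)
picked = [pick() for _ in range(4)]
```

With one of the three services down, users keep being served by the remaining two. Nothing comparable protects the single loader service in this architecture.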
Load Balancing and High Availability at both Loader and Analytics
Having high availability at the analytics level ensures that analytics users are not affected by individual analytics service failures. However, since there is no high availability at the loader level, the loader service can still fail and leave users with stale data.
The simple solution is to add multiple loader services, which ensures high availability for data extraction. Beyond that, a slightly more complex architecture can be configured to distribute the extraction load evenly among multiple loader services.
In the following architecture diagram, three primary loader services are defined, each extracting data for a particular business area. This is achieved by assigning schemas to a specific loader service in the distribution.properties file in the tenant directory.
Distribution file for the above architecture
In the above distribution file, schemas related to three different business areas are assigned to three different loader services. In addition, a backup loader service is configured to provide high availability if a primary loader service fails.
Note: A default loader service is also assigned to handle extraction for schemas that have been defined but not yet assigned to a specific loader service. If no default is specified, new schemas that are not yet assigned to a loader service will not be extracted.
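The assignment rules described above (primary loader per schema, a backup on failure, and a default for unassigned schemas) can be sketched as a small lookup. This is only an illustration of the logic, under stated assumptions: it does not reproduce the syntax of distribution.properties, and the schema names, loader names, and the `resolve_loader` function are hypothetical. It also assumes a single backup shared by all three primary loaders.

```python
def resolve_loader(schema, primary, backup, default_loader, healthy):
    """Return the loader that should extract `schema`, or None if none is up.

    primary        : schema name -> assigned primary loader (hypothetical)
    backup         : primary loader -> its backup loader (hypothetical)
    default_loader : loader used for schemas with no explicit assignment
    healthy        : set of loaders currently running
    """
    # Unassigned schemas fall through to the default loader.
    loader = primary.get(schema, default_loader)
    if loader in healthy:
        return loader
    # Primary is down: fall back to its backup, if one exists and is up.
    standby = backup.get(loader)
    return standby if standby in healthy else None


# Hypothetical layout: three business areas, one shared backup, one default.
primary = {"sales": "loader-1", "finance": "loader-2", "hr": "loader-3"}
backup = {name: "loader-backup" for name in ("loader-1", "loader-2", "loader-3")}
healthy = {"loader-1", "loader-3", "loader-backup", "loader-default"}  # loader-2 is down

sales_loader = resolve_loader("sales", primary, backup, "loader-default", healthy)
finance_loader = resolve_loader("finance", primary, backup, "loader-default", healthy)
new_schema_loader = resolve_loader("new_schema", primary, backup, "loader-default", healthy)
```

In this sketch, `sales` stays on its primary loader, `finance` moves to the backup because its primary is down, and the unassigned `new_schema` is handled by the default loader. Without a default, the last lookup would have no loader to return, which mirrors the note above.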
Advantages of the above architecture
Distribution of data extraction by assigning business-specific schemas to specific loader services
High Availability of the loader service by defining an additional loader service for each business area
High Availability of Analytics
This architecture is also suitable for organizations that cannot afford high-end machines and need to use commodity hardware with limited memory and processing power.