Hardware Sizing Recommendations
Many factors influence how we design the topology and hardware required for our customers' environments. Data volumes, incremental refreshes, end-user concurrency, high availability, the need for Spark, and other considerations dictate how we design our Incorta environments. Because of these complexities, Incorta sizing ends up being as much an "art" as a "science". The purpose of this document is to explain how we think about proper Incorta hardware sizing, describe which factors influence our design decisions, and provide some "rules of thumb" that can be applied in practice for our customers.
The Incorta Professional Services team is always available to guide our partners and customers with their Incorta hardware sizing initiatives. The Incorta Professional Services team has developed a HW sizing questionnaire that collects the necessary inputs about the customer's environment and application which inform a proper sizing. Please contact your Incorta team to gain access to this sizing questionnaire.
We recommend that you be familiar with these Incorta concepts before exploring this topic further.
These concepts apply to Incorta versions 4.5 and later, including version 5.0.
The three primary components of the Incorta platform are the Loader Service, the Analytics Service, and the Incorta Storage Layer. The loader service, as the name implies, is responsible for invoking the Incorta connectors to pull data into the Incorta platform, either in full or incrementally. The loader service requires enough memory (i.e. RAM) to house about 1.5x the size of the compressed data in the storage layer. It requires this memory to keep all of the data at the ready so that incremental loads can be merged and subsequently written to the Incorta storage layer. The columnar storage layer uses the parquet file format to store the data on shared disk. Once the updates have been written to the storage layer, the analytics service is alerted to which columns have changed and pulls in the changes accordingly, with no downtime for end users.

The analytics service is responsible for housing the in-memory direct data mapping (DDM) engine for the platform. This engine answers all incoming queries of the data, whether through the Incorta UI or the Incorta SQL interface – both of which are also contained within the analytics service.

The CMC, or Cluster Management Console, is the administration, monitoring, and provisioning service for the platform. It can create, start, stop, and monitor the different loader or analytics services in the environment. The CMC becomes particularly useful when multiple servers or services exist within the Incorta environment.

Spark, which is optional, can be leveraged to address advanced use cases like materialized view creation and/or SQL queries back to the parquet files on disk. The reference architecture and table below provide a more detailed description of each component.
| Component | Description |
| --- | --- |
| Incorta Node (not depicted) | An Incorta "node" is defined as any server where the Incorta platform is installed. Through Incorta node "agents", the Incorta Cluster Management Console (CMC) can provision and administer any of the Incorta services (Loader or Analytics) on any Incorta node in the environment. In the diagram above, all of the components displayed are running on a single server, or single Incorta node. |
| Zookeeper (not depicted) | Incorta installs with Apache Zookeeper. Zookeeper is used to coordinate messaging between components as well as to elect new leader services when services fail – which is particularly important in high availability configurations. |
| Loader Service | The loader service is responsible for loading data, fully or incrementally, into the Incorta platform. It runs as its own java process within the Incorta platform and can be provisioned by the CMC to run on the same, or a separate, Incorta node as the analytics service. The loader service invokes the connectors to pull data from the source systems and loads the data into memory (on-heap or off-heap) to build the compressed columnar parquet and DDM files that the platform requires. As of version 4.5, the loader service will only keep in memory the columns it needs to calculate joins or formula columns. |
| Analytics Service | Like the loader service, the analytics service runs as its own java process within the Incorta platform and can be provisioned by the CMC to run on the same, or a separate, Incorta node as the loader service. Primarily, the analytics service houses the in-memory direct data mapping (DDM) engine for the platform. This engine answers all incoming queries of the data, whether through the Incorta UI or the Incorta SQL interface – both of which are also contained within the analytics service. Once the loader service has written data updates to the storage layer, the analytics service picks up those changes, column by column, without any service interruption to end users. |
| Cluster Management Console | The CMC, or Cluster Management Console, is the administration, monitoring, and provisioning service for the platform. It too runs as its own web-based java process and offers a friendly, web-based console to Incorta platform administrators. It can provision, start, stop, and monitor the different loader or analytics services in the environment. |
| Storage Layer | The Incorta storage layer consists of two parts: the metadata database, which can run as a Derby, MySQL, or Oracle database, and the shared disk, which can be mounted equally to the Incorta nodes running the loader and analytics services. The metadata database persists the configuration details of the Incorta application (users, groups, dashboard definitions, schema definitions, etc.). The shared disk stores the columnar parquet files (~one parquet file per table) for the platform. |
| Spark | Spark, which installs with the platform but does not have to be used, can be employed to address advanced use cases like materialized view creation and/or SQL queries back to the parquet files on disk. The size of the Spark deployment depends on the usage. For more straightforward uses or smaller datasets, the single "standalone" Spark utility that installs with Incorta is usually sufficient. For more complex uses or larger datasets, Spark may be installed on its own, separate server cluster. Consult Incorta Professional Services or a trusted Incorta partner for Spark sizing guidance. |
As an in-memory analytics platform, the volume of data we anticipate bringing into the platform is one of the most important considerations when sizing the infrastructure of an Incorta environment. When we consider data volumes, we want to home in on the expected volume of data in-memory in Incorta. Because data in the Incorta platform is compressed, we can't rely solely on the data volumes reported by the source systems. Here are some rules for estimating the expected data volumes in Incorta:
- Load data into Incorta first. This is obviously the most straightforward way to determine the anticipated in-memory volumes for your Incorta implementation. However, during POCs or development, we are not always afforded this luxury. Oftentimes, we are only allowed to work with subsets of the data upfront and, in those cases, we have to apply some of the other rules called out below to arrive at responsible data volume estimates. If we do have the luxury of loading all of the customer's production data upfront, we can gather the in-memory data volumes from the schema page or from the memory.jsp page.
- Look to the tables in the source. It can be tempting to simply ask our DBAs how big our source systems are and use that as our guide for anticipated Incorta data volume. This approach, however, is almost always overly conservative. While Incorta makes it very easy to bring in any and all tables from our source systems, why would we bloat our Incorta platform with data that doesn't have any analytical value? For this reason, it is almost always a good idea to home in on the volumes of the tables you plan to bring into the platform. And, frankly, getting the volumes (in number of records or in amount of memory) of the 10 largest tables is usually more than sufficient. Many of the smaller tables (less than 1M rows, for example) don't "move the needle" that much.
- Interrogate the source system. If you have the means, SQL scripts can be written against relational database source systems to provide metadata about each table's record count and size. Incorta has such a script available for EBS databases running on Oracle.
- Apply a compression factor. Once we have a decent understanding of which tables we intend to bring into Incorta and their relative sizes, we must apply a compression factor to arrive at our anticipated in-memory volumes in Incorta. As previously stated, data brought into the Incorta platform is compressed. This is true both for data at rest in the parquet layer on disk and for data in-memory in the analytics service. Thus, when calculating the anticipated data volumes in Incorta, we must apply a compression factor to the data identified in our source systems. Typically, we see compression ratios of about 30-70% in Incorta. The compression ratio is always a function of the cardinality of our data. Cardinality describes the uniqueness of our data. A column could be said to have "high cardinality" if its values are nearly unique, like a timestamp or a key column. High cardinality data does not compress as effectively (~20-30%, for example). Conversely, a column with "low cardinality" has fewer distinct values, like a country code or a product category. Low cardinality data compresses very well (~70-80%, for example). The most reliable way to determine the appropriate compression ratio is to load some of the data into Incorta from the source system and compare the source size to the Incorta in-memory size. Barring that, we typically use a conservative compression ratio of ~40% in our estimation efforts.
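As a rough illustration, the estimation logic above can be sketched in a few lines of Python. The table names and sizes below are hypothetical, and the 40% default compression ratio is the conservative rule of thumb mentioned above, not a guaranteed figure:

```python
# Hypothetical source table sizes in GB (e.g. the 10 largest tables).
source_tables_gb = {
    "gl_je_lines": 120,
    "ap_invoice_distributions": 80,
    "sales_order_lines": 45,
}

def estimate_incorta_size_gb(tables_gb, compression_ratio=0.40):
    """Estimate the compressed in-memory footprint in Incorta.

    compression_ratio is the fraction of size removed by compression,
    so the remaining footprint is (1 - compression_ratio) of the source
    size. A conservative ~40% default is used when no measured ratio
    is available.
    """
    total_source_gb = sum(tables_gb.values())
    return total_source_gb * (1 - compression_ratio)

x = estimate_incorta_size_gb(source_tables_gb)
print(f"Estimated compressed footprint (x): {x:.0f} GB")
```

Wherever possible, replace the default ratio with one measured by loading a sample of real data and comparing the source size to the in-memory size reported on the Incorta schema page.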
How we plan to refresh our data in Incorta can have a significant impact on how we provision the Incorta services in our environment. In previous versions of Incorta (version 3.x and prior), the Incorta dashboards would become unresponsive during data reloads, causing a frustrating experience for end users while incremental refreshes were being applied. In 4.x and beyond, the Incorta loader service has been decoupled from the analytics service, alleviating this issue. However, our refresh cadence and strategy still impact our design decisions for the Incorta environment. Provisioning the Incorta loader service on separate hardware from the analytics service is a good idea for Incorta applications that execute frequent or voluminous updates. This separation avoids the resource constraints the loader service would otherwise impose on the analytics service during updates. It may not be necessary, however, for Incorta applications that deliver infrequent data refreshes or whose overall data volumes are small. Lastly, it is worth noting that the Incorta loader service can execute resource-intensive tasks during data refreshes, like rebuilding large joins between tables or recalculating formula columns, even if the incremental refresh data volumes are small.
Understanding the anticipated end-user concurrency and query load is an important consideration when designing how the analytics service is provisioned in your Incorta environment. Deploying more than one analytics service is required for high-availability environments, but it may also be required if high end-user concurrency is anticipated. The number of simultaneous end users querying the system defines our end-user concurrency. Peak end-user concurrency is the maximum number of simultaneous end users who may ever be accessing Incorta at the same time. We do want to be conservative when estimating this metric but, too often, we are inclined to assume that this number is 50-75% of the named end-user count. Named end users are the total number of people who have access to the system. In our experience, peak end-user concurrency rarely gets above 20% of the named end-user population. While it ultimately depends on the underlying CPU cores available to each analytics service, it typically makes sense to start considering introducing a new analytics service on its own server when the concurrent end-user count equals X.
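To make the guidance above concrete, here is a small, hypothetical sketch of how peak concurrency might be estimated from a named-user count, using the ~20% observation above rather than the 50-75% figure that is often assumed:

```python
def estimate_peak_concurrency(named_users, peak_fraction=0.20):
    """Estimate peak simultaneous end users.

    peak_fraction=0.20 reflects the observation that peak concurrency
    rarely exceeds ~20% of the named-user population. Adjust it if
    real usage data suggests otherwise.
    """
    return max(1, round(named_users * peak_fraction))

# Example: 500 named users -> plan for roughly 100 concurrent users.
print(estimate_peak_concurrency(500))  # → 100
```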
Spark is a very powerful component of the Incorta platform that unlocks a wide variety of advanced use cases for our customers. That said, it isn't required for every use case. It is important to understand enough about how your Incorta application will be designed and implemented upfront to know whether Spark will be required. Derived tables (aka materialized views) defined by python scripting (aka PySpark), as well as aspects of Incorta's SQL interface for downstream 3rd-party visual tools like Tableau and PowerBI, all require Spark to be running and available. Understanding whether Spark will be used is an important part of designing the topology of your Incorta environment. Whether Spark will be used is one question; sizing Spark for your Incorta environment is an exercise all its own. Check out the "Sizing Spark" section of this guide below to learn more.
Some organizations see their analytical applications as mission critical and require 24/7 uptime. For these customers, we design and provision the Incorta environment with high availability in mind. Most commonly, this means provisioning more than one analytics service in an active-active fashion behind a load balancer. Some customers go further and also want to ensure fault tolerance for the loader service and metadata database. Understanding these uptime requirements is as important as any of the other considerations discussed here when designing your Incorta topology.
As the options enumerated below would suggest, there are a number of ways in which the topology of an Incorta environment can be designed. The table below showcases some of the more common topology options, ordered from most straightforward to most complex. These options are accompanied by recommendations which highlight when each option is most effective. These options, however, do not represent every permutation of environment design.
A few things to point out that are not captured in the visual topologies below:
- Zookeeper – Incorta requires Zookeeper to run, even in a single-server deployment. Zookeeper requires an odd number of servers, which needs to be considered when designing your Incorta environment. Zookeeper is not depicted in the topology options below.
- Shared Disk – Shared disk between the loader service and analytics service is required for Incorta to run. In the most straightforward case, if Incorta is installed on a single server, the shared disk can simply be the local linux filesystem. In more complex topologies, shared disk needs to be mounted to each server running an Incorta loader service and/or analytics service.
- 100% High Availability – For truly fault tolerant and HA architectures, the metadata database and the CMC can be setup with redundancy too. This is less common, however, and is not depicted in the options below.
- Web Servers – For customers who wish to deploy Incorta with access to external customers or through a DMZ, a web server "layer" may be required "in front" of the Incorta analytics service. A topology with web servers is not depicted in the topology options below.
| Topology | Description |
| --- | --- |
| Single Server | For the most straightforward use cases with small to moderate datasets and relatively infrequent incremental loads, installing all of the Incorta services on a single server is the best and most straightforward approach. Although not depicted here, Spark can also run on this single server if the data volumes are small and the usage is straightforward. |
| Decoupling Load from Analytics | This is the most commonly deployed topology for Incorta. Splitting the loader service onto a separate server from the analytics service eliminates any contention for CPU when full or incremental loads are running. The Incorta node containing the loader service will need more memory (~1.5x the size of the data in the compressed storage layer). The analytics service, which only loads the columns the dashboards require from the storage layer, typically needs much less memory (~1x the size of the data in the compressed storage layer). |
| Decoupling Load from Analytics - Spark Separated | When applications built on the Incorta platform require heavy Spark usage, it can make sense to break Spark off onto its own server/cluster. |
| Decoupled, Redundant Analytics Services | There are three primary reasons to consider introducing additional analytics services to your Incorta topology. |
| Decoupled, Redundant Analytics Services - Spark Separated | If requirements call for redundant analytics services (described above) and there is heavy Spark usage, this is the right topology. |
| Decoupled, Redundant Load & Analytics Services | For Incorta environments that support numerous tenants and schemas with a large amount of data across them, at some point it makes sense to introduce additional loader services. Incorta supports "pinning" data loading to particular loader services at the tenant and/or schema level. Provisioning Incorta applications to run across a multitude of loader and analytics services helps customers maintain a single Production environment for all Incorta use cases. |
| Decoupled, Redundant Load & Analytics Services - Spark Separated | Same as above, but for heavy Spark usage. Note that the metadata database can run on existing Oracle or MySQL databases, which do not have to be installed on the Incorta servers themselves. |
The following section outlines our common recommendations around hardware, both for on-premises configurations and for cloud-based configurations. Please note that this section is not intended to prescribe the specific servers or the amount of compute required, as that is the role of this document as a whole. Rather, this section is to help define the types of hardware to consider for both on-premises as well as cloud-based deployments.
- Bare-metal or Virtual Machines: Both are viable. If virtual machines (VMs) are used, there will be a slight performance penalty for the Incorta platform. Additionally, the VMs should be given dedicated resources from the underlying physical servers to avoid resource contention with other VMs.
- Operating System: Incorta recommends 64-bit linux for the O/S. While Incorta is a java-based platform that can also run on Windows, we do all of our internal development and testing using RHEL 6/7 and/or CentOS 7. What's more, if Spark is to be used, there are many potential issues with running Spark on a Windows O/S.
- CPU: Most modern x86-64 chip architectures will be sufficient for Incorta. Incorta typically wants at least 16 CPU cores available to each of the analytics and loader services; as stated above, however, the exact number of cores will be a function of the sizing considerations already discussed. An example of a sufficient CPU: Intel Xeon x86-64 CPU E5-2680 v2 @ 2.80GHz.
- RAM: Physical memory requirements are going to be a function of the sizing considerations called out above including data volumes, spark usage, topology and more. Most customers run the services of the platform on servers that have between 64-512GB of RAM.
- Shared Disk: There are a variety of ways to mount storage so that it can be "seen" by multiple servers: shared SAN drives, shared NFS drives, cross-mounting between servers, etc. Incorta doesn't care how this shared disk is accomplished as long as it offers each service in the Incorta platform good performance. We typically recommend SSD with around 5000 IOPS and throughput of 500 MB/s.
The instance type used, as one might imagine, is also a function of the sizing considerations explained above. Cloud-based VMs should be sized in a similar fashion to on-premises servers or VMs.
- AWS EC2 - Here are some common image types used (typically from the "Memory Optimized" family of instances):
- AWS Shared disk options:
- If only 2 nodes are planned - you can use NFS disk hosted by one of the nodes, or
- Provisioned EFS with at least 250 MB/s of throughput
- Azure - The most common Azure instances used with Incorta are:
- F64 v2 (128 GB, 64 vCPU)
- D64s v3 (256GB, 64 vCPU)
- M64s (512GB, 64 vCPU)
Whether we deploy on-premises or in the cloud, the number of environments is really up to the customer and their common practices around code migration. At a minimum, we recommend two environments – DEV and PROD. In this simplest of migration paths, DEV should serve as both the development and testing environment. To the degree possible, if there are only two environments, we strongly recommend that the DEV-TEST and PROD environments have equivalent topology designs and hardware resources. This allows for much more effective troubleshooting down the road. For a more sophisticated BI environment, it makes sense to have three environments: DEV, UAT, and PROD.
Most of the "rules of thumb" we apply for sizing our Incorta environment depend on the amount of compressed data we expect in the storage layer, which we call "x" below. Please reference the "Data Volumes" primary consideration section above to understand how we estimate "x". The list below enumerates the main rules we apply when sizing our Incorta hardware.
- The amount of memory (i.e. RAM) required by the loader service should be roughly 1.5x the size of the compressed data in Incorta memory as reported by the Schema page in the platform. For example, if we have 100GB of data reported by the Incorta Schema page, we would want the loader service to have access to 150GB of RAM.
- The amount of RAM for the analytics service is a bit trickier, as the analytics service only loads the columns required to render dashboards or service queries through the SQL interface. Thus, if we have 5000 columns across all of the tables in our schemas, yet our dashboards only employ 500 of those columns for display, filters, calculations, etc., then we would only require 10% of the memory required by the loader service. We are typically conservative here, however, as dashboards and query profiles can change over time. For that reason, we typically suggest that the analytics service have about 1x the memory of the size of the compressed data (compared to the 1.5x for the loader service). NOTE: Versions 4.9 and 5.0 of Incorta introduce a new feature called "background sync", which loads updated columns into an offline temporary memory space first and then performs an in-memory swap with the old columns. This feature therefore requires slightly more analytics service memory than previous 4.x versions. The sync is done schema by schema, so if you have a big schema with a lot of volume, consider bumping this up to 1.2-1.5x (depending on the size of your largest schema relative to the whole data footprint).
- The amount of disk required for the shared storage layer should be at least 10x the size of the compressed data (x).
- The number of cores for the loader service is a function of how many tables and schemas load in parallel and we typically recommend at least 16 cores for the server.
- The number of cores for the analytics service is a function of peak end-user concurrency and average number of insights per dashboard and we typically recommend at least 16 cores for each server in your topology.
- For both the analytics service and the loader service, through the CMC, set the on-heap memory to 25% of the memory allocated to the service and set the off-heap memory to 75% of the memory allocated to the service.
- Leave about 10GB of RAM for the server operating system when configuring your hardware or VMs.
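Taken together, the rules of thumb above can be captured in a small sizing sketch. This is illustrative only (the function name and output structure are our own, not an official Incorta tool); "x" is the compressed data footprint estimated in the "Data Volumes" section:

```python
def size_incorta(x_gb, os_reserve_gb=10):
    """Apply the sizing rules of thumb to a compressed footprint x (GB)."""
    loader_ram_gb = 1.5 * x_gb       # loader service needs ~1.5x
    analytics_ram_gb = 1.0 * x_gb    # analytics service: ~1x, conservatively
    return {
        "loader_ram_gb": loader_ram_gb,
        "analytics_ram_gb": analytics_ram_gb,
        # Leave headroom for the operating system on each server.
        "loader_server_ram_gb": loader_ram_gb + os_reserve_gb,
        "analytics_server_ram_gb": analytics_ram_gb + os_reserve_gb,
        "shared_disk_gb": 10 * x_gb,  # storage layer: at least 10x
        "min_cores_per_server": 16,
    }

# Example: 100 GB of compressed data in Incorta.
print(size_incorta(100))
```

For 100 GB of compressed data, this yields 150 GB of RAM for the loader service, 100 GB for the analytics service, and at least 1 TB of shared disk, matching the rules above.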
Incorta v4.5+ and v5 support both on-heap and off-heap memory management for both the analytics and the loader service. On-heap memory refers to memory managed inside the java process itself, or the java "heap" space. Off-heap memory refers to memory managed outside the java process. As java processes, the analytics service and loader service both have settings in the CMC that control the on-heap usage vs. the off-heap usage. Please see the screenshot below from the CMC.
As a rule of thumb, it is a good practice to keep about 75% of the memory offered to Incorta off-heap. This keeps the memory in the java process for the loader and analytics service (aka. the on-heap memory) small which makes java maintenance activities like garbage collection less intrusive.
If we revisit the concept of "x" from the "Rules of Thumb" section above, we are ultimately interested in the amount of compressed data we expect in the Incorta storage layer (aka "x"). Based on the rules of thumb, we plan for 1.5x memory for the loader service and 1x memory for the analytics service. The "Memory Size" setting controls the on-heap memory size and should be 25% of the 1.5x or 1x memory size, for the loader and analytics service respectively. The off-heap component should be set to about 75% of the 1.5x or 1x memory size, for the loader and analytics service respectively.
Keep in mind these settings only control the amount of memory allotted to the Incorta services. For example, you may have a 1TB RAM server and only choose to allocate 150GB to the loader service and 100GB to the analytics service.
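As a worked example of the 25%/75% split (using the illustrative 150 GB loader and 100 GB analytics allocations described above):

```python
def heap_split(service_ram_gb, on_heap_fraction=0.25):
    """Split a service's memory allocation into on-heap and off-heap.

    Keeping ~75% off-heap keeps the java heap small, which makes
    garbage collection less intrusive.
    """
    on_heap_gb = service_ram_gb * on_heap_fraction
    return on_heap_gb, service_ram_gb - on_heap_gb

print(heap_split(150))  # loader:    37.5 GB on-heap, 112.5 GB off-heap
print(heap_split(100))  # analytics: 25.0 GB on-heap,  75.0 GB off-heap
```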
The following section provides a few responsible sizing scenarios for Incorta production environments. Keep in mind that the following is only meant to showcase suitable examples of valid hardware sizing based on the inputs. There are certainly other valid options.
|Considerations||Production Topology||Production Server(s) Recommendation|
1 VM, AWS m5.4xlarge, CentOS 64-bit linux, 100 GB mounted EBS storage volume
1 VM, CentOS 64-bit, 64 GB of RAM, 16 virtual CPUs, 100 GB of local disk
|Considerations||Production Topology||Production Server(s) Recommendation|
3 VMs (all CentOS):
3 physical servers (all CentOS):
|Considerations||Production Topology||Production Server(s) Recommendation|
5 physical servers (all RHEL 7):