cancel
Showing results for 
Search instead for 
Did you mean: 
Tristan
Employee
Employee

Introduction

Incorta Materialized Views are a powerful way to enrich data contained in Incorta tables.  Leveraging Spark's processing engine, Materialized Views (MVs) can be defined to introduce enrichments and advanced analytics to reshape your data to your reporting needs.  There are a handful of performance-related things to consider when building your MVs. 

What you should know before reading this article

We recommend that you be familiar with these Incorta concepts before exploring this topic further.

Applies to

The information in this article applies to versions 4.0 of Incorta and above. 

Let’s Go

Performance Tuning Spark MVs

Often the logic used in materialized views can be complex and can have performance issues. There are different levers you can pull when performance tuning materialized views.

  1. Look into optimizing the underlying Python or SQL logic used in the MV. The biggest performance gains can be obtained by enhancing the logic used in the MV
  2. Configure spark parameters to improve the performance of the MV
  3. Clean up open notebooks that may be consuming spark memory to free up memory for the MV
  4. Use the PySpark option and break the sql statements into multiple data frames and then do the final processing at the end

Optimizing MV SQL logic

  1. Watch out for UNION versus UNION ALL. UNION will do a distinct and has a higher cost than UNION ALL. Use UNION ALL instead of UNION which will give you a huge performance gain when you are unioning multiple SQLs.
  2. Watch out for DISTINCT. The usage of DISTINCT in any of the SQL statements will cause performance degradation. You can accomplish the same thing via a GROUP BY.
  3. Check if the SQL logic is correct and if it can be rewritten using fewer UNIONS and more restrictive filters to reduce the number of rows processed.
  4. The Incorta engine does deduplication based on the key definition. Alternatively, you may eliminate the duplicates by the key defined, and not use UNION, DISTINCT, or GROUP BY operations mentioned above.  However, it is recommended that you perform such deduplication in Spark, not relying on the Incorta engine since the disk I/O can be reduced.  The Incorta parquet compaction process will also have less data to process as a result.
  5. If you need to use the result of a query as a filter for another query, in other words a typical EXISTS or IN subquery, consider using LEFT OUTER JOIN instead.  If you are using SQL in PySpark, consider using the ANTI JOIN or SEMI JOIN in such cases.
  6. When denormalizing data from a parent table to child table, consider performing the join as a regular incorta join, especially if the parent table has a large column of string data.  This can reduce I/O and disk space and thus shorten the MV processing time.
  7. Watch out for MVs using Incorta postgreSQL queries based on a business schema. These may result in multiple joins and complex plans.  In such a case, consider doing the joins using the Incorta engine instead of pushing the joins into an MV.
  8. Another option is to use the Pyspark option and break the sql statements into multiple data frames and then do the final processing at the end. Refer to this article for further information.

Optimizing Spark Parameters

Once you have looked into MV logic optimization, you can look into configuring the spark parameters. Before you do that check how much memory is available. This can be done from the Spark page. If you are an Incorta Cloud customer, you will have to contact Incorta support for the URL and credentials. 

Get your spark URL and check how much memory and how many cores are available. The size of memory and the number of cores available is limited by the server that hosts the Spark cluster. It is also configurable from spark-env.sh file.

Tristan_0-1647291190891.png

Available Spark configurations for MV are listed below.  Note that these are based on the machine specs and impact the performance significantly. The configuration tab of a sample MV can looks like this. Start by setting the five key properties mentioned below and go from there: 

Tristan_1-1647291190889.png    

Key spark parameters

  1. Spark.driver.memory: Amount of memory to use for the driver process, i.e. where SparkContext is initialized, in GB unless otherwise specified (e.g. 1g, 2g). This is the memory allocated for the spark logical plan and any spark actions that transfer the data to the driver. The default is 1g. 
  2. Spark.executor.memory: Amount of memory to use per executor process, in GB unless otherwise specified (e.g. 2g, 8g). This is the memory allocated for executing the logic and by default it is 1g. Check the amount of memory available from the spark page for use before setting this parameter.
  3. Spark.cores.max: The maximum amount of CPU cores to request for the application from across the cluster. For example, if we set the cores.max to 44, that means the MV will only use 44 cores from the cluster.  Check the amount of cores available from the spark page for use before setting this parameter.
    • NOTE: For Incorta Cloud this parameter is not used and we need to use spark.executor.instances. This is the outcome of dividing cores.max by executor.cores
  4. Spark.executor.cores: The number of cores to use with each executor. If this is the same as Spark.cores.max then only one executor will be used. For eg if Spark.executor.cores is 2 and Spark.cores.max is 4, then two executors will be used. If Spark.executor.memory is set to 2g then a total of 2 * 2g which is 4g will be used.  
  5. Spark.sql.shuffle.partitions: Configures the number of partitions to use when shuffling data for joins  or aggregations. Initially set the shuffle partition equals to cores.max or its multiple and as the data increases you can increase it by (cores.max * 2, cores.max *3, cores.max *4, …) and so on.  Start with 4 and then increase as needed. 

Related Material

Contributors
Version history
Last update:
‎03-16-2022 11:26 AM
Updated by: