Why do we need Spark?
Spark is not mandatory, but it helps with distributed processing of Parquet data. Complex stored procedures can be converted to PySpark programs and run on a Spark cluster.
Here is the documentation on how to set up Spark with Incorta: https://docs.incorta.com/4.4/spark/
When describing how to configure Spark on a distributed node, the article above suggests copying spark_home to a shared drive. Avoid that: running Spark out of a shared drive leads to performance degradation.
Instead, follow these steps:
1. Zip spark_home on the Incorta node.
2. Unzip it on the Spark machine's local disk, NOT on the shared disk.
3. Modify spark-env.sh and spark-defaults.conf, found under the spark_home/conf directory, to change the hostname to the name of the Spark machine.
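The steps above can be sketched as a shell script. This is a local simulation under assumptions: the paths (`/tmp/demo_incorta`, `/tmp/demo_spark_node`), the hostnames (`incorta-node`, `spark-node`), and the mocked config contents are all placeholders, not Incorta defaults; it also uses `tar` in place of `zip` (a `zip -r` / `unzip` pair works the same way), and in a real deployment the archive would be copied to the Spark machine (e.g. with `scp`) rather than extracted on the same host.

```shell
#!/bin/sh
set -e

# Placeholder paths (assumptions): in reality SPARK_SRC is spark_home on the
# Incorta node and SPARK_DEST is a LOCAL disk on the Spark machine.
SPARK_SRC=/tmp/demo_incorta/spark_home
SPARK_DEST=/tmp/demo_spark_node

# Mock a minimal spark_home/conf so this sketch is self-contained.
mkdir -p "$SPARK_SRC/conf"
echo "SPARK_LOCAL_IP=incorta-node"      >  "$SPARK_SRC/conf/spark-env.sh"
echo "spark.driver.host incorta-node"   >  "$SPARK_SRC/conf/spark-defaults.conf"

# Step 1: archive spark_home on the Incorta node (tar shown; zip works too).
( cd "$(dirname "$SPARK_SRC")" && tar czf spark_home.tar.gz spark_home )

# Step 2: extract on the Spark machine's local disk (simulated here by a
# second local directory; normally you would scp the archive over first).
mkdir -p "$SPARK_DEST"
tar xzf "$(dirname "$SPARK_SRC")/spark_home.tar.gz" -C "$SPARK_DEST"

# Step 3: point both config files at the Spark machine's hostname.
for f in spark-env.sh spark-defaults.conf; do
  sed -i 's/incorta-node/spark-node/g' "$SPARK_DEST/spark_home/conf/$f"
done

# Show the rewritten settings.
grep -h spark-node "$SPARK_DEST"/spark_home/conf/*
```

After step 3, start the Spark master/worker processes from the extracted spark_home on the local disk, not from any copy left on the shared drive.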