Pandas (and other Python libraries) in Materialized View

Hi all, 

I am a new Incorta user, and I'm wondering: is it possible to import pandas into materialized views?

I have installed Anaconda in the Linux environment where Incorta is hosted and can confirm that it works from the command line, but Incorta does not seem to pick it up.

  • Hi Michael,

    Linux environments can sometimes have multiple installs of Python. Can you go to the Linux command line of the server running Spark for Incorta, switch to the user that Incorta runs as, and run the 'which python' command to confirm that Incorta is using the same Python install that you installed pandas into?

    Thanks,
    Dan
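
The same check can also be made from the Python side. The following is a minimal sketch, assuming it is pasted into a PySpark materialized view (or run with the interpreter you suspect Spark is using). The helper name `describe_python_environment` is hypothetical, and where the printed output ends up (for example, the Spark logs) depends on your setup.

```python
import importlib.util
import sys


def describe_python_environment(module_name="pandas"):
    """Report which interpreter is running and whether module_name is importable."""
    # find_spec returns None when the top-level module cannot be found.
    spec = importlib.util.find_spec(module_name)
    return {
        "executable": sys.executable,          # full path of the running interpreter
        "version": sys.version.split()[0],     # e.g. "3.8.10"
        "module": module_name,
        "module_found": spec is not None,
        "module_location": getattr(spec, "origin", None),
    }


print(describe_python_environment("pandas"))
```

If the reported `executable` is not the Anaconda interpreter, that would explain why pandas works on the command line but not inside Incorta.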

    • Also, keep in mind that pandas DataFrames, unlike Spark DataFrames, are not parallelized. This means that if and when you run your Incorta materialized view on a Spark cluster (down the road), the pandas logic won't be able to take advantage of the compute power of the whole cluster.
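
One common way to live with this limitation is to do the heavy aggregation on the Spark DataFrame and convert only the small result to pandas. This is a rough sketch, not Incorta-specific: the function name and the `key_column` and `limit` parameters are hypothetical, and `spark_df` is assumed to be a PySpark DataFrame you already have in the materialized view.

```python
def summarize_then_to_pandas(spark_df, key_column, limit=100):
    """Aggregate on the Spark cluster, then hand a small result to pandas."""
    summary = (
        spark_df.groupBy(key_column)         # distributed across the executors
        .count()                             # adds a "count" column
        .orderBy("count", ascending=False)
        .limit(limit)                        # keep the result small
    )
    # toPandas() collects every remaining row onto the driver, so it should
    # only be called on data that fits in the memory of one machine.
    return summary.toPandas()
```

Everything before `toPandas()` stays distributed; everything after it runs in a single Python process.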

  • Hi Michael,

    1) In the <incorta home>/IncortaNode/spark/conf folder, edit the spark-env.sh file to set the following line (with no spaces around the equals sign, since this is a shell script), then save the file and restart Spark:

    PYSPARK_PYTHON=<full path to python executable>

    (for reference : https://spark.apache.org/docs/latest/configuration.html#environment-variables)

    2) You can use pandas, but pandas DataFrames, unlike Spark DataFrames, are mutable and held in the memory of a single machine, so Spark cannot process them in a distributed manner. Koalas is a newer open source Python project that augments PySpark’s DataFrame API to make it compatible with pandas. Please refer to https://databricks.com/blog/2019/04/24/koalas-easy-transition-from-pandas-to-apache-spark.html

    Thanks

    Amit
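
Before putting a path into PYSPARK_PYTHON, it may help to confirm that the interpreter at that path can actually import the library you need. A minimal sketch follows; the helper name `interpreter_can_import` is hypothetical, and the path you pass in is whatever interpreter you intend to configure.

```python
import os
import subprocess
import sys


def interpreter_can_import(python_path, module_name):
    """Return True if the interpreter at python_path can import module_name."""
    # The path must point at an existing, executable file.
    if not (os.path.isfile(python_path) and os.access(python_path, os.X_OK)):
        return False
    # Run the import in a separate process using that exact interpreter.
    result = subprocess.run(
        [python_path, "-c", f"import {module_name}"],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


# Example: check the interpreter running this script.
print(interpreter_can_import(sys.executable, "pandas"))
```

Run it as the same OS user that Incorta runs as, so that file permissions and user-level site-packages match what Spark will see.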

  • Hi Amit,

    I am trying to import wget in a materialized view, but I am getting "INC_005005001:Failed to load data from [spark://DESKTOP-KI0KIC7:7077] with properties [[error, No module named 'wget' ]]"

    Before this, after following the Spark integration documents, I was getting "Failed to connect to [spark] due to [null] with properties"; now I am getting the error above. Please let me know how I can fix this.


    Thanks,

    Satya

    • peddinti satya sashikanth, you need to install the module on your server before using it in Incorta. In my case, I used pip (e.g. "pip install wget") and was then able to proceed.

      Thanks,

      Dustin
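
One thing to watch for: a bare "pip" on the command line may belong to a different Python install than the one Spark uses. A small sketch of a safer pattern is below, assuming you run it with the interpreter that PYSPARK_PYTHON points to. The helper names `missing_modules` and `pip_install_command` are hypothetical.

```python
import importlib.util
import sys


def missing_modules(module_names):
    """Return the modules from module_names that this interpreter cannot find."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


def pip_install_command(module_name, python_path=sys.executable):
    """Build a pip command tied to a specific interpreter rather than a bare 'pip'."""
    return [python_path, "-m", "pip", "install", module_name]


for name in missing_modules(["wget", "pandas"]):
    # Print the command instead of running it, so nothing is installed by accident.
    print(" ".join(pip_install_command(name)))
```

On a multi-node Spark cluster, the module would need to be installed on every worker node, not only on the machine where Incorta runs.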

    • Dustin Basil, hey, thanks! I've done it. It is just like installing any other package, right? Anyway, it's working. Thanks again.

      Satya
