
How does PySpark work on materialized views? Why are only Python and SQL used when scripting an MV?

  • Materialized views (MVs) are computed by Spark, on top of the Incorta Parquet files. To run them, developers submit Spark jobs that carry out the required logic. Python and SQL are widely used among BI developers, database admins, and data scientists, so it's much easier to write your job in those languages than in, for example, Java or Scala.

     

    The workflow of a materialized view is simple:

    - User/developer submits the MV script (whether Python or SQL)

    - The script is wrapped in a Spark job and submitted to Spark (on load events)

    - The result is written back to a Parquet file stored inside the Incorta tenant directory

    - Now Incorta can treat the MV as a normal table

    - Upon another load event (whether full or incremental), the appropriate script will be resubmitted to Spark
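The steps above can be pictured with a toy, in-memory model. This is only a sketch and not how Incorta is implemented: the `ToyTenant` class, the `headcount` script, and the table and column names are all hypothetical, a dict stands in for the Parquet files, and a plain Python function stands in for the Spark job. Choosing between a full and an incremental script is not modeled.

```python
class ToyTenant:
    """Hypothetical stand-in for a tenant: tables plus registered MV scripts."""

    def __init__(self):
        self.tables = {}   # table name -> list of row dicts ("Parquet files")
        self.scripts = {}  # MV name -> callable that computes the MV rows

    def define_mv(self, name, script):
        # Step 1: the developer submits the MV script.
        self.scripts[name] = script

    def load(self, name):
        # Steps 2-3: on a load event the script runs (the "Spark job")
        # and its result is written back to storage.
        self.tables[name] = self.scripts[name](self.tables)
        return self.tables[name]


def headcount(tables):
    # Hypothetical business logic: employees per department.
    counts = {}
    for row in tables["HR.EMPLOYEES"]:
        counts[row["DEPT"]] = counts.get(row["DEPT"], 0) + 1
    return [{"DEPT": d, "HEADCOUNT": n} for d, n in sorted(counts.items())]


tenant = ToyTenant()
tenant.tables["HR.EMPLOYEES"] = [
    {"NAME": "Ann", "DEPT": "Sales"},
    {"NAME": "Bob", "DEPT": "Sales"},
    {"NAME": "Cy", "DEPT": "IT"},
]
tenant.define_mv("HR.DEPT_HEADCOUNT", headcount)
tenant.load("HR.DEPT_HEADCOUNT")

# Step 4: the MV can now be read like any other table.
print(tenant.tables["HR.DEPT_HEADCOUNT"])
```

Calling `tenant.load("HR.DEPT_HEADCOUNT")` again after the source table changes mirrors the last step: the script is rerun and the stored result is refreshed.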

  • Ahmed Moawad Is there any learning material from Incorta for PySpark scripting that would help us learn?

    • Delli They are pretty much classic PySpark scripts, nothing special about them. The only Incorta-specific things that I know of are two predefined helper methods:

      • read(TABLE_NAME): which will read the corresponding parquet file
      • save(DATA_FRAME): which will write the data frame (result set) into a corresponding parquet file

      So, your typical MV will be something like this:

      df = read('HR.EMPLOYEES')
      # your business logic goes here; it must end with a Spark DataFrame
      result_df = df
      save(result_df)
      
    • Ahmed Moawad I have a similar requirement: I am pulling data from a CRM cloud. I can get it into a pandas DataFrame, but save(all_created_data) gives me the error below. I am not using NumPy in my code; does this have anything to do with the save() method?

      INC_005005001:Failed to load data from [spark://s00186vmeinc2pd.vpc.starbucks.net:7077] with properties [[error, got all data 'numpy.dtype' object is not iterable ('TypeError', ':', TypeError("'numpy.dtype' object is not iterable",)) ]]

    • Mukul Ranjan The save method only accepts Spark DataFrames. It doesn't accept pandas DataFrames.

    • Mukul Ranjan 

      You are seeing that error because you are trying to save a pandas DataFrame in the MV. You need to convert the pandas DataFrame to a Spark DataFrame before you save it. Here is how you can do it:

       

      # Convert all_data (the pandas DataFrame) to a Spark DataFrame.
      # astype(str) casts every column to string first.
      df = spark.createDataFrame(all_data.astype(str))
      # Save the Spark DataFrame
      save(df)
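If the same MV sometimes ends up with a pandas DataFrame and sometimes with a Spark one, a small guard in front of save() may help. The sketch below is only an illustration: `is_pandas_frame` and `to_spark_frame` are hypothetical helper names, not part of Incorta or PySpark, and the check relies on the assumption that a pandas DataFrame's class lives in a module under the `pandas` package. The `as_strings` flag reproduces the astype(str) workaround from the reply above, which sidesteps type inference at the cost of turning every column into a string.

```python
def is_pandas_frame(obj):
    # Assumption: pandas DataFrame classes are named "DataFrame" and are
    # defined in a module whose top-level package is "pandas".
    cls = type(obj)
    top_level = (cls.__module__ or "").split(".")[0]
    return cls.__name__ == "DataFrame" and top_level == "pandas"


def to_spark_frame(spark, frame, as_strings=False):
    """Return `frame` unchanged unless it is a pandas DataFrame.

    `spark` is the SparkSession available in the script.
    """
    if not is_pandas_frame(frame):
        return frame  # assumed to already be a Spark DataFrame
    if as_strings:
        # Blunt workaround: cast every column to string before converting.
        frame = frame.astype(str)
    return spark.createDataFrame(frame)


# Inside an MV script this could be used as:
#   save(to_spark_frame(spark, all_data, as_strings=True))
```

Leaving `as_strings` off lets Spark infer column types from the pandas data, which keeps numbers and dates typed but can fail on columns with mixed types.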
