
Use a notebook to manipulate a file before ingesting data?

We have several .csv (or Excel) files that contain subtotals, title rows with merged cells, and other content that makes them unsuitable as an Incorta data source.

Is it possible to use a notebook to access the uploaded file, delete the rows we don't want (e.g. every row labeled "subtotal"), strip out leading rows that are neither headers nor data, and then use the file as a data source?

  • Is this a one-time task? If the files are uploaded periodically and you want to perform the cleanup as part of an incremental refresh, consider creating a materialized view (MV) to do it.
    Although it is not officially supported, a PySpark MV can read the data file from disk using the standard Python file access APIs. An Incorta PySpark MV requires a save(<data frame>) call at the end. Also note that Incorta runs the MV logic when you save the materialized view, not just when you run it explicitly or on a schedule, so any files the MV writes are regenerated each time you click Save. The Incorta Notebook is launched from the screen where you create an MV. As of the current release (release 5), it is not a general-purpose notebook.

    Alternatively, you can use any Python program outside Incorta, including one you develop in a notebook, to clean up the files before Incorta consumes them. During an incremental refresh, Incorta can pick up the files from the <tenant folder>/data folder, or you can use the Data Lake - Local Files feature to read a file outside the tenant folder, provided the Incorta loader service process can access it.
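As a rough illustration of the cleanup step described above, here is a minimal sketch that uses only the Python standard library. The names clean_rows, clean_csv, header_marker, and drop_label are hypothetical and not part of Incorta. It assumes the real header row can be recognized by a known column name and that unwanted rows carry a cell starting with "subtotal". Excel files would need a separate reader and are not covered here.

```python
import csv


def clean_rows(rows, header_marker, drop_label="subtotal"):
    """Yield the real header row followed by the data rows.

    rows          -- iterable of lists of strings, e.g. from csv.reader
    header_marker -- assumption: a column name that appears only in the real header
    drop_label    -- rows with any cell starting with this text (case-insensitive) are skipped
    """
    header_seen = False
    for row in rows:
        cells = [cell.strip() for cell in row]
        if not any(cells):
            continue  # skip completely blank rows
        if not header_seen:
            if header_marker in cells:
                header_seen = True
                yield cells  # first real row: the header
            continue  # anything before the header is a title row
        if any(cell.lower().startswith(drop_label) for cell in cells):
            continue  # skip subtotal rows
        yield cells


def clean_csv(src_path, dst_path, header_marker, drop_label="subtotal"):
    """Read src_path, drop title and subtotal rows, write the rest to dst_path."""
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerows(clean_rows(csv.reader(src), header_marker, drop_label))
```

A script like this could run on a schedule outside Incorta and write its output wherever the loader picks files up. The same clean_rows function could in principle be called at the top of a PySpark MV before the data frame is built and passed to save(), subject to the unsupported file access caveat mentioned above.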

      • Chris Chen
      • PMsquare
      • Chris_Chen
      • 2 wk ago

      Dylan Wan, in what directory are the uploaded data files? With PySpark I couldn't find the directory using os.listdir.

      (this is for an Incorta Cloud client)
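For anyone trying the same thing, os.listdir only shows one directory at a time, so a recursive search can be more practical. The sketch below is a generic standard-library helper, not an Incorta API. The function name find_files and the choice of search roots are assumptions, and it can only find files under directories the running process is allowed to read.

```python
import os


def find_files(filename, roots):
    """Return the full path of every file named `filename` under the given roots."""
    matches = []
    for root in roots:
        # os.walk ignores unreadable or missing directories by default
        for dirpath, _dirnames, filenames in os.walk(root):
            if filename in filenames:
                matches.append(os.path.join(dirpath, filename))
    return matches
```

For example, find_files("sales.csv", [os.getcwd()]) would search below the current working directory for a hypothetical upload named sales.csv. An empty result does not prove the file is absent, only that it is not under the roots searched.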

    • Chris Chen We need to check with the Incorta Cloud team via Incorta Support. The location is not published, and it may be subject to change later.

  • Dylan Wan - I created Request #12495 for this question. Anything you can do to help prod the answer would be awesome!
