Use a notebook to manipulate a file before ingesting data?
We have several .csv (or Excel) files that contain things like subtotals, title rows with merged cells, and other content that makes them unsuitable as an Incorta data source.
Is it possible to use a notebook to access the uploaded file, delete the rows we don't want (e.g. every row labeled "subtotal"), strip out the header rows that are neither real headers nor data, and then use the file as a data source?
Is this a one-time task? If the files are uploaded periodically and you would like to perform this cleanup as part of an incremental refresh, consider creating a materialized view (MV) to do it.
Although it is not officially supported, a PySpark MV can read the data file from disk using the standard Python file access APIs. An Incorta PySpark MV does require a save(<data frame>) call at the end. Also note that Incorta runs the MV logic when you save the materialized view, not only when you run it explicitly or on a schedule, so any files the MV writes will be regenerated each time you hit save. The Incorta Notebook is launched from the screen where you create an MV. As of the current release (release 5), it is not a general-purpose notebook.
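As a rough illustration of the MV approach, here is a minimal sketch. It assumes the file is a plain CSV readable from the machine running the MV, that the label to drop sits in the first column, and that `spark` and `save` are the objects the MV environment provides. The names `clean_rows` and `build_mv` are hypothetical helpers, not part of Incorta.

```python
import csv

def clean_rows(lines, label_column=0, skip_label="subtotal"):
    """Parse CSV text lines and return (header, data rows) without subtotal rows."""
    # Drop rows that are completely empty (spacer lines).
    rows = [r for r in csv.reader(lines) if any(c.strip() for c in r)]
    # Assumption: the first non-empty row is the real header.
    header, data = rows[0], rows[1:]
    kept = [r for r in data if r[label_column].strip().lower() != skip_label]
    return header, kept

def build_mv(spark, save, path):
    """Hypothetical MV body: read the file with plain Python, then save a DataFrame."""
    with open(path, newline="", encoding="utf-8") as f:
        header, rows = clean_rows(f)
    # All columns arrive as strings; cast them afterwards if needed.
    df = spark.createDataFrame(rows, schema=header)
    save(df)  # the MV needs a save(<data frame>) call at the end
```

Inside the MV you would finish with something like build_mv(spark, save, path), where path points at wherever your file actually lives.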
Alternatively, you can use any Python program outside Incorta, including one you develop in a notebook, to clean up the files before Incorta consumes them. During incremental refresh, Incorta can pick up the files from the <tenant folder>/data folder, or you can use the Data Lake - Local Files feature to access a file outside the tenant folder, as long as it is accessible to the Incorta loader service process.
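For the outside-Incorta route, a small standalone script could look like the sketch below. It assumes you know the real column names in advance and uses them to find the header row, so any title rows above it are discarded. `clean_report` and `clean_file` are hypothetical names, and the script only handles CSV (an Excel file would need to be exported or read with another library first).

```python
import csv

def clean_report(rows, expected_header, skip_label="subtotal"):
    """Drop everything above the header row, then drop blank and subtotal rows."""
    wanted = [name.strip().lower() for name in expected_header]
    # Locate the real header: the first row whose cells match the expected names.
    # Raises StopIteration if no such row exists.
    start = next(
        i for i, row in enumerate(rows)
        if [cell.strip().lower() for cell in row] == wanted
    )
    cleaned = [rows[start]]
    for row in rows[start + 1:]:
        if not any(cell.strip() for cell in row):
            continue  # blank spacer row
        if any(cell.strip().lower() == skip_label for cell in row):
            continue  # row labeled as a subtotal
        cleaned.append(row)
    return cleaned

def clean_file(src_path, dst_path, expected_header):
    """Read src_path, clean it, and write the result to dst_path."""
    with open(src_path, newline="", encoding="utf-8") as src:
        rows = list(csv.reader(src))
    with open(dst_path, "w", newline="", encoding="utf-8") as dst:
        csv.writer(dst).writerows(clean_report(rows, expected_header))
```

You would then point dst_path at a location Incorta reads from, for example a file under <tenant folder>/data, and schedule the script to run before the refresh.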