03-07-2023 02:33 PM
Before anyone types "it depends," know that I'm a consultant and claim that answer for myself 😉
I'd like to know whether there are any measurements that can help me set a strategy: do incremental builds as my routine, and only run full loads when runtime performance dictates. I'm predominantly on Incorta Cloud, so I kind of don't care if I clutter up a directory with 1 or 1,000 files, but I do care if I watch the "wheel spin" when I open dashboards.
I don't know enough to know whether this should be based on the number of parquet files, the size of the files, a combination of both, or some other factor(s).
Hoping the fine folks in product development, support, or services have done some benchmarking and can provide some guidance.
01-31-2024 11:34 AM
Incremental logic will be executed by the source database. If you just add a simple WHERE clause like "LAST_UPDATE_DATE > ?", it may be fine. But if the source extraction query joins multiple tables and you use OR to connect multiple filters, like

WHERE TB1.LAST_UPDATE_DATE > ?
   OR TB2.LAST_UPDATE_DATE > ?
   OR TB3.LAST_UPDATE_DATE > ?

the database may not perform well on such a query, and extraction times will suffer.
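One rewrite that often helps (a sketch only; TB1/TB2/TB3, the join keys, and the selected columns below are placeholder names, not from the original post) is to repeat the join once per incremental filter and UNION the branches, so the optimizer can use an index on each LAST_UPDATE_DATE column instead of scanning the whole join to evaluate the OR:

-- Placeholder schema: each branch carries exactly one incremental filter.
-- UNION (not UNION ALL) de-duplicates rows that changed in more than one table.
SELECT TB1.ID, TB1.AMOUNT, TB2.STATUS, TB3.REGION
  FROM TB1
  JOIN TB2 ON TB2.TB1_ID = TB1.ID
  JOIN TB3 ON TB3.TB1_ID = TB1.ID
 WHERE TB1.LAST_UPDATE_DATE > ?
UNION
SELECT TB1.ID, TB1.AMOUNT, TB2.STATUS, TB3.REGION
  FROM TB1
  JOIN TB2 ON TB2.TB1_ID = TB1.ID
  JOIN TB3 ON TB3.TB1_ID = TB1.ID
 WHERE TB2.LAST_UPDATE_DATE > ?
UNION
SELECT TB1.ID, TB1.AMOUNT, TB2.STATUS, TB3.REGION
  FROM TB1
  JOIN TB2 ON TB2.TB1_ID = TB1.ID
  JOIN TB3 ON TB3.TB1_ID = TB1.ID
 WHERE TB3.LAST_UPDATE_DATE > ?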
When an incremental refresh would extract a large volume of data that then has to be merged, Incorta may actually do better with a full refresh.
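One rough way to decide (my own heuristic, not an Incorta feature; the table and column names are placeholders, and ? is the timestamp of your last successful load) is to count how many rows the incremental filter would pull and compare it to the table size; if the changed fraction is large, a full load is often cheaper than extract-plus-merge:

-- Compare changed rows vs. total rows before choosing incremental vs. full.
SELECT COUNT(*) AS changed_rows,
       (SELECT COUNT(*) FROM TB1) AS total_rows
  FROM TB1
 WHERE TB1.LAST_UPDATE_DATE > ?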
Incremental logic can also fragment the table: when the refresh is scheduled at a very high frequency, it produces many small parquet files. That has a negative impact on downstream processing, so a parquet merge (compaction) tool may be needed; see the sketch below.
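To illustrate what such a merge does (a Spark SQL sketch under the assumption that the fragmented parquet is registered as a table named sales; the table name and the partition count of 8 are made up), a compaction pass simply rewrites the many small files into a few larger ones:

-- Write a compacted copy rather than overwriting in place: Spark will not
-- let a statement overwrite a table it is also reading from.
-- The REPARTITION hint controls how many output files are produced.
CREATE TABLE sales_compacted
USING parquet
AS SELECT /*+ REPARTITION(8) */ * FROM sales;

Fewer, larger files mean fewer file-open and footer-read operations for whatever reads the parquet downstream, which is exactly the fragmentation cost described above.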