Hi Praveen Mantha
The source system would have to have some change data capture (CDC) mechanism. Most commonly, change data capture is accomplished through a date-based column (e.g. LastUpdatedDate). In other cases, you may have other columns, like version_num, that can be leveraged. Sometimes the source system has "audit tables" which track which records (by primary key) have changed on a given table. The SQL SELECT statements that define the incremental queries for your Incorta tables would have to be written accordingly, based on your source system's CDC approach. If your source system doesn't have a way to detect changes, you're basically in the same boat as you would be with any other ETL tool.
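For example, a date-based incremental query might look like the sketch below. The table and column names are hypothetical; the ? placeholder is where Incorta substitutes the last successful refresh time.

```sql
-- Hypothetical source table with a LastUpdatedDate-style CDC column.
-- Incorta replaces ? with the timestamp of the last successful refresh.
SELECT order_id, customer_id, amount, last_updated_date
FROM   orders
WHERE  last_updated_date > ?
```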
Please let me know if there are other ways to detect changes in your source system and we can tailor our response from there.
For SQL/JDBC-based data sources, Incorta gives you two blocks for entering data extraction logic once Incremental Refresh is enabled: one for the full load and one for the incremental load. You can put any logic into the incremental block as long as each extracted row can be identified as an update to existing data via the key defined for that table.
In some cases it is not easy to find the changed data via a Last Update Date, but we can separate the data set that could have changed from the data set that could never change. It may be worth defining that filter as the incremental logic, to avoid extracting the whole data set in every incremental refresh.
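As a sketch of that idea: if, say, closed orders can never change in the source (a hypothetical business rule, with made-up table and status values), the incremental extract can be restricted to the mutable subset rather than the whole table.

```sql
-- Hypothetical example: rows in a final status never change again,
-- so the incremental extract only pulls rows that could still be updated.
SELECT order_id, customer_id, status, amount
FROM   orders
WHERE  status NOT IN ('CLOSED', 'CANCELLED')
```

Note this extract overlaps with already-loaded data on every refresh, so the merge cost mentioned below still applies.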
Incorta merges the delta produced by the incremental logic with the previously loaded data, both when loading data into the Incorta in-memory store and during the compaction process. When we take the approach of having the incrementally extracted data set overlap with existing data, we need to be careful about the cost of processing that overlap.
Incorta does track the last refresh time and allows you to use ? as part of the incremental extract logic, but using the question mark ? in the SQL is not mandatory.
Hello, I've got the same question as the original post. From the above responses it looks like it's possible, but how is it done? In my example I'd have a field (called CheckSum) containing a key that only changes when some data on a unique row changes.
I would want the incremental load to load any row whose CheckSum value doesn't already exist on the table. How would this be done?
Hi Dan, thanks for the response. I've tried the following and gotten a couple different errors.
In your example, when I do this in an incremental load...
Select field1, field2, field3 from [MyDataSource].[MyDBTable] where md5col != md5(field1 || field2 || field3)
it fails with the error "INC_005003001: Failed retrieving sample data from [MyDataSource] with properties [[incremental, true], [fetchsize, 5000], [timestampFieldType, DATE]]". However, the log file says "Invalid column name 'md5col'".
This makes sense if it's looking for this column on [MyDataSource].[MyDBTable], since the column doesn't exist there; it only exists on the table in Incorta (where it's populated and visible when exploring the data). If I try to point the query at Incorta for the md5 column, it looks like this...
Select field1, field2, field3 from [MyDataSource].[MyDBTable] where [MyIncortaSchema].[MyIncortaTable].[md5col] != md5(field1 || field2 || field3)
However, that returns the same error in the UI, but the log says "MyIncortaSchema.MyIncortaTable.md5col could not be bound", which looks like it's trying to find it on the data source, not within Incorta.
Is it possible to have the incremental query access both the data source and the Incorta table it's going to load? I'd guess yes, since that's how it works with the 'Last Modified Date'. Do you know what I'm doing wrong?
Thanks for your help,
A couple of other solutions to the original question: 1) Use a database trigger on the source table to capture the keys of rows being inserted or updated, along with a timestamp, then join that change table to the primary table in the Incorta SQL extract query. This allows incremental updates of rows. Just make sure the trigger has no performance impact on the source application.
2) Use a tool like GoldenGate or similar, which can read the database logs and write the keys of inserted, updated, or deleted source rows to a table. Then use that table with the primary table to do incremental upserts.
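A minimal sketch of approach 1), with hypothetical table, column, and trigger names (roughly Oracle-style syntax; adjust for your database):

```sql
-- Hypothetical change-log table populated by a trigger on the source table
CREATE TABLE orders_chg (
  order_id   NUMBER,
  changed_at DATE
);

-- Trigger records the key and a timestamp whenever a row changes
CREATE OR REPLACE TRIGGER trg_orders_chg
AFTER INSERT OR UPDATE ON orders
FOR EACH ROW
BEGIN
  INSERT INTO orders_chg (order_id, changed_at)
  VALUES (:NEW.order_id, SYSDATE);
END;

-- Incremental extract in Incorta: join the change log back to the primary
-- table; ? is replaced by Incorta with the last successful refresh time
SELECT o.*
FROM   orders o
JOIN   orders_chg c ON c.order_id = o.order_id
WHERE  c.changed_at > ?
```

The change-log table should also be pruned periodically so it doesn't grow without bound, and the trigger's write cost on the source application should be measured before going to production.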