
Incremental load

Hello All,

Can anyone help me understand how we can perform an incremental load in Incorta for tables that do not contain any date columns such as Last Updated Date?

  • Do you have any incremental column that gets generated based on the data loaded?

  • Hi Praveen Mantha

    The source system would have to have some change data capture (CDC) mechanism.  Most commonly, change data capture is accomplished through a date-based column (e.g., LastUpdatedDate).  In other cases, you may have other columns, like version_num, that can be leveraged.  Sometimes the source system has "audit tables" which track which records (by primary key) have changed on a given table.  The SQL SELECT statements that define your incremental queries for your Incorta tables would have to be written accordingly, based on your source system's CDC approach.  If your source system doesn't have a way to detect changes, you're basically in the same boat as you would be with any other ETL tool.
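
    As a rough sketch only (the orders table and last_updated_date column here are hypothetical, and this assumes ? is replaced with the last successful refresh time), a date-based incremental query could be as simple as:

    -- Extract only the rows changed since the last successful refresh
    SELECT * FROM orders WHERE last_updated_date > ?

    The same pattern applies if the WHERE clause is driven by a version_num column or by the keys recorded in an audit table instead.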

    Please let me know if there are other ways to detect changes in your source system and we can tailor our response from there.

    Thanks,
    Dan

  • For a SQL/JDBC-based data source, Incorta gives you two blocks for entering the data extraction logic once Incremental Refresh is enabled.  You can put any logic into the incremental extraction block as long as the extracted data can be identified as an update to existing data via the key defined for that table.

    In some cases it is not easy to find the changed data via a Last Update Date, but we can separate the data set that could change from the data set that will never change.  It may be worth defining that filter as the incremental logic to avoid extracting the whole data set in every incremental refresh (see the sketch at the end of this reply).

    Incorta merges the delta produced by the incremental logic with the previously loaded data, both when loading data into the Incorta in-memory store and during the compaction process.  When the data set extracted by the incremental logic overlaps with existing data, we need to be careful about the cost of processing that overlap.

    Incorta does track the last refresh time and allows you to use ? as part of the incremental extraction logic, but using the question mark (?) in the SQL is not mandatory.
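
    As a rough illustration of the "changeable vs. stable subset" idea above (the orders table and order_status column are hypothetical), the incremental block could simply re-extract the subset that might still change and let Incorta merge it into the existing data by the table key:

    -- Re-extract only the rows that could still change; Incorta merges them by the table's key
    SELECT * FROM orders WHERE order_status NOT IN ('CLOSED', 'CANCELLED')

    Because this subset overlaps with data that was already loaded, the processing-cost caveat above applies.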

  • Hello, I've got the same question as the original post.  From the above responses, it looks like it's possible, but how is it done?  In my example I'd have a field (called CheckSum) that would contain a value that only changes when some of the data on a given row changes.

    I would want the incremental load to load any row that has a CheckSum value that doesn't already exist on the table.  How would this be done?

    • Mike Mascitti -- I have never employed this, but, in theory, if you have a column on your table defined as md5(col1 || col2 || col3), and that column only gets set *after* an incremental load, then you could do something like this for your incremental load query:


      SELECT * FROM <table> WHERE md5col IS NULL OR md5col != md5(col1 || col2 || col3)


      In the case where new records have been inserted, md5col will be null.  In the case where col1, col2, or col3 have been updated, the md5s won't match and you will retrieve the updated records.  You will have to execute something (like a stored proc) after your successful incremental load to set the md5col field on your table.
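
      For example (illustrative only: md5() availability and the || concatenation syntax vary by database, and <table> / md5col follow the hypothetical setup above), that post-load step could be an UPDATE such as:

      -- Run after a successful incremental load so the next run only picks up new changes
      UPDATE <table>
      SET md5col = md5(col1 || col2 || col3)
      WHERE md5col IS NULL OR md5col != md5(col1 || col2 || col3)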

  • Hi Dan, thanks for the response.   I've tried the following and gotten a couple different errors.  

    In your example, when I do this in an incremental load...

    Select field1, field2, field3 from [MyDataSource].[MyDBTable] where md5col != md5(field1 || field2 || field3) 

    it fails with the error "INC_005003001:Failed retrieving sample data from [MyDataSource] with properties [[incremental, true], [fetchsize, 5000], [timestampFieldType, DATE]]".  However, the log file says "Invalid column name 'md5col'".

    This makes sense if it's looking for this column on [MyDataSource].[MyDBTable] as the column doesn't exist there, it's only on the table in Incorta (and it's populated and visible when exploring the data).  If I try to tell the query to look at Incorta for the md5 column, it looks like this...

    Select field1, field2, field3 from [MyDataSource].[MyDBTable] where [MyIncortaSchema].[MyIncortaTable].[md5col] != md5(field1 || field2 || field3) 

    However, that returns the same error in the UI, but in the log it says "MyIncortaSchema.MyIncortaTable.md5col could not be bound", which looks like it's trying to find it on the data source, not within Incorta.


    Is it possible to have the incremental query access both the data source and the Incorta table it's going to load?  I'd guess yes, since that's how it works with the 'Last Modified Date'.   Do you know what I'm doing wrong?  

    Thanks for your help,

    • Mike Mascitti Hi Mike, I am not sure I am following 100%.  The approach I spell out above was based on tables and columns in your source system only (not based on any Incorta tables).  Thus, you'd absolutely need to create a source system column called md5col.  If you want to build this on Incorta tables, I think we'd need a separate session to understand why and then dig deeper into your needs and the specifics of that solution.   Happy to connect directly if it makes sense.

  • A couple of other solutions to the original question: 1) Use a database trigger on the source table to capture the key of each row getting updated, along with a timestamp, then join that table to the primary table in the Incorta SQL extract query (a sketch follows at the end of this reply).  This allows incremental updates of rows.  Just make sure the trigger has no performance impact on the source application.

    2) Use a tool like GoldenGate (or similar) that can read the database logs and write the keys of inserted, updated, or deleted source rows to a table.  Then use that table with the primary table to do incremental upserts.
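
    A rough sketch of the extract query for either approach (the orders table, the orders_changes change-log table, and their columns are all hypothetical; ? stands for Incorta's last-refresh-time placeholder):

    -- orders_changes is populated by the trigger or by the log-reading tool
    SELECT o.*
    FROM orders o
    JOIN orders_changes c ON c.order_id = o.order_id
    WHERE c.changed_at > ?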
