0

Upsert for Incremental Loading

Has anyone successfully done an upsert on a incremental load without timestamp? Maybe using a primary key or a check sum type function? I don't always have a timestamp in my data sets.

17replies Oldest first
  • Oldest first
  • Newest first
  • Active threads
  • Popular
  • Hi Joseph - you can do this with a SQL hashbytes function.  It allows you to pick whatever fields you want and creates a hash key out of them.  Then just make the HashKey the primary Key on the incorta table when it loads and the upsert should work.  Here's an example with some made up data....

    This will create a field called HashKey (or whatever you name it) and any time either myFirstField, mySecondField or myThirdField changes Incorta will insert a new row.  If myFourthField changes, Incorta will update your current row that has the identical values for the first three fields.  

    Basically it creates a composite primary key.  So if you put ALL your fields in the CONCAT list you'll never update anything, you'll always insert a new row unless you get a completely identical row.

    Hope this helps,

    SELECT myFirstField, mySecondField, myThirdField, myFourthField,
      CONVERT(
          VARCHAR(32),
          HashBytes(
              'MD5',
              CONCAT(
                  myFirstField,
                  mySecondField,
                  myThirdField
              )
          ),
          2
      ) as HashKey
    Like 1
  • Thank you so much, but this will only change the current record. What about bringing the delta records that are now in the source table?

    Like
  • If I understand correctly, it sounds like you want to pull new rows without having a timestamp or anything that helps identify them as new.  If you can't put something in the query to only pull new rows (and it sounds like you can't if there isn't a time stamp) then you have to pull everything and let Incorta decide what's new based on what has a different 'Key' than already exists on the table.   

    That's my understanding of how incremental loading works though - maybe some other replies will help clarify.

    Like 1
  • Thanks, Mike. So if I understand you correctly, we leverage a primary key from the source table, call it "key" in Incorta and then leverage that in the incremental load even if it isn't a timestamp?

    Like
    • joseph. delpercio 

      To answer your question above...  Whatever field(s) you label as 'Key' in the load details tell Incorta how to define a unique row.  If you load a row in Incremental load that has the same 'Key' as an existing row then Incorta will overwrite that row.  If the 'Key' is new, then the row will be added to the end of the table.  This is how the Upsert works.  (The 'Key' also prevents you from ever having duplicate rows)

      Using  WHERE LastModifiedDate > ? in your Update's select query allows you to only pull the rows from the source you need to update/insert.  (Your 'Key' would be 'UniqueCustomerID' or whatever your source table uses.. it wouldn't be the time stamp).  It would still work if you left it off,  you'd just end up overwriting rows you don't need to and you may as well just do a full load.

      Dan's suggestion would require some changes to your source table but it should solve your issue.

      Like
    • Mike Mascitti  Dan Brock so for example, if my query looked like this (recordid is my key):

      Normal Query:

      SELECT

          RecordID,

          Date,

          JurisdictionID,

          Jurisdiction,

          EGMId,

          SerialNumber,

          SumBet,

          [GamesPlayedSum],

          ProgressiveRate,

          ResetRate,

          TheoreticalHold,

          LinkName,

          GameName,

          CasinoName,

          StatusID,

          Denom,

          StatusDesc,

          '0' as WinAmountSum,

          SMI_Number

      FROM

          BPSDWRPTV01.OracleDW.dbo.vwBillingData

       

      Would my Incremental query look like this?

      SELECT

          RecordID,

          Date,

          JurisdictionID,

          Jurisdiction,

          EGMId,

          SerialNumber,

          SumBet,

          [GamesPlayedSum],

          ProgressiveRate,

          ResetRate,

          TheoreticalHold,

          LinkName,

          GameName,

          CasinoName,

          StatusID,

          Denom,

          StatusDesc,

          '0' as WinAmountSum,

          SMI_Number

      FROM

          BPSDWRPTV01.OracleDW.dbo.vwBillingData

      where RecordID =?

      Ari Gnanam

      Like
    • joseph. delpercio No.  You'd need to create a column on your source system table called md5col.  This field would be populated using a formula that calculates the md5 checksum of a group of fields of your choice. (e.g. md5(serialNumber || EGMId || Jurisdiction) ).  This field would need to be updated in the source system periodically (nightly, after your Incorta incremental runs).  Then your incremental query would look like this..

      SELECT <fields> FROM <table> WHERE md5col != md5(serialNumber || EGMId || Jurisdiction)

      Like 1
    • Dan Brock most of the time we don't have access to our source system to create a column. We are readers of the data, so what is another way if you can edit a source table?

      Like
    • joseph. delpercio Incorta, or any other data extraction utility for that matter, will need to rely on the source system to identify change.   If your source system cannot provide means to identify change and/or you can't modify the source system to employ mechanisms to identify change, there isn't much a downstream tool can do that can only read data on its own. 

      Like
  • Hi  joseph. delpercio -- please reference this other Q&A thread on the community that Mike and I discussed while back https://community.incorta.com/t/x17nm3/incremental-load.

     

    Essentitally, I think you're going to have to have a new column (e.g. md5col) on your source system table which stores off the current state of the md5checksum.  Once you have that created then your incremental SELECT statement in Incorta can look something like this:

     

    SELECT * FROM <table> WHERE md5col != md5(col1 || col2 || col3)

    Like
  • Dan Brock  I get this error.

    Like
    • joseph. delpercio I can't see the query you issued.  Hard to suggest what the ERROR is from without that.

      Like
    • Dan Brock sorry about that:

       

      SELECT

          RecordID,

          Date,

          JurisdictionID,

          Jurisdiction,

          EGMId,

          SerialNumber,

          SumBet,

          [GamesPlayedSum],

          ProgressiveRate,

          ResetRate,

          TheoreticalHold,

          LinkName,

          GameName,

          CasinoName,

          StatusID,

          Denom,

          StatusDesc,

          '0' as WinAmountSum,

          SMI_Number

      FROM

          BPSDWRPTV01.OracleDW.dbo.vwBillingData

      where cast(RecordID ! as int)= cast(recordId as int)

      Like
    • joseph. delpercio This is not the query we were suggesting so I am not sure how to help you here.

      Like
  • Dan Brock  I will have to review this a bit more then, can I not do that type of incremental type in the incremental query? I made a recordid in a source table that I could control to test it. However, you can't make it a key in Incorta unless it is a int or string, so I made it an integer. I can change, but that still doesn't satisfy the timestamp requirement the incremental loader is looking for.

    Like
    • joseph. delpercio Keep in mind your table key does not have to be the same field you use in your WHERE condition.  Your table key could be a string of concatenated fields while your WHERE condition could be based on another field with a md5 checksum.

      Like
  • I hope I am not deviating from the main question.

    You can do incremental without a timestamp. 

    Ex:
    Normal Query: select record_id, record_type from table_a
    
    Update Query: select record_id, record_type from table_a

    and select "record_id" as a key column in Incorta UI. 

    This will capture delta i.e. if record_id does not present, it will insert the record,  if record_id is present, it will update the record. 

    But the catch is table size. If your source table size is too large then for every incremental load, Incorta scans the entire table and it will take time. 

    That is where we are using the timestamp to restrict the number of records Incorta scans. 

    Ex:
    Normal Query: select record_id, record_type from table_a
    
    Update Query: select record_id, record_type from table_a where last_modified > ?

     

    Now, Incorta is restricted to scan only a certain set of records and do Upsert operation.

    Like
Like Follow
  • 6 mths agoLast active
  • 17Replies
  • 81Views
  • 4 Following

Product Announcement

Incorta 4.9 is now Generally Available (GA)!!!