Handling Deletes

dylanwan · ‎03-08-2022

Introduction

When loading data, Incorta can insert new records or update existing records that already have been loaded but it does not delete records that have already been loaded. Your source systems, however, may allow users to delete data from the database. After the data are deleted, those records may still exist in Incorta unless we handle the situation.

There are a number of strategies that can be implemented depending on the size of your data and whether the deleted records can be easily identified.

What you should know before reading this article

We recommend that you be familiar with these Incorta concepts before exploring this topic further.

Applies to

These concepts apply to all releases of Incorta.

Let's Go

Running Full Load

When a schema table is defined without incremental logic, full load logic will be applied on incremental refresh. The data in incorta will be truncated and reloaded in full. Joins from child tables and to parent tables will be recalculated. Formula columns will also be recalculated. Since the the table is rebuilt as a result of the full load, deleted records from the source will no longer be reflected in Incorta.

No special schema model design is required in Incorta to handle deletes in the source when using full loads.

Running full load is the easiest approach to managing data deletes but when the size of the table is large, incremental refresh may take a long time. Data sync from the loader service to analytics will also take a long time. This option should be used only when the size of the data is small or the incremental refresh is not frequent.

Capture the Delete Event in the Source system

If executing full reloads is not practical, then deleted records can be identified in the source system and the deleted records can be extracted into a skinny table with the primary key of the table only,

Various ways can be used to identify deleted records in the source system. Here is one recommended approach:

Use Database Trigger

You will need the ability to create a new table and trigger in the source database. The method is the following:

Create a key-only table in the source database with just the key column(s) for the table you need to handle deletes in. Add a DELETED_ON column with a timestamp data type.
Create an On Delete or After Delete trigger on the source table that writes out the key(s) to the keys table created above. Set the DELETED_ON column value to the current timestamp at delete.
Create a key-only table in Incorta for this narrow keys table from this source database. Set up incremental loads on the table containing the DELETED_ON column.
Make the key-only table a parent table to the base table
Apply a Runtime Security filter on the Key column with a condition: isNull(Test.parent_table_with_keys.id)

Use Database Log

Some databases may provide a delete log and we can load that data into Incorta as well to populate the key-only table.

Identify the deleted records in Incorta

If a source delete log is not available and creating tables and database triggers is not allowed, we can use Incorta to identify the deleted records.

Create a key-only table and extract the data in a full load
Make the key-only table a parent table to the base table that may have deleted records by creating a join from the base table to the key only parent table
Apply a runtime security filter on the key column from the parent table with the IS NOT NULL condition:
not(isNull(Test.parent_table_with_keys.id))
Schedule a periodic full load of the base table (for example run it weekly) to flush out the deleted records

This design assumes that the full load of the key-only table is running much faster than running a full load for the base table. Otherwise, a simple solution can be just running full load for the base table.

Limit the size of the full extract of the deleted key columns

When the base table is huge and extracting even just the key column takes awhile, a possible workaround is to limit the number of records that are extracted from the source into the key-only table by setting the extract since date. This approach assumes that the old data will not be deleted and uses that assumption to define the range of data to be scanned.

When a parent and child relationship exists

The approaches described above are applied on an individual tables basis. The runtime security described above is only applied during runtime on the driving table. When there are deleted records in the parent table and the child records that joined under those deleted parent need to be filtered out or deleted, here is an alternate approach:

All tables with the associated key-only table will have a DELETED formula column

if(
isNull(
<SCHEMA>.<TABLE NAME>_PK.ID
),
'Y',
'N'
)

2. All tables with the associated key-only table and the DELETED formula will have a runtime security rule on the DELETED column

<SCHEMA>.<TABLE>.DELETED = 'N'

3. Parent to Child Cascade Delete will be implemented in the runtime security filter with the parent table DELETED column as well.

and(
<SCHEMA>.<TABLE>.DELETED = 'N',
<SCHEMA>.<PARENT TABLE 1>.DELETED = 'N',
<SCHEMA>.<PARENT TABLE 2>.DELETED = 'N'
)

In this approach, the key-only table will not participate in any join directly and will not alter the query plan.

Please note that the deletion filter should be driven by the reporting requirement and defined individually. For example, when a supplier record is deleted or inactive, it may simply mean that no more transactions should be created against the supplier, but the existing historical transactions may still valid. The DELETED flag may not need to be cascaded to the transaction table.