Introduction

Incorta has a wide variety of external data source connectors. A connector specifies how Incorta connects to an external system or application, ingests data, and publishes to destinations. Incorta includes many out-of-the-box connectors for data ingestion; please refer to this page for more details on the types of connectors supported.

What you should know before reading this article

We recommend that you be familiar with these Incorta concepts before exploring this topic further.

Applies to

These concepts apply to all releases of Incorta.

Let's Go

Managing Data Sources

  • Data sources in Incorta are the connections to source systems. By default, Incorta schema managers have the right to create a data source that connects to a source system. However, since data sources, like any Incorta objects, are created as private objects by default, letting individual schema managers create their own data sources in an environment can be problematic and lead to integration issues later. A recommended practice is to secure data sources centrally and grant schema managers View permissions only.
  • Data source names are referenced in schema definitions at the table level. Use a name that describes the content, not the instance, of the data source. For example, you may have multiple EBS systems: for UAT, production, backup, unit testing, and so on. If you create each of them as a separate data source and reference its name in the schema definition, you will have to change the references before or during migration.
  • When connecting to a database, consider granting only the required permissions to the database user. In general, update permission is not required. Execute permission on stored procedures may be required if stored procedures are used in the extraction logic or in external session variables; a minimal illustration of such grants follows this list.
  • Read or SELECT access can be granted to the specific schemas or tables needed by Incorta.  
  • If testing or development databases are used to develop Incorta content, ensure that the same database users are available in the production environment with the same permissions granted.
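
As a loose illustration (not taken from this article), the sketch below shows the kind of minimal, read-only grants you might issue for an Incorta service account, assuming a PostgreSQL source and the psycopg2 driver; the user name, schema name, and connection string are hypothetical placeholders.

```python
# Illustrative only: minimal read-only grants for a hypothetical Incorta
# service account on a PostgreSQL source, issued through psycopg2.
import psycopg2

GRANTS = [
    "GRANT USAGE ON SCHEMA sales TO incorta_reader",
    "GRANT SELECT ON ALL TABLES IN SCHEMA sales TO incorta_reader",
    # No UPDATE or DELETE grants: Incorta only needs to read.
    # Grant EXECUTE on specific procedures only if extraction logic or an
    # external session variable calls them, for example:
    # "GRANT EXECUTE ON FUNCTION sales.get_increment(timestamp) TO incorta_reader",
]

with psycopg2.connect("dbname=erp user=dba") as conn:  # placeholder DSN
    with conn.cursor() as cur:
        for stmt in GRANTS:
            cur.execute(stmt)
```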

Parallel Extraction of Very Large Source Tables

  • Chunking: Incorta chunking is a table data source configuration that allows data to be extracted using parallel SQL queries. Parallel execution helps extract rows faster from very large tables. Chunking is handled at the table level, where you can specify the chunk size for a single table extract; a simplified sketch of the idea follows this list. Ensure the database connection pool is large enough to accept multiple queries running against the source database. Please refer to this article for further details.
  • Spark Based Extraction: This option uses Spark to perform parallel extraction and can show a significant full load performance benefit if a table has in excess of hundreds of millions of rows. Ensure you have sufficient workers to run many extraction queries at the same time.
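
To make the chunking idea concrete, here is a minimal, self-contained sketch of range-based parallel extraction. It is not Incorta's implementation; SQLite is used purely so the example can run anywhere, and the table and chunk size are made up.

```python
# Sketch of chunked parallel extraction: split a large table into key ranges
# and run one SELECT per range concurrently (illustrative, not Incorta code).
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor

DB = tempfile.NamedTemporaryFile(suffix=".db", delete=False).name
CHUNK_SIZE = 1_000

# Stand-in for a large source table.
with sqlite3.connect(DB) as conn:
    conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [(i, i * 1.5) for i in range(1, 10_001)],
    )

def extract_chunk(lo: int, hi: int) -> list:
    # Each worker uses its own connection, i.e. one pooled connection per query.
    with sqlite3.connect(DB) as conn:
        return conn.execute(
            "SELECT id, amount FROM sales WHERE id BETWEEN ? AND ?", (lo, hi)
        ).fetchall()

with sqlite3.connect(DB) as conn:
    lo, hi = conn.execute("SELECT MIN(id), MAX(id) FROM sales").fetchone()

ranges = [(s, min(s + CHUNK_SIZE - 1, hi)) for s in range(lo, hi + 1, CHUNK_SIZE)]
with ThreadPoolExecutor(max_workers=4) as pool:
    chunks = pool.map(lambda r: extract_chunk(*r), ranges)

rows = [row for chunk in chunks for row in chunk]
print(f"extracted {len(rows)} rows in {len(ranges)} chunks")
```

The trade-off described above applies here as well: more chunks in flight means faster extraction but more concurrent connections against the source.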

Data Agents 

The data agent (DA) is a lightweight application that runs on a server with access to the source database, on premises or in the cloud, and makes an outbound connection to the customer's Incorta Cloud instance. Ensure that the data agent is placed on the same server, or in the same network, as the server on which the source database is running. Please refer to this article on how to install and configure a data agent.

Data Lakes: Setting Up Incremental Tables with a Data Lake

Data lakes usually receive updates in the form of new incremental files added to an existing directory. Incorta's data lake connectors accommodate this pattern via the “Wildcard Union” option.

Use Case 1

Use the "Wildcard Union" option and the "Last Successful Extract Time" incremental strategy.

  • A directory containing all the sales data is located at /path/to/sales
  • The directory contains the following files: /path/to/sales/sales_california.parquet, /path/to/sales/sales_newyork.parquet, /path/to/sales/sales_chicago.parquet, …
  • You can create a table using a data lake connector, point it to the directory and enable the “Wildcard Union” option
  • To set up incremental refreshes, enable the incremental flag and choose the “Last Successful Extract Time” incremental strategy

Now when the user triggers a full load, the union of all existing files is extracted into the same table. After that, if the directory receives a new file, for example /path/to/sales/sales_ohio.parquet, the next incremental load will pick up this file, since its last modified timestamp is more recent than that of the files extracted in the previous full load.
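
As an illustration of the selection logic only (not Incorta's internal code), the sketch below picks up parquet files modified after the last successful extract, assuming a local directory stands in for the data lake:

```python
# Sketch of the "Last Successful Extract Time" strategy: select files whose
# last modified timestamp is newer than the previous successful extract.
from pathlib import Path

def files_to_load(directory: str, last_extract_ts: float) -> list:
    """Return parquet files modified after the last successful extract."""
    return [
        p for p in Path(directory).glob("*.parquet")
        if p.stat().st_mtime > last_extract_ts
    ]

# A full load passes 0.0, so the union of all existing files is selected; an
# incremental load passes the timestamp recorded by the previous run, so only
# newer files (e.g. sales_ohio.parquet) are picked up.
new_files = files_to_load("/path/to/sales", last_extract_ts=0.0)
```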

Use Case 2

Use the "Wildcard Union" option and the "Timestamp in File Name" incremental strategy.

  • A directory containing all the sales data is located at /path/to/sales
  • The directory receives a new file for sales on a daily basis: /path/to/sales/sales_2020-04-01.parquet, /path/to/sales/sales_2020-04-02.parquet, /path/to/sales/sales_2020-04-03.parquet, …
  • You can create a table using a data lake connector, point it to the directory and enable the “Wildcard Union” option
  • To leverage the date identifier in the file name, set up incremental refreshes using the “Timestamp in File Name” incremental strategy and define the timestamp pattern that will appear in the file names

Now when the user triggers a full load, the union of all existing files is extracted into the same table. After that, if the directory receives a new file, for example /path/to/sales/sales_2020-04-04.parquet, the next incremental load will pick up this file, since the date part of its name is more recent than that of the latest file previously loaded.
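
Again as an illustration of the selection logic only, the sketch below parses the date embedded in each file name, using a pattern that mirrors the example file names above:

```python
# Sketch of the "Timestamp in File Name" strategy: compare the date parsed
# from each file name against the newest date already loaded.
from datetime import datetime
from pathlib import Path

PATTERN = "sales_%Y-%m-%d.parquet"  # timestamp pattern defined on the table

def file_date(path: Path) -> datetime:
    # strptime also matches the literal "sales_" prefix and ".parquet" suffix.
    return datetime.strptime(path.name, PATTERN)

def files_newer_than(directory: str, latest_loaded: datetime) -> list:
    """Return files whose embedded date is newer than the latest one loaded."""
    return [
        p for p in Path(directory).glob("sales_*.parquet")
        if file_date(p) > latest_loaded
    ]

# After a full load through 2020-04-03, the next incremental load selects
# only sales_2020-04-04.parquet and anything later.
new_files = files_newer_than("/path/to/sales", datetime(2020, 4, 3))
```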

Other Useful Options

Include Pattern

If the directory contains heterogeneous files and you want to pick up only files that start with a certain prefix, you can use the Include field to define a pattern such as sales*.parquet. Specifying this makes load events consider only files that start with sales and end with .parquet.
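
The pattern uses ordinary glob-style wildcards; the short sketch below shows the same matching semantics with Python's fnmatch, using made-up file names:

```python
# Glob-style include pattern filtering a heterogeneous directory listing.
from fnmatch import fnmatch

files = [
    "sales_california.parquet",
    "inventory_2020.parquet",  # skipped: wrong prefix
    "sales_ohio.parquet",
    "sales_notes.txt",         # skipped: wrong extension
]
matched = [f for f in files if fnmatch(f, "sales*.parquet")]
print(matched)  # ['sales_california.parquet', 'sales_ohio.parquet']
```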

Include Sub-Directories

If the directory contains an arbitrary hierarchy of sub-directories and you would like to load all files within the hierarchy, enable the Include Sub-Directories flag. This option respects the Include pattern if one is specified.
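
Conceptually this combines recursive traversal with the include pattern, as in this sketch (the pattern and root path are illustrative):

```python
# Recursive traversal that still applies the glob-style include pattern.
from pathlib import Path

def matching_files(root: str, pattern: str = "sales*.parquet") -> list:
    # rglob walks the entire sub-directory hierarchy under root.
    return sorted(Path(root).rglob(pattern))
```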

Related Material

Best Practices Index
Best Practices
