<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Upper limit on datasets for a table ( and how are the processed ) ? in Administrative Discussions</title>
    <link>https://community.incorta.com/t5/administrative-discussions/upper-limit-on-datasets-for-a-table-and-how-are-the-processed/m-p/4568#M168</link>
    <description>&lt;P&gt;The multi-source table is a good solution in this case, but is there a reason you must extract the months into separate Incorta tables and then glue them together in a multi-source table through SQLi? It would be better to create a multi-source table where each dataset extracts from the source directly; that way you eliminate one step (re-extraction through SQLi) and save cluster resources.&lt;/P&gt;&lt;P&gt;As for the number of data sets, there is no hard limit. I have seen customer use cases with 25+ data sets in a table, and I've heard of 60. Technically there's no limit, but you have to make sure the loader service has enough resources to execute the load.&lt;/P&gt;&lt;P&gt;When you load a table with multiple data sets, the loader will try to extract the data sets in parallel. This is governed by a thread pool shared by all data sets being extracted across the load job, which may contain other tables. So, expect parallel execution of the queries against your source.&lt;/P&gt;&lt;P&gt;In the CMC, you can set the number of parallel load jobs, but this applies to whole load jobs (the ones you set in the scheduler), which are usually schemas. The capacity of the thread pool used to extract tables and data sets depends on the CPU Utilization percentage you set for the loader service: the higher the percentage, the more tables (and data sets) can be extracted in parallel.&lt;/P&gt;</description>
    <pubDate>Sun, 25 Jun 2023 07:41:41 GMT</pubDate>
    <dc:creator>AhmedM</dc:creator>
    <dc:date>2023-06-25T07:41:41Z</dc:date>
    <item>
      <title>Upper limit on datasets for a table ( and how are the processed ) ?</title>
      <link>https://community.incorta.com/t5/administrative-discussions/upper-limit-on-datasets-for-a-table-and-how-are-the-processed/m-p/4529#M163</link>
      <description>&lt;P&gt;I have a scenario in front of me where I have a bunch of Incorta tables with identical columns -- i.e. I have a table for each month of the year for the past 2+ years, which I want to consolidate into one big table w/ all the data inside.&lt;/P&gt;&lt;P&gt;Monthly tables: pulled from the original source. Broken into monthly tables because they return exponentially faster that way.&lt;/P&gt;&lt;P&gt;Consolidated table: IncortaSQL apparently does not support doing a big UNION or UNION ALL, and we were getting memory errors using SQLi with the big UNION statement.&lt;/P&gt;&lt;P&gt;What *IS* working is having all the month tables as separate datasets via SQLi - and it's pretty fast!&lt;/P&gt;&lt;P&gt;That said, we must remain optimistic and assume the calendar will not stop.&lt;/P&gt;&lt;P&gt;So we will need to keep adding month tables as we go ( we have 2023 covered, w/ July forward obvs not having any data yet ).&lt;/P&gt;&lt;P&gt;So - finally - the question: is there a limit to the number of datasets which Incorta will support?&lt;/P&gt;&lt;P&gt;Corollary question - how are these processed? Does Incorta send 36 requests ( to itself in this case ) at the same time? Is that governed by the parallel job tasks config in the CMC? I can't imagine they are all done in series, although there must be some staging somewhere w/in the mix because I am having Incorta enforce primary key constraints.&lt;/P&gt;&lt;P&gt;I'm very excited about this - give me some good news, Community members! &lt;span class="lia-unicode-emoji" title=":winking_face:"&gt;😉&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Screen cap - 3 of 30 datasets ( 2021 combines Jan-Jun as they were relatively low-volume months ):&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="RADSr_0-1687458262087.png" style="width: 400px;"&gt;&lt;img src="https://community.incorta.com/t5/image/serverpage/image-id/2272i53EC373B3ED60E3D/image-size/medium?v=v2&amp;amp;px=400" role="button" title="RADSr_0-1687458262087.png" alt="RADSr_0-1687458262087.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 22 Jun 2023 18:27:57 GMT</pubDate>
      <guid>https://community.incorta.com/t5/administrative-discussions/upper-limit-on-datasets-for-a-table-and-how-are-the-processed/m-p/4529#M163</guid>
      <dc:creator>RADSr</dc:creator>
      <dc:date>2023-06-22T18:27:57Z</dc:date>
    </item>
    <item>
      <title>Re: Upper limit on datasets for a table ( and how are the processed ) ?</title>
      <link>https://community.incorta.com/t5/administrative-discussions/upper-limit-on-datasets-for-a-table-and-how-are-the-processed/m-p/4555#M164</link>
      <description>&lt;P&gt;I'll answer the corollary question.&lt;BR /&gt;&lt;BR /&gt;The "staging" or primary key enforcement happens during the load/post-load phase, so there should be parallelization across the different datasets; the configuration of the loader node (# of CPUs dedicated to the loader, etc.) and the parallelization settings in the CMC should drive that.&lt;BR /&gt;&lt;BR /&gt;However, because the queries themselves are hitting the Incorta Analytics service, the SQL interface/Analytics node could decide to queue up query execution on that end even though the client - in this case Incorta's own loader service - may have issued a set of queries (8, 12, 16, whatever) in parallel.&lt;BR /&gt;&lt;BR /&gt;I don't know of any limit on datasets per table (probably a theoretical one?), so maybe someone from Engineering can weigh in. Fun fact: the first Proof of Concept I ever did @ Incorta (2017?) was 50 different manufacturing instances (1 per factory) that we poured into a single schema/table structure, so it has been built from the beginning to handle this type of use case.&lt;/P&gt;</description>
      <pubDate>Fri, 23 Jun 2023 15:55:16 GMT</pubDate>
      <guid>https://community.incorta.com/t5/administrative-discussions/upper-limit-on-datasets-for-a-table-and-how-are-the-processed/m-p/4555#M164</guid>
      <dc:creator>PatrickRafferty</dc:creator>
      <dc:date>2023-06-23T15:55:16Z</dc:date>
    </item>
    <item>
      <title>Re: Upper limit on datasets for a table ( and how are the processed ) ?</title>
      <link>https://community.incorta.com/t5/administrative-discussions/upper-limit-on-datasets-for-a-table-and-how-are-the-processed/m-p/4559#M165</link>
      <description>&lt;P&gt;And here's the parallelization in action with OnlineStore (quickest example I could pull up):&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Query the largest table (salesorderdetail) 3 different ways, turning the 9-million-row table into a 27-million-row table.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="PatrickRafferty_0-1687535841918.png" style="width: 400px;"&gt;&lt;img src="https://community.incorta.com/t5/image/serverpage/image-id/2280i7D3BC1A0A81E560E/image-size/medium?v=v2&amp;amp;px=400" role="button" title="PatrickRafferty_0-1687535841918.png" alt="PatrickRafferty_0-1687535841918.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;And that extracts in 41 seconds:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="PatrickRafferty_1-1687535912898.png" style="width: 400px;"&gt;&lt;img src="https://community.incorta.com/t5/image/serverpage/image-id/2281i59DBD9A86C8FF98C/image-size/medium?v=v2&amp;amp;px=400" role="button" title="PatrickRafferty_1-1687535912898.png" alt="PatrickRafferty_1-1687535912898.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;And a single query to that table extracts in 30 seconds:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="PatrickRafferty_2-1687535990684.png" style="width: 400px;"&gt;&lt;img src="https://community.incorta.com/t5/image/serverpage/image-id/2282i3F2ABF3387E3E272/image-size/medium?v=v2&amp;amp;px=400" role="button" title="PatrickRafferty_2-1687535990684.png" alt="PatrickRafferty_2-1687535990684.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 23 Jun 2023 16:00:06 GMT</pubDate>
      <guid>https://community.incorta.com/t5/administrative-discussions/upper-limit-on-datasets-for-a-table-and-how-are-the-processed/m-p/4559#M165</guid>
      <dc:creator>PatrickRafferty</dc:creator>
      <dc:date>2023-06-23T16:00:06Z</dc:date>
    </item>
    <item>
      <title>Re: Upper limit on datasets for a table ( and how are the processed ) ?</title>
      <link>https://community.incorta.com/t5/administrative-discussions/upper-limit-on-datasets-for-a-table-and-how-are-the-processed/m-p/4561#M166</link>
      <description>&lt;P&gt;&lt;a href="https://community.incorta.com/t5/user/viewprofilepage/user-id/24"&gt;@RADSr&lt;/a&gt; I am assuming your historical data (data for previous months) is also getting updated, which is why you are doing a full load of all the months using the approach you described.&lt;/P&gt;&lt;P&gt;If that is not the case and data for previous months does not change, then there are other ways to manage this load.&lt;/P&gt;</description>
      <pubDate>Fri, 23 Jun 2023 20:18:32 GMT</pubDate>
      <guid>https://community.incorta.com/t5/administrative-discussions/upper-limit-on-datasets-for-a-table-and-how-are-the-processed/m-p/4561#M166</guid>
      <dc:creator>anurag</dc:creator>
      <dc:date>2023-06-23T20:18:32Z</dc:date>
    </item>
    <item>
      <title>Re: Upper limit on datasets for a table ( and how are the processed ) ?</title>
      <link>https://community.incorta.com/t5/administrative-discussions/upper-limit-on-datasets-for-a-table-and-how-are-the-processed/m-p/4568#M168</link>
      <description>&lt;P&gt;The multi-source table is a good solution in this case, but is there a reason you must extract the months into separate Incorta tables and then glue them together in a multi-source table through SQLi? It would be better to create a multi-source table where each dataset extracts from the source directly; that way you eliminate one step (re-extraction through SQLi) and save cluster resources.&lt;/P&gt;&lt;P&gt;As for the number of data sets, there is no hard limit. I have seen customer use cases with 25+ data sets in a table, and I've heard of 60. Technically there's no limit, but you have to make sure the loader service has enough resources to execute the load.&lt;/P&gt;&lt;P&gt;When you load a table with multiple data sets, the loader will try to extract the data sets in parallel. This is governed by a thread pool shared by all data sets being extracted across the load job, which may contain other tables. So, expect parallel execution of the queries against your source.&lt;/P&gt;&lt;P&gt;In the CMC, you can set the number of parallel load jobs, but this applies to whole load jobs (the ones you set in the scheduler), which are usually schemas. The capacity of the thread pool used to extract tables and data sets depends on the CPU Utilization percentage you set for the loader service: the higher the percentage, the more tables (and data sets) can be extracted in parallel.&lt;/P&gt;</description>
      <pubDate>Sun, 25 Jun 2023 07:41:41 GMT</pubDate>
      <guid>https://community.incorta.com/t5/administrative-discussions/upper-limit-on-datasets-for-a-table-and-how-are-the-processed/m-p/4568#M168</guid>
      <dc:creator>AhmedM</dc:creator>
      <dc:date>2023-06-25T07:41:41Z</dc:date>
    </item>
    <item>
      <title>Re: Upper limit on datasets for a table ( and how are the processed ) ?</title>
      <link>https://community.incorta.com/t5/administrative-discussions/upper-limit-on-datasets-for-a-table-and-how-are-the-processed/m-p/4585#M171</link>
      <description>&lt;P&gt;TL/DR - making these modular will give me more flexibility and increase my ability to move corrective changes into the most-used dashboards in the shortest time possible.&lt;/P&gt;&lt;P&gt;------------&lt;/P&gt;&lt;P&gt;That said, the primary reason for staging into Incorta first is to meter/limit the activity against the source DB. The source is a view on a production box. The view is ( I gather ) pretty heavy and is completely "black box" to me, so I can't do any tuning or even recreate it w/in Incorta. In short, because it's production I don't want to be a burden on the system in a way which may affect operations.&lt;/P&gt;&lt;P&gt;Currently I have three schemas ( 2021, 2022, 2023 with 7, 12, and 12 tables respectively ) which I can run in series or independently, and which can be further broken down if need be.&lt;/P&gt;&lt;P&gt;So - use case: a change is made to move a product to a different category, and that change needs to be reflected ASAP for the WTD/MTD/QTD/YTD reporting. I can choose to run only the tables necessary for the immediate need *now* and be running correct dashboards very quickly, while correcting further history on my own schedule.&lt;/P&gt;</description>
      <pubDate>Tue, 27 Jun 2023 23:41:21 GMT</pubDate>
      <guid>https://community.incorta.com/t5/administrative-discussions/upper-limit-on-datasets-for-a-table-and-how-are-the-processed/m-p/4585#M171</guid>
      <dc:creator>RADSr</dc:creator>
      <dc:date>2023-06-27T23:41:21Z</dc:date>
    </item>
  </channel>
</rss>

