How to prevent whole tables from being loaded?

Posts: 14
Joined: 12-Dec-2005
# Posted on: 29-Jan-2008 22:44:25   

Background: We are building a batch-processing application that will iterate through several hundred thousand to a couple of million insurance claims. Performance is of the highest priority, since even one extra database call per claim could mean half a million extra database calls over the whole run.

Currently, we grab a couple thousand claims into memory at a time and process those. Each of these claims has several related entities which we need to examine.

The problem: When I create a prefetch path to fetch all the related entities, LLBLGen produces subqueries that return the entire table. I understand that LLBLGen builds the subqueries differently when the number of root entities is low enough (50, I believe). Since I am dealing with several thousand claims at a time, the select statement pulls the whole table instead of only the rows matching the primary keys of the parent table. This is unacceptable, since those related tables could contain millions of records, which would exhaust memory.

Am I doing something wrong? Here is what my prefetch paths look like:

            
IPrefetchPath2 prefetchPath = new PrefetchPath2((int)EntityType.CmcClclClaimEntity);
prefetchPath.Add(CmcClclClaimEntity.PrefetchPathCmcGrgrGroup).SubPath.Add(CmcGrgrGroupEntity.PrefetchPathCmcGrreRelation);
prefetchPath.Add(CmcClclClaimEntity.PrefetchPathCmcCddlClLine).SubPath.Add(CmcCddlClLineEntity.PrefetchPathCmcCddcDnliCob);
prefetchPath.Add(CmcClclClaimEntity.PrefetchPathCmcCdmlClLine).SubPath.Add(CmcCdmlClLineEntity.PrefetchPathCmcCdcbLiCob);
prefetchPath.Add(CmcClclClaimEntity.PrefetchPathCmcClstStatus);

and here is what the resulting sql calls look like:


SELECT TOP 200 <SNIP> FROM [dbo].[CMC_CLCL_CLAIM]

SELECT <SNIP> FROM [dbo].[CMC_GRGR_GROUP]  
WHERE ( [dbo].[CMC_GRGR_GROUP].[GRGR_CK] IN (
SELECT [dbo].[CMC_CLCL_CLAIM].[GRGR_CK] AS [GrgrCk] FROM [dbo].[CMC_CLCL_CLAIM]))

The rest of the related queries...


The second table could contain millions of rows, so I obviously don't want to bring the whole thing back every time I fetch the next batch of claims. Is there anything I can do here? I'm using 1.0.2005.1, Adapter scenario.

Thanks, Matt Hornsby

Walaa
Walaa
Support Team
Posts: 14995
Joined: 21-Aug-2005
# Posted on: 30-Jan-2008 09:39:31   

This issue was reported here: http://www.llblgen.com/TinyForum/Messages.aspx?ThreadID=8904

It was solved in earlier builds of v2.5, but the change was reverted to the original behaviour; please check the following thread: http://www.llblgen.com/tinyforum/Messages.aspx?ThreadID=11159

Is there anything I can do here?

I recommend you process entities in batches of hundreds rather than thousands, e.g. 100 at a time.

Also, by setting ParameterizedPrefetchPathThreshold to an appropriate value, the prefetch path query can use an IN predicate that filters the related rows on the PKs of the root entities, rather than a subquery.
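For example, a sketch of what that could look like (assuming the generated DataAccessAdapter, a generated CmcClclClaimEntityFactory, and the paging overload of FetchEntityCollection; adjust names to your generated code):

```csharp
// Sketch: raise the threshold above the page size so the prefetch path
// emits a parameterized IN (@p1, @p2, ...) predicate instead of a subquery.
using (DataAccessAdapter adapter = new DataAccessAdapter())
{
    adapter.ParameterizedPrefetchPathThreshold = 200;

    EntityCollection claims = new EntityCollection(new CmcClclClaimEntityFactory());
    // Page 1, page size 200: each prefetch query then filters on at most
    // 200 PK values of the fetched root claims.
    adapter.FetchEntityCollection(claims, null, 0, null, prefetchPath, 1, 200);
}
```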

Otis
Otis
LLBLGen Pro Team
Posts: 39908
Joined: 17-Aug-2003
# Posted on: 30-Jan-2008 11:21:14   

Indeed, with the threshold you can tune the query to use the parameterized version. In your case I recommend starting with a value of 200 or so and doing the batching with pages of that size. Prefetch paths with paging require the page size to be below the threshold, so use the threshold to tweak the queries. The default of 50 is deliberately conservative: tests have shown that in general cases, 50 is roughly the point where parameterized IN queries get slower and subqueries get faster. However, if the subquery otherwise kills performance, a threshold of 500 might still be much faster in your case. SQL Server has a parameter limit of 2000 or so, so as long as you stay below 1500 or so, you're good.

Even if a query selects the entire table, that doesn't mean the entire table is loaded: the datareader fetches a batch at a time and reads only until enough entities have been materialized, for example when you have specified a limit.

In your case I definitely would look into using the threshold to tweak the query and use smaller batches, e.g. batches of 500 or 1000 max. using paging.

Frans Bouma | Lead developer LLBLGen Pro
Posts: 14
Joined: 12-Dec-2005
# Posted on: 30-Jan-2008 22:21:08   

Thank you both for your replies. I was initially hesitant to use the threshold because of some architectural considerations in my codebase. I have a base manager class that other managers inherit from; it exposes the DataAccessAdapter through the IDataAccessAdapter interface, which doesn't make ParameterizedPrefetchPathThreshold available.
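One way around that would be a cast back to the concrete type, something like this sketch (DataAccessAdapterBase is the base class that declares the property, if I read the docs right; this.Adapter is a hypothetical property on my base manager):

```csharp
// Sketch: IDataAccessAdapter doesn't expose the threshold, so cast the
// interface back to the concrete base class, which declares it.
DataAccessAdapterBase concreteAdapter = this.Adapter as DataAccessAdapterBase;
if (concreteAdapter != null)
{
    concreteAdapter.ParameterizedPrefetchPathThreshold = 500;
}
```

It works, but it leaks the concrete type into code that was deliberately written against the interface.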

Due to the sheer volume of data I am dealing with, running the entire batch could take a lot longer if I reduce the page size. What I am trying instead is a status indicator on the parent table which I update just prior to selecting everything: first update a batch of rows to set the indicator, then select all rows (and their related records) that have the indicator set. That indicator can be used in a filter expression which passes down through all the related subqueries.
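The fetch side of that idea looks roughly like this sketch (BatchId is the hypothetical indicator column on CMC_CLCL_CLAIM, and CmcClclClaimFields is the generated field helper; the tagging UPDATE is issued separately beforehand):

```csharp
// Sketch: fetch only the claims tagged with the current batch indicator.
// The prefetch subqueries inherit the root query's filter, so the related
// tables are filtered down to the same batch as well.
RelationPredicateBucket filter = new RelationPredicateBucket();
filter.PredicateExpression.Add(new FieldCompareValuePredicate(
    CmcClclClaimFields.BatchId, null, ComparisonOperator.Equal, currentBatchId));

EntityCollection claims = new EntityCollection(new CmcClclClaimEntityFactory());
adapter.FetchEntityCollection(claims, filter, 0, null, prefetchPath);
```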

I just finished putting something together that works this way, and it seems to be quite a bit quicker.

Thanks again for your help and inspiration!

Matt