Missing DISTINCT clause

Posts   
 
    
maf123sp
User
Posts: 26
Joined: 09-Jan-2013
# Posted on: 18-Jan-2017 09:43:40   

I have the following snippet: var qf = new QueryFactory(); var q1 = qf.TabA // .Select(TabAFields.ParamRef) // .Distinct(); Factory.DataAccessAdapter.DeleteEntitiesDirectly(typeof (TabBEntity), new RelationPredicateBucket(TabBFields.Ak.NotIn(q1)));

I expect to get finally the SQL query: DELETE FROM "TAB_B" WHERE ((NOT "TAB_B"."AK" IN (SELECT DISTINCT "TAB_A"."PARAMREF" AS "ParamRef" FROM "TAB_A")))

But the DISTINCT clause is missing.

Now I looked into my query in detail and found, that the following command is generated the non-expected code: var entities2 = Factory.DataAccessAdapter.FetchQuery(qf.TabBWhere(TabBFields.Ak.NotIn(q1))); SELECT "TAB_B"."AK" AS "Ak", "TAB_B"."NAME" AS "Key" FROM "TAB_B" WHERE ((NOT "TAB_B"."AK" IN (SELECT "TAB_A"."PARAMREF" AS "ParamRef" FROM "TAB_A")))

Is there something I do wrong or is this a bug in LLBLGen Pro. I'm using the version 4.2.

(Background: there is a huge difference in the performance of the final query on DB side.)

Thanks for your feedback.

Otis avatar
Otis
LLBLGen Pro Team
Posts: 39614
Joined: 17-Aug-2003
# Posted on: 18-Jan-2017 10:50:51   

The difference in speed isn't due to the missing distinct (this is sqlserver, but I doubt it's different on another database like oracle): http://www.sqlservercentral.com/blogs/sqlinthewild/2011/01/18/distincting-an-in-subquery/

Likely the one without distinct ran first, pulled the data from disk, then the one with distinct ran on hot data.

The fieldl.NotIn(query) method creates a FieldCompareSetPredicate and it simply reuses the field in the projection, ignoring the distinct (as it's not needed).

What can speed up things a lot is adding an index on "TAB_A"."PARAMREF". Distinct doesn't help, as it has to read the whole set of values anyway: applying distinct over that only causes more performance loss as the distinct operation does take time and isn't needed for the IN clause.

Frans Bouma | Lead developer LLBLGen Pro
maf123sp
User
Posts: 26
Joined: 09-Jan-2013
# Posted on: 18-Jan-2017 13:31:08   

Thanks for your answer. - but I completely disagree. (I'm using ORACLE)

In my case the DB optimizer is using a FULL TABLE scan in case without DISTINCT: As soon I have the DISTINCT clause the optimizer changes to use the existing index on PARAMREF and can switch to a FULL INDEX SCAN. Depending on the number of distinct entries this can be an extrem difference - in my case a diferent of costs 800/1 and in memory 5000:1.

This distinct clause is a very important tool for query optimization.

Otis avatar
Otis
LLBLGen Pro Team
Posts: 39614
Joined: 17-Aug-2003
# Posted on: 18-Jan-2017 15:00:06   

maf123sp wrote:

Thanks for your answer. - but I completely disagree. (I'm using ORACLE)

In my case the DB optimizer is using a FULL TABLE scan in case without DISTINCT: As soon I have the DISTINCT clause the optimizer changes to use the existing index on PARAMREF and can switch to a FULL INDEX SCAN. Depending on the number of distinct entries this can be an extrem difference - in my case a diferent of costs 800/1 and in memory 5000:1.

hmmm... so without the distinct it doesn't use the index? that's pretty silly, but not something you can change, I guess.

This distinct clause is a very important tool for query optimization.

... for when the 'optimizer' is away on vacation, you mean? wink But Ok, in situations like this, you probably have to.

The particular method you're using uses an optimization here where it finds the query doesn't need wrapping and simply passes on the individual fragments. This skips the Distinct.

What you can do is the following (this is allowed btw, so no licensing issues here): copy the FieldCompareSetPredicate.cs class to your own project. Rename it to a different name, e.g. FieldCompareSetWithDistinctPredicte.cs.

Add a boolean property: UseDistinct.

In the ToQueryText method of the class, you'll see:


IRetrievalQuery subQuery = this.DatabaseSpecificCreator.CreateSubQuery(setFieldList, setFieldPersistenceInfo, _setFilter, _maxNumberOfItemsToReturn, _setSorter, _setRelations, _groupByClause);

Change it to:


IRetrievalQuery subQuery = this.DatabaseSpecificCreator.CreateSubQuery(setFieldList, setFieldPersistenceInfo, _setFilter, _maxNumberOfItemsToReturn, _setSorter, _setRelations, _groupByClause, this.UseDistinct);

In your query, use it like:


Factory.DataAccessAdapter.DeleteEntitiesDirectly(typeof (TabBEntity), new RelationPredicateBucket(
new FieldCompareSetWithDistinctPredicate(TabBFields.Ak, null, TabAFields.ParamRef, null, SetOperator.In, null) { UseDistinct = true, Negate=true}));

It's a little lowlevel, but it gets the job done. I'll add a workitem to v5.2 to honor Distinct in this situation, as it's currently a problem on Oracle.

Frans Bouma | Lead developer LLBLGen Pro
maf123sp
User
Posts: 26
Joined: 09-Jan-2013
# Posted on: 18-Jan-2017 16:06:48   

Thanks for your fast response!

The optimizer follows the statistics of the table. In my case there are about 50 entries in table Tab_B and more than 1 million in table TAB_A - which means, the table TAB_A has 1 million entries and the index of PARAMREF has on level 1 only 50 entries. This forces the optimizer to use only the index for the scan. Since I use only the indexed value the RDBMS doesn't have to access the table itself - all data are already contained in the index entry.

Without the DISTINCT all values of the inner query might be important - and therefore the optimizer is using the FULL TABLE scan and join these values with the seconds table (memory, costs).

I like your solution - this is what I have seen always from your side. Perfect and thanks a lot for your help!!!

I'm using LLBLGEn for several years know, and this is the best ORM framework I've even seen.

Otis avatar
Otis
LLBLGen Pro Team
Posts: 39614
Joined: 17-Aug-2003
# Posted on: 18-Jan-2017 17:03:05   

maf123sp wrote:

Thanks for your fast response!

The optimizer follows the statistics of the table. In my case there are about 50 entries in table Tab_B and more than 1 million in table TAB_A - which means, the table TAB_A has 1 million entries and the index of PARAMREF has on level 1 only 50 entries. This forces the optimizer to use only the index for the scan. Since I use only the indexed value the RDBMS doesn't have to access the table itself - all data are already contained in the index entry.

Without the DISTINCT all values of the inner query might be important - and therefore the optimizer is using the FULL TABLE scan and join these values with the seconds table (memory, costs).

Ah ok simple_smile

I like your solution - this is what I have seen always from your side. Perfect and thanks a lot for your help!!!

I'm using LLBLGEn for several years know, and this is the best ORM framework I've even seen.

Thanks for the kind words smile Glad it's solved! simple_smile

Frans Bouma | Lead developer LLBLGen Pro
Otis avatar
Otis
LLBLGen Pro Team
Posts: 39614
Joined: 17-Aug-2003
# Posted on: 28-Feb-2017 12:41:40   

Implemented in v5.2

Frans Bouma | Lead developer LLBLGen Pro