Join, projection and take

Posts   
 
    
Deividas
User
Posts: 44
Joined: 01-Oct-2010
# Posted on: 10-Nov-2010 21:35:34   

Hello. I have this query:

(from customer in metaData.Customer
join contact in metaData.Contact on customer.ContactId equals contact.Id
select new
{
    CustomerId = customer.Id
}).Take(10)

The generated sql for it is:

SELECT [LPA_L1].[CustomerId] 
FROM ( [DB].[dbo].[Customer] [LPA_L1]  
INNER JOIN [DB].[dbo].[Contact] [LPA_L2]  ON  [LPA_L1].[ContactId] = [LPA_L2].[ContactId])

As you can see, the generated sql doesn't have the TOP(10). This situation only occurs when join and projection are involved.

LinqSupportClasses: 3.0.10.1105 ORMSupportClasses: 3.0.10.1104

Otis avatar
Otis
LLBLGen Pro Team
Posts: 39908
Joined: 17-Aug-2003
# Posted on: 10-Nov-2010 22:24:41   

Add .Distinct() to the query, as it doesn't do serverside take with this, as there are duplicates in the resultset (so Take(10) will return duplicates).

Frans Bouma | Lead developer LLBLGen Pro
Deividas
User
Posts: 44
Joined: 01-Oct-2010
# Posted on: 10-Nov-2010 22:52:45   

Otis wrote:

Add .Distinct() to the query, as it doesn't do serverside take with this, as there are duplicates in the resultset (so Take(10) will return duplicates).

Indeed adding distinct worked. But how can this query produce duplicates? customer.Id and contact.Id are primary keys. And why does it work as expected without projection then?

daelmo avatar
daelmo
Support Team
Posts: 8245
Joined: 28-Nov-2005
# Posted on: 11-Nov-2010 05:32:36   

When use projection and joins, LLBLGen the query indeed could returns duplicates, the ting is, this isn't know when you build the query.

David Elizondo | LLBLGen Support Team
Deividas
User
Posts: 44
Joined: 01-Oct-2010
# Posted on: 11-Nov-2010 09:02:54   

OK, let assume that a query does return duplicate results. In that case what is the logic behind removing the TOP(10)? The query would return 10 results with duplicates, which could be exactly what I'm expecting so why is that bad in any way?

Otis avatar
Otis
LLBLGen Pro Team
Posts: 39908
Joined: 17-Aug-2003
# Posted on: 11-Nov-2010 10:13:00   

Deividas wrote:

OK, let assume that a query does return duplicate results. In that case what is the logic behind removing the TOP(10)? The query would return 10 results with duplicates, which could be exactly what I'm expecting so why is that bad in any way?

That's a discussion which has been held a couple of times, so I won't rehash it here simple_smile . Our framework from the beginning has been focusing on returning the data you are interested in. So if you do a Take(10) (or in our own api fetch 10 rows), you don't want 10 duplicate rows, but 10 rows. E.g. when you fetch customers with a filter on order, using a join with order, you get duplicate rows. Fetching the customer entities will filter out duplicates. This is done across the board, so also for projections.

The same thing is true for paging. If you page from 1 page to the other, you want actual data, not the same page with duplicate rows. As paging on the server side requires that the set which is paged has unique rows (otherwise it doesn't work), the query has to make sure there are unique rows. This requires Distinct.

This means that to do a server side limit, it has to be sure there are no duplicate rows. If distinct hasn't been specified (which is the case in a Linq query for example), and joins are in the query, it can't guarantee that the set has unique rows, so it will switch to client-side limiting and distinct filtering. Specifying Distinct() makes this work. Client-side filtering/limiting uses the datareader to filter/limit rows, and fetches 1 row at a time from the datareader, so it doesn't fetch the complete set.

Your query could result in duplicates: select customerid from customers join orders on customers.customerid = orders.customerid

gives duplicate customerids, because the join is a 1:n.

This behavior is in our framework for many years and therefore we can't (and actually we won't) change it. We wont change it because it's consistent with the entity fetch logic we have.

However, as some of you want to get duplicate rows with a take, page (I really have no idea why on earth one wants that, but alas... ), we'll add a feature to 3.1 which allows you to define the behavior of distinct in limited projections (like Take(), paging). In practice this won't be something you'd need however, as the data will be used somewhere and if you have to deal with duplicate rows (which have no meaning. If you want to count things, run a groupby with count), this always leads to unwanted behavior: the duplicate data has no meaning, it's the same as the other data already read.

Frans Bouma | Lead developer LLBLGen Pro
Deividas
User
Posts: 44
Joined: 01-Oct-2010
# Posted on: 11-Nov-2010 10:48:13   

Your query could result in duplicates: select customerid from customers join orders on customers.customerid = orders.customerid

gives duplicate customerids

This query is different than mine. I can see how this one can return duplicate results, but I still don't see how mine could. In this query there can be several orders with the same customerid. In my query there can be only one contact for the same customer so there can be no duplicate customerids.

Also LLBLGen is not consistent with my query because if I specify Skip(10).Take(10) or TakePage(1, 10) (which is the same as Take(10) but Take(10) doesn't work) or TakePage(2, 10) then the generated sql contains the logic for skipping and taking.

Otis avatar
Otis
LLBLGen Pro Team
Posts: 39908
Joined: 17-Aug-2003
# Posted on: 11-Nov-2010 11:54:04   

Deividas wrote:

Your query could result in duplicates: select customerid from customers join orders on customers.customerid = orders.customerid

gives duplicate customerids

This query is different than mine. I can see how this one can return duplicate results, but I still don't see how mine could. In this query there can be several orders with the same customerid. In my query there can be only one contact for the same customer so there can be no duplicate customerids.

True, but that's currently a limitation. We did research whether it is possible to determine if a query results in duplicates. Our conclusion was: no. Your situation is one where it is determinable, but if you instead of the pk in the 'n' side of the 1:n relationship you fetch the pk in the '1' side, the join looks the same, but the query will result in duplicates. So more info is needed. How much is unclear at this point. We have scheduled to do follow up research on this, which is targeted towards how much data is needed to solve this, if there's a way to solve this, and if we can provide this data at runtime in all situations. In the current code we can't hence we couldn't fix this at this point, the DQE simply hasn't enough info to make the decision. To see this, imagine there are multiple fields in the PK of the entity, instead of the 1 you're fetching, query stays the same. This means that suddenly there ARE duplicates possible. For the DQE however, it doesn't know whether this field, which is a pk (it knows that) is the only one.

Also LLBLGen is not consistent with my query because if I specify Skip(10).Take(10) or TakePage(1, 10) (which is the same as Take(10) but Take(10) doesn't work) or TakePage(2, 10) then the generated sql contains the logic for skipping and taking.

Hmm, you're right, you could say this is inconsistent with the Take(n) behavior.

The workitem to create a setting for forcing server-side TOP without distinct has been updated to take into account we already do the behavior when using paging. We can't change the .Take() behavior to the way you might have expected at this point (in the middle of a release) as it's a breaking change, so we do this on a version release (e.g. 3.1)

For these kind of queries, always specify Distinct(), it will get you with paging the results you want, and with Take() a query ran on the db.

Frans Bouma | Lead developer LLBLGen Pro
Deividas
User
Posts: 44
Joined: 01-Oct-2010
# Posted on: 11-Nov-2010 12:26:37   

Well there's one more inconsistency here. If I select customers without projection, then the query still can produce duplicates but this time Take(10) works on the server side. So taking into account this inconsistency I would consider this a bug, not a breaking change, because this is in essence the same query behaving in 2 different ways. wink

Otis avatar
Otis
LLBLGen Pro Team
Posts: 39908
Joined: 17-Aug-2003
# Posted on: 11-Nov-2010 14:23:36   

Deividas wrote:

Well there's one more inconsistency here. If I select customers without projection, then the query still can produce duplicates but this time Take(10) works on the server side. So taking into account this inconsistency I would consider this a bug, not a breaking change, because this is in essence the same query behaving in 2 different ways. wink

Take(10) works on the server for entities as these are always fetched with 'allowDuplicates' == false, which means Distinct is always inside the query, if you specify it or not. This is because fetching entities won't produce duplicates. A projection could or could not produce duplicates so you have to actively specify it. This is a different thing, so not the same query simple_smile

Changing the Take() behavior IS a breaking change, because when we will change that (which will be a first, since we introduced projections in 2.0), all applications which rely on this will break. I.o.w.: no, we don't do that. simple_smile

In fact, I'd argue that the paging on the server with duplicates is a behavior which is inconsistent. But alas, it's a subjective thing, and thus a waste of time to get consensus over that. The same goes for Take()/Paging over an unordered set, which is also useless, but we don't test for that nor do we auto-add an order by like linq to sql does.

However, whatever one finds true or not, it's largely irrelevant when you consider the amount of code out there relying on the current behavior, so we don't change that now. What we will do is make it easier to specify what behavior you'd like to see: keep it as is (default) or more in line of what other o/r mappers do: simply return whatever the developer asked for, and act on the fact whether distinct was explicitly specified or not. For entities however, we always will implicitly specify distinct, so that's not affected by this. The main reason for that is that the framework in all cases has to avoid duplicate entity instances being created so if we don't specify distinct ourselves, the developer using the framework will have to do that every time anyway, something which is really a big pain in situations where the queries are generated for you, e.g. in services, grids with linq controls etc. it's then really nice to have auto-distinct on entities without you having to specify it everywhere, as it makes no sense to return multiple times the same entity in different objects in the same collection (semantically).

Frans Bouma | Lead developer LLBLGen Pro