Subqueries vs. Joins

Posts   
 
    
psandler
User
Posts: 540
Joined: 22-Feb-2005
# Posted on: 11-Aug-2005 17:44:34   

Ok, this isn't LLBL Gen related (although discussion of this did come up in a thread here in the past about LLBL-generated SQL).

I was writing a demo case for a small best practices document at work, and I was planning on using test cases like these (using Northwind):

 SELECT 
    o.orderid,
    o.orderdate,
    (SELECT companyname FROM shippers s WHERE s.shipperid = o.shipvia) companyname,
    (SELECT lastname + ', ' + firstname FROM employees e WHERE e.employeeid = o.employeeid) FullName
FROM 
    orders o
WHERE
    o.orderdate between '1-1-1996' and '12-31-1996'
ORDER BY 
    o.orderdate
SELECT 
    o.orderid,
    o.orderdate,
    s.companyname,
    e.lastname + ', ' + e.firstname FullName
FROM 
    orders o
    INNER JOIN shippers s ON s.shipperid = o.shipvia
    INNER JOIN employees e ON e.employeeid = o.employeeid
WHERE
    o.orderdate between '1-1-1996' and '12-31-1996'
ORDER BY 
    o.orderdate 

The first uses subqueries, the second uses joins.

The point was going to be: don't use a subquery when you can use a join, as it's less efficient.

If you run these queries as-is, the execution plan shows a (slightly) better "cost" efficiency for the one that uses joins.

However, if you remove the WHERE and ORDER BY clauses, the subquery has a (significantly!) better cost efficiency (running them together in QA and then checking the combined execution plan is the easiest way to get an idea of relative costs, if you want to try this yourself).

Anyone know why the subqueries would be more efficient in this case?

If the tables had 50,000 records in them, do you think the joins would be more efficient?

Otis avatar
Otis
LLBLGen Pro Team
Posts: 39801
Joined: 17-Aug-2003
# Posted on: 12-Aug-2005 10:03:00   

Your subquery testcase is one of the ways to use subqueries. This forum uses a view to pull the thread info with user stats etc. in one go for the thread list. When I was altering it to add the 'last post by user' feature, I had to add the userid and name to the select list. The query is not that huge, but spans several tables, so I first thought, well, lets add it with a subquery like you have done.

It was dogslow. Then I decided to join the table for tests and simply pull the fields from the joined list. it was very fast (the execution plans showed it). The dataset to read from is not that small, thousands of threads and tens of thousands of messages.

The thing was: a subquery in a select list is hard to optimize, and runs many times. A join can be optimized very well, when you know the size of the pool to merge with so you can limit the # of rows to work with later on (as with inner joins for example, there is no order in which you execute them).

Subqueries have advantages in queries where the selected data is from the entity on the 1 side in a 1:n relation and the filter is on the n side. With a join, this results in a lot of duplicate rows for the '1' side (picture customer 1:n order, and each customer has 100 orders, and you filter on a field in order. This gives a lot of duplicate rows for customer).

To filter these out, you need an extra step in the query. With subqueries you don't have to. With a proper index the filter can be very efficient (execution plan-wise) and you won't get any duplicate rows, which can be a great deal when there are a lot of instances on the 'n' side of the relation.

Frans Bouma | Lead developer LLBLGen Pro