A SQL Challange #2 (in SQL Server 2000)

Posts   
 
    
psandler
User
Posts: 540
Joined: 22-Feb-2005
# Posted on: 06-Dec-2005 05:15:11   

All,

I thought this was easy when I started doing it. I ended up using a dreaded cursor ( flushed ) to achieve the expected result--I'm hoping someone has an idea of how it can be done more elegently. I'm reasonably sure it can be accomplished with a straight query, without cursors or temp tables (I could live with a temp table, actually). But, as I said, I eventually gave up and used a method I knew would work, since I had already pulled out so much hair getting to that point.

The actual problem (and the rules that require this query) is much more complex than the given example. It also involves a client's data, so posting the actual DDL is not an option. "Real" duplicate checking via constraints is also not possible in this situation, as the data must be allowed into the system so that it can be scrubbed later. The example itself may seem nonsensical, but it represents the problem fairly well.

Essentially, the operation is similar to duplicate checking. In fact, we call it a duplicate search. However, it's not quite that simple. The check for duplicates spans multiple tables, and will be used for reporting and taking action (not simply deleting).

The challenge:

Given a set of row PKs (a "batch"), find rows in the same table that have matching data in several columns across the same table and a related table. For each match, return the PK of the row that was checked for duplicates (gotcha #1 for me), and the PK of the row that was found as a duplicate. A row should not be considered as a duplicate of itself, but other rows in the batch can be duplicates of each other (gotcha #2 for me).

So, using the Northwinds Database as an example . . .

Given a batch of OrderIds, represented by the subquery:

SELECT OrderId from Orders where OrderID < 10500

Find records that have matching values in the columns: CustomerId, ShipVia, and ProductId.

Return matches as follows:

OrderId Matching_Order_Id

165----------375 445----------542 332----------110

(without the dashes--I can't seem to get those to space out right with spaces). simple_smile

The following query shows that such duplicates exist, at minimum, within the batch:

SELECT COUNT(o.orderId), customerId, shipvia, productId
FROM orders o INNER JOIN  [order details] od ON o.OrderId = od.OrderId
WHERE o.OrderID < 10500
GROUP BY customerId, shipvia, productId
HAVING COUNT(o.orderId) > 1

Thanks for any insight.

Phil

pilotboba
User
Posts: 434
Joined: 05-Aug-2005
# Posted on: 06-Dec-2005 16:42:42   

psandler wrote:

The challenge:

Given a set of row PKs (a "batch"), find rows in the same table that have matching data in several columns across the same table and a related table. For each match, return the PK of the row that was checked for duplicates (gotcha #1 for me), and the PK of the row that was found as a duplicate. A row should not be considered as a duplicate of itself, but other rows in the batch can be duplicates of each other (gotcha #2 for me).

 SELECT DISTINCT o1.OrderID, o2.OrderID Duplicate
FROM (SELECT Orders.OrderID, CustomerID, ShipVia, ProductID
        FROM Orders 
        JOIN [Order Details] ON Orders.OrderID = [Order Details].OrderID) o1
JOIN (SELECT Orders.OrderID, CustomerID, ShipVia, ProductID
        FROM Orders 
        JOIN [Order Details] ON Orders.OrderID = [Order Details].OrderID) o2
    on o1.CustomerId = o2.CustomerID 
    AND o1.ShipVia = o2.ShipVia
    AND o1.ProductId = o2.ProductId
    AND o1.OrderID <> o2.OrderID
WHERE o1.OrderID < 10500

I'm not 100% sure of a few things...

  1. This will get you a list of all the OrderIDs that are less than 10500 that have duplicates of ANY order number. If you only want to find duplicates in the same set of orders you can add "AND o2.OrderID < 10500" to the WHERE.

  2. This will show the inverses also. So, if order 1 had a duplicate of order 2, then order 2 will show order 1 as its duplicate. I can't think of a simple way to remove the inverse in this single select. However, I'm not 100% sure what you are after, but I think you may want the inverse. It depends on what you are doing with this data.

BOb

psandler
User
Posts: 540
Joined: 22-Feb-2005
# Posted on: 06-Dec-2005 20:57:35   

pilotboba wrote:

 SELECT DISTINCT o1.OrderID, o2.OrderID Duplicate
FROM (SELECT Orders.OrderID, CustomerID, ShipVia, ProductID
        FROM Orders 
        JOIN [Order Details] ON Orders.OrderID = [Order Details].OrderID) o1
JOIN (SELECT Orders.OrderID, CustomerID, ShipVia, ProductID
        FROM Orders 
        JOIN [Order Details] ON Orders.OrderID = [Order Details].OrderID) o2
    on o1.CustomerId = o2.CustomerID 
    AND o1.ShipVia = o2.ShipVia
    AND o1.ProductId = o2.ProductId
    AND o1.OrderID <> o2.OrderID
WHERE o1.OrderID < 10500

My face is red at how simple this turned out to be. I spent about 4 hours on it, and I consider myself to be an expert at SQL (at least my co-workers don't read this forum smile ).

pilotboba wrote:

I'm not 100% sure of a few things...

  1. This will get you a list of all the OrderIDs that are less than 10500 that have duplicates of ANY order number. If you only want to find duplicates in the same set of orders you can add "AND o2.OrderID < 10500" to the WHERE.

  2. This will show the inverses also. So, if order 1 had a duplicate of order 2, then order 2 will show order 1 as its duplicate. I can't think of a simple way to remove the inverse in this single select. However, I'm not 100% sure what you are after, but I think you may want the inverse. It depends on what you are doing with this data.

Nope, what you did was 100% to the spec of what I am trying to do.

After re-writing my queries, the new (non-cursor) way is much more efficient (40% faster, which will likely increase as more data gets entered).

Thanks much!!!

Phil

alexdresko
User
Posts: 336
Joined: 08-Jun-2004
# Posted on: 06-Dec-2005 21:44:48   

pilotboba wrote:

 SELECT DISTINCT o1.OrderID, o2.OrderID Duplicate
FROM (SELECT Orders.OrderID, CustomerID, ShipVia, ProductID
        FROM Orders 
        JOIN [Order Details] ON Orders.OrderID = [Order Details].OrderID) o1
JOIN (SELECT Orders.OrderID, CustomerID, ShipVia, ProductID
        FROM Orders 
        JOIN [Order Details] ON Orders.OrderID = [Order Details].OrderID) o2
    on o1.CustomerId = o2.CustomerID 
    AND o1.ShipVia = o2.ShipVia
    AND o1.ProductId = o2.ProductId
    AND o1.OrderID <> o2.OrderID
WHERE o1.OrderID < 10500

Pretty slick.. I would have ended up wasting time doing the check on the client.

Walaa avatar
Walaa
Support Team
Posts: 14987
Joined: 21-Aug-2005
# Posted on: 07-Dec-2005 14:37:42   

That's cool

one more suggestion that might increase performance over large number of records is to move the "WHERE o1.OrderID < 10500" and the "WHERE o2.OrderID < 10500" if used into the inner selects.

Then the result would be:

SELECT DISTINCT o1.OrderID, o2.OrderID Duplicate
FROM (SELECT Orders.OrderID, CustomerID, ShipVia, ProductID
        FROM Orders 
        JOIN [Order Details] ON Orders.OrderID = [Order Details].OrderID
        WHERE Orders.OrderID < 10500 
        ) o1
JOIN (SELECT Orders.OrderID, CustomerID, ShipVia, ProductID
        FROM Orders 
        JOIN [Order Details] ON Orders.OrderID = [Order Details].OrderID
        WHERE Orders.OrderID < 10500 
        ) o2
    on o1.CustomerId = o2.CustomerID 
    AND o1.ShipVia = o2.ShipVia
    AND o1.ProductId = o2.ProductId
    AND o1.OrderID <> o2.OrderID
psandler
User
Posts: 540
Joined: 22-Feb-2005
# Posted on: 07-Dec-2005 16:05:16   

Walaa wrote:

That's cool

one more suggestion that might increase performance over large number of records is to move the "WHERE o1.OrderID < 10500" and the "WHERE o2.OrderID < 10500" if used into the inner selects.

Yep, this is the way I ended up doing it.

On the real data, this subquery method (as opposed to the cusor method) was MUCH slower until I used DISTINCT on the subquery SELECTs. I think because it eliminates records and reduces the hash-join.