- Home
- General
- General Chat
A SQL Challange #2 (in SQL Server 2000)
Joined: 22-Feb-2005
All,
I thought this was easy when I started doing it. I ended up using a dreaded cursor ( ) to achieve the expected result--I'm hoping someone has an idea of how it can be done more elegently. I'm reasonably sure it can be accomplished with a straight query, without cursors or temp tables (I could live with a temp table, actually). But, as I said, I eventually gave up and used a method I knew would work, since I had already pulled out so much hair getting to that point.
The actual problem (and the rules that require this query) is much more complex than the given example. It also involves a client's data, so posting the actual DDL is not an option. "Real" duplicate checking via constraints is also not possible in this situation, as the data must be allowed into the system so that it can be scrubbed later. The example itself may seem nonsensical, but it represents the problem fairly well.
Essentially, the operation is similar to duplicate checking. In fact, we call it a duplicate search. However, it's not quite that simple. The check for duplicates spans multiple tables, and will be used for reporting and taking action (not simply deleting).
The challenge:
Given a set of row PKs (a "batch"), find rows in the same table that have matching data in several columns across the same table and a related table. For each match, return the PK of the row that was checked for duplicates (gotcha #1 for me), and the PK of the row that was found as a duplicate. A row should not be considered as a duplicate of itself, but other rows in the batch can be duplicates of each other (gotcha #2 for me).
So, using the Northwinds Database as an example . . .
Given a batch of OrderIds, represented by the subquery:
SELECT OrderId from Orders where OrderID < 10500
Find records that have matching values in the columns: CustomerId, ShipVia, and ProductId.
Return matches as follows:
OrderId Matching_Order_Id
165----------375 445----------542 332----------110
(without the dashes--I can't seem to get those to space out right with spaces).
The following query shows that such duplicates exist, at minimum, within the batch:
SELECT COUNT(o.orderId), customerId, shipvia, productId
FROM orders o INNER JOIN [order details] od ON o.OrderId = od.OrderId
WHERE o.OrderID < 10500
GROUP BY customerId, shipvia, productId
HAVING COUNT(o.orderId) > 1
Thanks for any insight.
Phil
Joined: 05-Aug-2005
psandler wrote:
The challenge:
Given a set of row PKs (a "batch"), find rows in the same table that have matching data in several columns across the same table and a related table. For each match, return the PK of the row that was checked for duplicates (gotcha #1 for me), and the PK of the row that was found as a duplicate. A row should not be considered as a duplicate of itself, but other rows in the batch can be duplicates of each other (gotcha #2 for me).
SELECT DISTINCT o1.OrderID, o2.OrderID Duplicate
FROM (SELECT Orders.OrderID, CustomerID, ShipVia, ProductID
FROM Orders
JOIN [Order Details] ON Orders.OrderID = [Order Details].OrderID) o1
JOIN (SELECT Orders.OrderID, CustomerID, ShipVia, ProductID
FROM Orders
JOIN [Order Details] ON Orders.OrderID = [Order Details].OrderID) o2
on o1.CustomerId = o2.CustomerID
AND o1.ShipVia = o2.ShipVia
AND o1.ProductId = o2.ProductId
AND o1.OrderID <> o2.OrderID
WHERE o1.OrderID < 10500
I'm not 100% sure of a few things...
-
This will get you a list of all the OrderIDs that are less than 10500 that have duplicates of ANY order number. If you only want to find duplicates in the same set of orders you can add "AND o2.OrderID < 10500" to the WHERE.
-
This will show the inverses also. So, if order 1 had a duplicate of order 2, then order 2 will show order 1 as its duplicate. I can't think of a simple way to remove the inverse in this single select. However, I'm not 100% sure what you are after, but I think you may want the inverse. It depends on what you are doing with this data.
BOb
Joined: 22-Feb-2005
pilotboba wrote:
SELECT DISTINCT o1.OrderID, o2.OrderID Duplicate FROM (SELECT Orders.OrderID, CustomerID, ShipVia, ProductID FROM Orders JOIN [Order Details] ON Orders.OrderID = [Order Details].OrderID) o1 JOIN (SELECT Orders.OrderID, CustomerID, ShipVia, ProductID FROM Orders JOIN [Order Details] ON Orders.OrderID = [Order Details].OrderID) o2 on o1.CustomerId = o2.CustomerID AND o1.ShipVia = o2.ShipVia AND o1.ProductId = o2.ProductId AND o1.OrderID <> o2.OrderID WHERE o1.OrderID < 10500
My face is red at how simple this turned out to be. I spent about 4 hours on it, and I consider myself to be an expert at SQL (at least my co-workers don't read this forum ).
pilotboba wrote:
I'm not 100% sure of a few things...
This will get you a list of all the OrderIDs that are less than 10500 that have duplicates of ANY order number. If you only want to find duplicates in the same set of orders you can add "AND o2.OrderID < 10500" to the WHERE.
This will show the inverses also. So, if order 1 had a duplicate of order 2, then order 2 will show order 1 as its duplicate. I can't think of a simple way to remove the inverse in this single select. However, I'm not 100% sure what you are after, but I think you may want the inverse. It depends on what you are doing with this data.
Nope, what you did was 100% to the spec of what I am trying to do.
After re-writing my queries, the new (non-cursor) way is much more efficient (40% faster, which will likely increase as more data gets entered).
Thanks much!!!
Phil
Joined: 08-Jun-2004
pilotboba wrote:
SELECT DISTINCT o1.OrderID, o2.OrderID Duplicate FROM (SELECT Orders.OrderID, CustomerID, ShipVia, ProductID FROM Orders JOIN [Order Details] ON Orders.OrderID = [Order Details].OrderID) o1 JOIN (SELECT Orders.OrderID, CustomerID, ShipVia, ProductID FROM Orders JOIN [Order Details] ON Orders.OrderID = [Order Details].OrderID) o2 on o1.CustomerId = o2.CustomerID AND o1.ShipVia = o2.ShipVia AND o1.ProductId = o2.ProductId AND o1.OrderID <> o2.OrderID WHERE o1.OrderID < 10500
Pretty slick.. I would have ended up wasting time doing the check on the client.
That's cool
one more suggestion that might increase performance over large number of records is to move the "WHERE o1.OrderID < 10500" and the "WHERE o2.OrderID < 10500" if used into the inner selects.
Then the result would be:
SELECT DISTINCT o1.OrderID, o2.OrderID Duplicate
FROM (SELECT Orders.OrderID, CustomerID, ShipVia, ProductID
FROM Orders
JOIN [Order Details] ON Orders.OrderID = [Order Details].OrderID
WHERE Orders.OrderID < 10500
) o1
JOIN (SELECT Orders.OrderID, CustomerID, ShipVia, ProductID
FROM Orders
JOIN [Order Details] ON Orders.OrderID = [Order Details].OrderID
WHERE Orders.OrderID < 10500
) o2
on o1.CustomerId = o2.CustomerID
AND o1.ShipVia = o2.ShipVia
AND o1.ProductId = o2.ProductId
AND o1.OrderID <> o2.OrderID
Joined: 22-Feb-2005
Walaa wrote:
That's cool
one more suggestion that might increase performance over large number of records is to move the "WHERE o1.OrderID < 10500" and the "WHERE o2.OrderID < 10500" if used into the inner selects.
Yep, this is the way I ended up doing it.
On the real data, this subquery method (as opposed to the cusor method) was MUCH slower until I used DISTINCT on the subquery SELECTs. I think because it eliminates records and reduces the hash-join.