All,
I thought this was easy when I started doing it. I ended up using a dreaded cursor ( ) to achieve the expected result--I'm hoping someone has an idea of how it can be done more elegently. I'm reasonably sure it can be accomplished with a straight query, without cursors or temp tables (I could live with a temp table, actually). But, as I said, I eventually gave up and used a method I knew would work, since I had already pulled out so much hair getting to that point.
The actual problem (and the rules that require this query) is much more complex than the given example. It also involves a client's data, so posting the actual DDL is not an option. "Real" duplicate checking via constraints is also not possible in this situation, as the data must be allowed into the system so that it can be scrubbed later. The example itself may seem nonsensical, but it represents the problem fairly well.
Essentially, the operation is similar to duplicate checking. In fact, we call it a duplicate search. However, it's not quite that simple. The check for duplicates spans multiple tables, and will be used for reporting and taking action (not simply deleting).
The challenge:
Given a set of row PKs (a "batch"), find rows in the same table that have matching data in several columns across the same table and a related table. For each match, return the PK of the row that was checked for duplicates (gotcha #1 for me), and the PK of the row that was found as a duplicate. A row should not be considered as a duplicate of itself, but other rows in the batch can be duplicates of each other (gotcha #2 for me).
So, using the Northwinds Database as an example . . .
Given a batch of OrderIds, represented by the subquery:
SELECT OrderId from Orders where OrderID < 10500
Find records that have matching values in the columns: CustomerId, ShipVia, and ProductId.
Return matches as follows:
OrderId Matching_Order_Id
165----------375
445----------542
332----------110
(without the dashes--I can't seem to get those to space out right with spaces).
The following query shows that such duplicates exist, at minimum, within the batch:
SELECT COUNT(o.orderId), customerId, shipvia, productId
FROM orders o INNER JOIN [order details] od ON o.OrderId = od.OrderId
WHERE o.OrderID < 10500
GROUP BY customerId, shipvia, productId
HAVING COUNT(o.orderId) > 1
Thanks for any insight.
Phil