Manage duplicated input Business Keys

May 2, 2013 at 7:36 AM
Edited May 2, 2013 at 8:12 AM
This is the issue when one batch of source data (in SCD Merge Wizard terms, one SELECT from the source table) contains duplicate rows, considering the combination of Business Key columns.

This is certainly an issue for the SSIS SCD transformation. If two "New" duplicate rows arrive in the same buffer, they will both be sent to the "New" output and the output table will end up with duplicate Business Keys.

According to my testing, this is also an issue for the T-SQL MERGE command as generated by the SCD Merge Wizard, also resulting in duplicate Business Keys in the target table.

My solution is to add a ROW_NUMBER to the source SELECT and filter for just the first row, e.g.
SELECT * FROM (
SELECT
   <source table column list>
    , ROW_NUMBER () OVER (
        PARTITION BY BusinessKey1 , BusinessKey2
        ORDER BY BusinessKey1 , BusinessKey2
        ) AS Business_Key_Row_Number
FROM SourceTable
) d1
WHERE Business_Key_Row_Number = 1
Could this solution be added automatically by the wizard - perhaps controlled by a Configuration option? It would be nice to be able to specify extra columns to be added to the ORDER BY clause to control which row is chosen (e.g by LastUpdateDate DESC to get the most recent row).

Thanks
Mike
Coordinator
Jun 4, 2013 at 2:29 PM
Edited Jun 4, 2013 at 2:29 PM
Hi Mike!

Sorry for late answer for this discussion, I was working on the application redesign to implement Type 3 transformation (this was my priority). In the new few days I will start fixing bugs and implementing these "additional" features.

Cheers,
Miljan
Coordinator
Jun 6, 2013 at 1:42 AM
Hi!

I started working on duplicate business keys check module.

Thank you so much for the solution provided above but I think it cannot be applied because in SCD Merge Wizard user has an option to define command text instead of the table / view and this command text can be more complicated than just simple select <field_1>, <field_2>,...,<field_n> from <table> statement: user can write common table expressions, grouping, where, etc. statements and with this in mind, it is very complicated to implement duplicate business key logic the way you proposed.

Nevertheless, duplicate business key logic can be implemented and I came up with two possible solutions:
  1. Using .NET DataTable and grouping the data there (using LINQ or similar technique), or
  2. Using temporary table which can be created on the source and querying this table
Let me know if you come up with some other solution.

Cheers,
Miljan