- This stage helps to remove duplicate records from data set depending on the key column. In this, two rows are supposed to be duplicate if they have identical values in key columns. We can define more than one key column.
- Input data should be sorted for this stage so that all records of having similar key values will be adjacent. Instead of adding separate ‘Sort stage’, we can perform ‘Link Level Sort’.
Consider a following Table
ID
|
First Name
|
Last Name
|
Location
|
Network ID
|
EmailID
|
Date
|
1
|
Jach
|
Simmons
|
Chicago
|
JS524
|
Letsc@gmail.com
|
20140506
|
2
|
Shumas
|
Jane
|
LA
|
Sj145
|
Jaene@ymail.com
|
20140506
|
8
|
Shumas
|
Jane
|
LA
|
Sj145
|
Jaene@ymail.com
|
20140506
|
3
|
Jonty
|
Waughn
|
Sydney
|
JW927
|
JontyW@sdbh.com
|
20140506
|
3
|
Jonty
|
Waughn
|
Sydney
|
JW927
|
JontyW@sdbh.com
|
20130506
|
3
|
Jonty
|
Waughn
|
Melbourn
|
SP417
|
JontyW@sdbh.com
|
20150506
|
4
|
Suhana
|
Safar
|
Maxico
|
SS99
|
Sas@gmail.com
|
20180506
|
Here we have added ‘Date’ column to understand the function of ‘Duplicate To Retain’ option’s function. Will discuss in step 6.
Step 1: Design Your job structure like below
Step 2: Configure Employee detail file in which above tables is saved with the sequential file sf_emp_det and change the ‘Date Format’ to %yyyy%mm%dd. For reference see article “Sequential file in datastage”
Step 3: Double click on Remove duplicate stage i.e. here we named it as rd_on_ID. Below window will pop up.

Select ‘Key=ID’ under ‘Keys that defines Duplicates’. One can select multiple keys from ‘Available properties to add’ pane shown in right bottom corner.
‘Duplicate To Retain’ option provides facility to retain ‘First’ or ‘Last’ record for particular Duplicate key.
Step 4: As we are doing link level sort, it shows yellow ‘sort’ symbol on link ‘lnk_emp_det’. To perform this sort go under ‘Input’ tab in which select ‘Partition type’ as anyone (here ‘Hash’) under ‘Partitioning’ tab. Select column on which you want to perform sort from ‘Available’ left pane and check ‘Perform sort’ box as shown below.
Select ‘Date’ only for ‘sorting’ by right click on ‘Date’ in right pane as shown, no need of partitioning.

Step 5: Map all the required output column under ‘Output’ tab as shown.

Step 6: Now let’s what is the output. Just compile and run the job.
Output:
There was three record for ‘ID=3’ and as we are sorting data in ascending order and ‘Duplicate to Retain=First’, Remove Duplicate stage retain the latest according to dates mentioned. Here
3
|
Jonty
|
Waughn
|
Sydney
|
JW927
|
JontyW@sdbh.com
|
20130506
|
3
|
Jonty
|
Waughn
|
Melbourn
|
SP417
|
JontyW@sdbh.com
|
20150506
|
This will reflect in output.
No comments:
Post a Comment