Monday, 22 June 2015

REMOVE DUPLICATE Stage in Datastage

  • This stage helps to remove duplicate records from data set depending on the key column. In this, two rows are supposed to be duplicate if they have identical values in key columns. We can define more than one key column.
  • Input data should be sorted for this stage so that all records of having similar key values will be adjacent. Instead of adding separate ‘Sort stage’, we can perform ‘Link Level Sort’.
Let’s try to implement Remove Duplicate stage in Datastage:

Consider a following Table
ID
First Name
Last Name
Location
Network ID
EmailID
Date
1
Jach
Simmons
Chicago
JS524
Letsc@gmail.com
20140506
2
Shumas
Jane
LA
Sj145
Jaene@ymail.com
20140506
8
Shumas
Jane
LA
Sj145
Jaene@ymail.com
20140506
3
Jonty
Waughn
Sydney
JW927
JontyW@sdbh.com
20140506
3
Jonty
Waughn
Sydney
JW927
JontyW@sdbh.com
20130506
3
Jonty
Waughn
Melbourn
SP417
JontyW@sdbh.com
20150506
4
Suhana
Safar
Maxico
SS99
Sas@gmail.com
20180506
We are going to select Key column as ‘ID’. In this table ID=2 and ID=8 have same data and for ID=3 there change in some record. Let’s see, what is going to happen with these records!
Here we have added ‘Date’ column to understand the function of ‘Duplicate To Retain’ option’s function. Will discuss in step 6.
Step 1: Design Your job structure like below

Step 2: Configure Employee detail file in which above tables is saved with the sequential file sf_emp_det and change the ‘Date Format’ to %yyyy%mm%dd. For reference see article “Sequential file in datastage”
Step 3: Double click on Remove duplicate stage i.e. here we named it as rd_on_ID. Below window will pop up.
2
Select ‘Key=ID’ under ‘Keys that defines Duplicates’. One can select multiple keys from ‘Available properties to add’ pane shown in right bottom corner.
‘Duplicate To Retain’ option provides facility to retain ‘First’ or ‘Last’ record for particular Duplicate key.
Step 4: As we are doing link level sort, it shows yellow ‘sort’ symbol on link ‘lnk_emp_det’. To perform this sort go under ‘Input’ tab in which select ‘Partition type’ as anyone (here ‘Hash’) under ‘Partitioning’ tab. Select column on which you want to perform sort from ‘Available’ left pane and check ‘Perform sort’ box as shown below.
Select ‘Date’ only for ‘sorting’ by right click on ‘Date’ in right pane as shown, no need of partitioning.
3
Step 5: Map all the required output column under ‘Output’ tab as shown.
4
Step 6: Now let’s what is the output. Just compile and run the job.
Output:

There was three record for ‘ID=3’ and as we are sorting data in ascending order and ‘Duplicate to Retain=First’, Remove Duplicate stage retain the latest according to dates mentioned. Here
3
Jonty
Waughn
Sydney
JW927
JontyW@sdbh.com
20130506
If we select ‘Duplicate to Retain=Last’
3
Jonty
Waughn
Melbourn
SP417
JontyW@sdbh.com
20150506

This will reflect in output.

No comments:

Post a Comment