Datastage for Learners: REMOVE DUPLICATE Stage in Datastage

This stage helps to remove duplicate records from data set depending on the key column. In this, two rows are supposed to be duplicate if they have identical values in key columns. We can define more than one key column.
Input data should be sorted for this stage so that all records of having similar key values will be adjacent. Instead of adding separate ‘Sort stage’, we can perform ‘Link Level Sort’.

Let’s try to implement Remove Duplicate stage in Datastage:

Consider a following Table

ID	First Name	Last Name	Location	Network ID	EmailID	Date
1	Jach	Simmons	Chicago	JS524	Letsc@gmail.com	20140506
2	Shumas	Jane	LA	Sj145	Jaene@ymail.com	20140506
8	Shumas	Jane	LA	Sj145	Jaene@ymail.com	20140506
3	Jonty	Waughn	Sydney	JW927	JontyW@sdbh.com	20140506
3	Jonty	Waughn	Sydney	JW927	JontyW@sdbh.com	20130506
3	Jonty	Waughn	Melbourn	SP417	JontyW@sdbh.com	20150506
4	Suhana	Safar	Maxico	SS99	Sas@gmail.com	20180506

We are going to select Key column as ‘ID’. In this table ID=2 and ID=8 have same data and for ID=3 there change in some record. Let’s see, what is going to happen with these records!
Here we have added ‘Date’ column to understand the function of ‘Duplicate To Retain’ option’s function. Will discuss in step 6.
Step 1: Design Your job structure like below

Step 2: Configure Employee detail file in which above tables is saved with the sequential file sf_emp_det and change the ‘Date Format’ to %yyyy%mm%dd. For reference see article “Sequential file in datastage”
Step 3: Double click on Remove duplicate stage i.e. here we named it as rd_on_ID. Below window will pop up.

Select ‘Key=ID’ under ‘Keys that defines Duplicates’. One can select multiple keys from ‘Available properties to add’ pane shown in right bottom corner.
‘Duplicate To Retain’ option provides facility to retain ‘First’ or ‘Last’ record for particular Duplicate key.
Step 4: As we are doing link level sort, it shows yellow ‘sort’ symbol on link ‘lnk_emp_det’. To perform this sort go under ‘Input’ tab in which select ‘Partition type’ as anyone (here ‘Hash’) under ‘Partitioning’ tab. Select column on which you want to perform sort from ‘Available’ left pane and check ‘Perform sort’ box as shown below.
Select ‘Date’ only for ‘sorting’ by right click on ‘Date’ in right pane as shown, no need of partitioning.

Step 5: Map all the required output column under ‘Output’ tab as shown.

Step 6: Now let’s what is the output. Just compile and run the job.
Output:

There was three record for ‘ID=3’ and as we are sorting data in ascending order and ‘Duplicate to Retain=First’, Remove Duplicate stage retain the latest according to dates mentioned. Here

Jonty

Waughn

Sydney

JW927

JontyW@sdbh.com

20130506

If we select ‘Duplicate to Retain=Last’

Jonty

Waughn

Melbourn

SP417

JontyW@sdbh.com

20150506

This will reflect in output.

Datastage for Learners

Monday, 22 June 2015

REMOVE DUPLICATE Stage in Datastage

No comments:

Post a Comment