Sunday, 21 June 2015

Merge Stage

Merge Stage:
The Merge stage is a processing stage. It can have any number of input links, a single output link, and the same number of reject links as there are update input links.(according to DS documentation)
Merge stage combines a mster dataset with one or more update datasets based on the key columns.the output record contains all the columns from master record plus any additional columns from each update record that are required.
A master record and update record will be merged only if both have same key column values.
The data sets input to the Merge stage must be key partitioned and sorted. This ensures that rows with the same key column values are located in the same partition and will be processed by the same node. It also minimizes memory requirements because fewer rows need to be in memory at any one time.
As part of preprocessing your data for the Merge stage, you should also remove duplicate records from the master data set. If you have more than one update data set, you must remove duplicate records from the update data sets as well.
Unlike Join stages and Lookup stages, the Merge stage allows you to specify several reject links. You can route update link rows that fail to match a master row down a reject link that is specific for that link. You must have the same number of reject links as you have update links. The Link Ordering tab on the Stage page lets you specify which update links send rejected rows to which reject links. You can also specify whether to drop unmatched master rows, or output them on the output data link.
Example :
Master dataset:
CUSTOMER_ID CUSTOMER_NAME
1 UMA
2 POOJITHA
Update dataset1


CUSTOMER_ID CITY ZIP_CODE SEX
1 CYPRESS 90630 M
2 CYPRESS 90630 F
Output:



CUSTOMER_ID CUSTOMER_NAME CITY ZIP_CODE SEX
1 UMA CYPRESS 90630 M
2 POOJITHA CYPRESS 90630 F
Merge stage configuration steps:

Options:
Unmatched Masters Mode:Keep means that unmatched rows (those without any updates) from the master link are output; Drop means that unmatched rows are dropped instead.
Warn On Reject Updates:True to generate a warning when bad records from any update links are rejected.
Warn On Unmatched Masters:True to generate a warning when there are unmatched rows from the master link.

Partitioning:Hash on both master input and update input as shown below:
Compile and run the job :

Scenario 2:
Remove a record from the updateds1 and check the output:

Check for the datastage warning in the job log as we have selected Warn on unmatched masters = TRUE
stg_merge,0: Master record (0) has no updates.
stg_merge,1: Update record (1) of data set 1 is dropped; no masters are left.
Scenarios 3:Drop unmatched master record and capture reject records from updateds1

Scenario 4:Insert a duplicate record with same customer id in the master dataset and check for the results.
Look at the output and it is clear that merge stage automatically dropped the duplicate record from master dataset.

Scenario 4:Added new updatedataset2 which contains following data.
Update Dataset2
CUSTOMER_ID CITIZENSHIP
1 INDIAN
2 AMERICAN

Still we have duplicate row in the master dataset.if you compile the job with above design you will get compilation error like below.

If you look ate the above figure you can see 2 rows in the output becuase we have a matching row for the customer_id = 2 in the updateds2 .

Scenario 5:add a duplicate row for customer_id=1 in updateds1 dataset.

Now we have duplicate record both in master dataset and updateds1.Run the job and check the results and warnings in the job log.

No change the results and merge stage automatically dropped the duplicate row.
Scenario 6:modify a duplicate row for customer_id=1 in updateds1 dataset with zipcode as 90630 instead of 90620.

Run the job and check output results.

I ran the same job multiple times and found the merge stage is taking first record coming as input from the updateds1 and dropping the next records with same customer id.

  • Like JOIN stage it can have any number of inputs but other than one input link, called as ‘Master’, remaining are refer as input Update link. It means MERGE stage  links has one ‘Master’ input link but any number of update input link and one output link but it has same number of Reject links as there are  Update input link.
  • Reject Link contains data from respective input Update link which are failed to match with Master.
Let’s try to implement this stage using two update links.
Consider following three tables :
Table1 : Employee Details
ID
First Name
Last Name
Location
Network ID
EmailID
1
Jach
Simmons
Chicago
JS524
Letsc@gmail.com
2
Shumas
Jane
LA
Sj145
Jaene@ymail.com
3
Jonty
Waughn
Sydney
JW927
JontyW@sdbh.com
4
Suhana
Safar
Maxico
SS99
Sas@gmail.com
7
abc
cde
egh
ijk785
lmn@gmail.com
Table 2 : Employee’s department details
ID
Dept
Dept Head
1
Electronics
Paul
3
CS
Jack
4
TS
Summur
5
IT
Sean
Table 3 : Employee’s training center details
ID
Training Cent
1
CKG
2
AMD
3
WC
Please insure that input to Merge stage must be sorted and key partitioned in case of large amount of data. As we are now dealing with small data, let’s focus on keeping data sorted.
Step 1 : Design job structure like this and configure these three tables with three different sequential files. Shown by  sf_emp_det, sf_dept_det, sf_training_cent_det
1
Step 2 : Rename Links name, stage name for understanding purpose. Let’s configure Merge stage. Double click on ‘Merge’. Window will pop up, in that under ‘Stage’ tab, declare Merge Key. In this case ‘Key = ID’ as shown. We can declare more than one merge key. There is a Sort Order option as key should be sorted for Merge stage.
Unmatched Masters Mode : This option provides to  keep or drop unmatched record from output. Refer step 5.
Warn On Unmatched Masters :This will give warning whenever there will be unmatched records from master link. Set to ‘True’ by default. ‘False’ will cause no warning to be receive.
Warn On Reject Updates :This will give warning whenever bad records from any update links are rejected. Set to ‘True’ by default. ‘False’ will cause no warning to be receive.
 2
Step 3 : Under ‘Link Ordering’ Tab we can select ‘Master’ and ‘Update’ links as shown. Left column shows for input links and right column shows for output links.
3
Step 4 :Map output columns under ‘Output’ tab as shown. We can map key column only from Master link. If we change Master link to lnk_Dept_det then we have to map key column from  lnk_Dept_det otherwise it will throw error while compiling.
4
Step 5 :
Master Merge output
For ‘Unmatched Masters Mode = Keep’ as discussed in step 2. Its showing matched as well as unmatched records with input update links.
5
For ‘Unmatched Masters Mode = Drop’. It will drop all unmatched record as shown.

6



Reject Links :These links will show Unmatched records from respective update link.
 For rjt_dept_det link :

For rjk_training_cent_det link


Difference between Join Stage and Merge Stage in Datastage

 Join stage and Merge stage have some different input requirements.

Based on the Requirement we use these Stages which is good for the performance.

We need to see weather we get good performance by using any stage in datastage.

And the stage supports the required inputs are not.


Lets say Join Stage= J.S And Merge Stage = M.S


J.S - The input names of the Join Stage are Left tables , Right Tables and Intermediate Tables. That means we call the left one as a Left table and right one as a Right table and remaining tables between these tables are call it as Intermediate tables. ( That can be any number of tables in between )


M.S - The input names of the Merge Stage are Master Tables and Update Tables. That means First table will be considered as a Master tables and remaining any number of tables are considered as a Update tables.

J.S - We can perform four types of Joins in Join Stage. That means it supports all the four types of Joins. They are
Inner Join
Left Outer Join
Right Outer Join
Full Outer Join

M.S - We can perform only two types of Joins in Merge Stage. That means it supports two types of Joins here. And they are
Inner Join
And Left Outer Join

J.S - The Input requirements of Join stage are

There will be a N-Inputs ( In the case of Left, Inner, Right Outer Joins)
There will be a 2 Inputs ( In the case of Full Outer Join)

And there will be a 1 Output link and
there will be no reject links in Join stage.

M.S - The input requirements of Merge Stage are as follows

There will be a N-Inputs
1 Output
And N-1 Reject Links.

J.S - And Coming to Memory type. This is light memory Usage

M.S - It is also a Light Memory Usage

Only LookUp Stage is considered as a Heavy Memory usage


J.S - Key Column Names should be Same. That is Primary record should be same with Secondary Records

M.S - Key column names should be same here too. That is Primary records should be same with Secondary Records.

T
e Inner Join Type are as follows

J.S - Primary Records Should match with all secondary

i.e

M.S - Primary Records should match with any secondary.

The Input requirements with respect to Sorting are as follows.

J.S - In Join Stage Primary records and Secondary records should be sorted when coming( i.e data sorting is mandatory).

M.S - It was same with Merge stage also. That is all the primary and secondary records should be Sorted ( i.e Mandatory)


And Treatment of Unmateched Records will be as follows

J.S - OK for the Primary and Secondary Records if the data is Unmatched records.

M.s - We get Warning message if the Primary records are Unmatched

And it is Ok for the secondary records.

No comments:

Post a Comment