Join Stage:
The Join stage is a processing stage. It performs join operations on two or more data sets input to the
stage and then outputs the resulting data set. The Join stage is one of three stages that join tables based
on the values of key columns. The other two are:
The three stages differ mainly in the memory they use, the treatment of rows with unmatched keys, and
their requirements for data being input.
If these take up a large amount of memory relative to the physical RAM memory size of the computer you are running on, then a lookup stage might thrash because the reference datasets might not fit in RAM along with everything else that has to be in RAM. This results in very slow performance since each lookup operation can, and typically
does, cause a page fault and an I/O operation.
So, if the reference datasets are big enough to cause trouble, use a join. A join does a high-speed sort on
the driving and reference datasets. This can involve I/O if the data is big enough, but the I/O is all
highly optimized and sequential. Once the sort is over the join processing is very fast and never involves
paging or other I/O.
In the below sections will show join stage configuration in datastage jobs:
Left Dataset:
Right Dataset or Intermediate Dataset:
Scenario 1)
InnerJoin:Inner transfers records from input data sets whose key columns contain equal values to the output
data set. Records whose key columns do not contain equal values are dropped.
Step1)Configure sequential file stage to read both inputs.
Step2)Configure Join stage
In order to specify join key columns both column names should be same from both the inputs.In case of lookup stage this is not required which is also a difference between join and lookup stages.
Click on properties tab in the stage page and select key columns to perform the join.In our case we are using customer_id as key column .
Join stage expects both inputs need to be partitioned and sorted based on join key columns.Go to the partitioning tab of input stage page and select hash partitioning as shown in below figure.


Go to the Advanced tab in stage page and clear the preserve partition flag to avoid warning while writing into sequential file stage.If you are using dataset then you can use propagate becuase dataset is parallel stage it will keep the exact partitions.

Go to the Mapping tab of the output page and select the required columns in the output

Go to the Link Ordering tab in the stage page and check the input link order.


Compile and run the job:


I am adding new column Customer Name in the right link to explain a scenario where non-key column name should be different from both inputs otherwise will get warning from the datastage job.

Compile and run the job and check the job log.We got following warning in the job.
stg_InnerJoin: When checking operator: Dropping component “CUSTOMER_NAME” because of a prior component with the same name.
Note:Non Key column names should be different from both the inputs to the join stage to avoid the above warning.
2)Left Outer Join

Compile and run the job and check results:
If you observe the output you can see all 3 left rows came out from the join stage.


Join stage doesnot have reject link to capture unmatched rows.Add a transformer stage after join stage and check for unmacted rows and write it to a another link.
when a match does not occur, the Join stage inserts values into the unmatched non-key column(s) using the following rules:
a) If the non-key column is defined as nullable (on the Join input links) then Enterprise Edition
will insert NULL values in the unmatched columns
b) If the non-key column is defined as not-nullable, then Enterprise Edition inserts “default”
values based on the data type. For example, the default value for an Integer is zero, the default
value for a Varchar is an empty string (“”), and the default value for a Char is a string of
padchar characters equal to the length of the Char column.
For this reason, care must be taken to change the column properties to allow NULL values
before the Join. This is most easily done by inserting a Copy stage and mapping a column from
NON-NULLABLE to NULLABLE.
A Transformer stage can be used to test for NULL values in unmatched columns.
In most cases, it is best to use a Column Generator to add an ‘indicator’ column, with a constant value,
to each of the inner links and test that column for the constant after you have performed the join. This
isolates your match/no-match logic from any changes in the metadata. This is also handy with Lookups
that have multiple reference links.
Add a column generator stage and define MATCH_IND column and specify value = ‘Y’.In the transformer stage check If MATCH_IND=’Y’ then matched else unmatched.






Compile and run the job:



The Join stage is a processing stage. It performs join operations on two or more data sets input to the
stage and then outputs the resulting data set. The Join stage is one of three stages that join tables based
on the values of key columns. The other two are:
The three stages differ mainly in the memory they use, the treatment of rows with unmatched keys, and
their requirements for data being input.
If these take up a large amount of memory relative to the physical RAM memory size of the computer you are running on, then a lookup stage might thrash because the reference datasets might not fit in RAM along with everything else that has to be in RAM. This results in very slow performance since each lookup operation can, and typically
does, cause a page fault and an I/O operation.
So, if the reference datasets are big enough to cause trouble, use a join. A join does a high-speed sort on
the driving and reference datasets. This can involve I/O if the data is big enough, but the I/O is all
highly optimized and sequential. Once the sort is over the join processing is very fast and never involves
paging or other I/O.
In the below sections will show join stage configuration in datastage jobs:
Left Dataset:
CUSTOMER_ID | CUSTOMER_NAME |
1 | UMA |
2 | POOJI |
3 | KATRU |
CUSTOMER_ID | CITY | ZIP_CODE |
1 | CYPRESS | 90630 |
2 | CYPRESS | 90630 |
InnerJoin:Inner transfers records from input data sets whose key columns contain equal values to the output
data set. Records whose key columns do not contain equal values are dropped.
Step1)Configure sequential file stage to read both inputs.
Step2)Configure Join stage
In order to specify join key columns both column names should be same from both the inputs.In case of lookup stage this is not required which is also a difference between join and lookup stages.
Click on properties tab in the stage page and select key columns to perform the join.In our case we are using customer_id as key column .
Join stage expects both inputs need to be partitioned and sorted based on join key columns.Go to the partitioning tab of input stage page and select hash partitioning as shown in below figure.


Go to the Advanced tab in stage page and clear the preserve partition flag to avoid warning while writing into sequential file stage.If you are using dataset then you can use propagate becuase dataset is parallel stage it will keep the exact partitions.

Go to the Mapping tab of the output page and select the required columns in the output

Go to the Link Ordering tab in the stage page and check the input link order.


Compile and run the job:


I am adding new column Customer Name in the right link to explain a scenario where non-key column name should be different from both inputs otherwise will get warning from the datastage job.

Compile and run the job and check the job log.We got following warning in the job.
stg_InnerJoin: When checking operator: Dropping component “CUSTOMER_NAME” because of a prior component with the same name.
Note:Non Key column names should be different from both the inputs to the join stage to avoid the above warning.
2)Left Outer Join

Compile and run the job and check results:
If you observe the output you can see all 3 left rows came out from the join stage.


Join stage doesnot have reject link to capture unmatched rows.Add a transformer stage after join stage and check for unmacted rows and write it to a another link.
when a match does not occur, the Join stage inserts values into the unmatched non-key column(s) using the following rules:
a) If the non-key column is defined as nullable (on the Join input links) then Enterprise Edition
will insert NULL values in the unmatched columns
b) If the non-key column is defined as not-nullable, then Enterprise Edition inserts “default”
values based on the data type. For example, the default value for an Integer is zero, the default
value for a Varchar is an empty string (“”), and the default value for a Char is a string of
padchar characters equal to the length of the Char column.
For this reason, care must be taken to change the column properties to allow NULL values
before the Join. This is most easily done by inserting a Copy stage and mapping a column from
NON-NULLABLE to NULLABLE.
A Transformer stage can be used to test for NULL values in unmatched columns.
In most cases, it is best to use a Column Generator to add an ‘indicator’ column, with a constant value,
to each of the inner links and test that column for the constant after you have performed the join. This
isolates your match/no-match logic from any changes in the metadata. This is also handy with Lookups
that have multiple reference links.
Add a column generator stage and define MATCH_IND column and specify value = ‘Y’.In the transformer stage check If MATCH_IND=’Y’ then matched else unmatched.






Compile and run the job:



Get to know the facts of Join Stage in Datastage
The "Join" stage is a processing stage that performs a join operation on two
or more input information sets and afterward gives output as one resultant data set.
There are Four sorts of joins in DataStage:
1. Left Outer Join: It joins two or more tables on JOIN
condition where result table preserves record from first (Left)
table and puts NULL wherever unmatched record from second (Right) table.
2. Right Outer Join: It is precisely inverse to the Left external
join. It joins two tables on JOIN condition where result table preserves record from first (Right) table and puts
NULL wherever unmatched record from second (Right) table.
3. Inner Join: It joins two or more tables and returns just
those records that fulfill
join condition.
4. Full Outer Join: It joins two or more tables and returns both
coordinated and unmatched records from all tables.
Design your job like below.
Step 2 : Configure Table 1 and Table 2 in Sequential record 1 and
Sequential document 2.
Step 3 : While designing section name, verify that you ought to give
right Datatype against every segment overall output won't be appropriately sort. Give
"Number" Datatype for "ID" where we can utilize
"varchar" too for it.
Step 4 : Double click on the JOIN stage. Under Properties tab notice
'Key = ID'. Beneath this
tab you can notice Join Type as Inner, Left outer or Right Outer. Now, we should Practice for Left outer Join first.
Step 5 : Beneath
'link order' tab you can
say or mention Left and
right link on that Type of join is depends. i.e.Left or Right
external. Order of link doesn't
make a difference for Inner Join.
Step 6 : Map the obliged section under output stage.
Step 7 : Link order is depends on key section need to mapped. Those are not
mapped will be shown by Red shading or it will indicate lapse while
accumulating.
Difference between Join Stage and Look Up Stage in Datastage
Join stage and Look Up stage have some different input requirements.
Based on the Requirement we use these Stages which is good for the performance.
We need to see weather we get good performance by using any stage in datastage.
And the stage supports the required inputs are not.
Lets say Join Stage= J.S And Look Up = M.S
J.S - The input names of the Join Stage are Left tables , Right Tables and Intermediate Tables. That means we call the left one as a Left table and right one as a Right table and remaining tables between these tables are call it as Intermediate tables. ( That can be any number of tables in between )
L.S - The input names of the Look Up stage are Primary Tables and Reference Tables. That means First table will be considered as a Master tables and remaining any number of tables are considered as a Update tables.
J.S - We can perform four types of Joins in Join Stage. That means it supports all the four types of Joins. They are
Inner Join
Left Outer Join
Right Outer Join
Full Outer Join
L.S - We can perform only two types of Joins in Look Stage. That means it supports two types of Joins here. And they are
Inner Join
And Left Outer Join
J.S - The Input requirements of Join stage are
There will be a N-Inputs ( In the case of Left, Inner, Right Outer Joins)
There will be a 2 Inputs ( In the case of Full Outer Join)
And there will be a 1 Output link and
there will be no reject links in Join stage.
L.S - The input requirements of Look Up Stage are as follows
There will be a N-Inputs ( In the Case of Normal Stage)
2 Inputs (In the Case of Sparse Look Up )
1 Output
And 1 Reject Link.
J.S - And Coming to Memory type. This is light memory Usage
L.S - It is a Heavy Memory Usage
J.S - Key Column Names should be Same. That is Primary record should be same with Secondary Records
L.S - Key column names Optional.
It should be same in the case of Sparse Look U
.
The Inner Join Type are as follows
J.S - Primary Records Should match with all secondary
i.e
J.S - Primary Records should match with all secondary.
The Input requirements with respect to Sorting are as follows.
J.S - In Join Stage Primary records and Secondary records should be sorted when coming( i.e data sorting is mandatory).
L.S - In Look Up stage it is Optional. That is all the primary and secondary records no need to be sorted.
And Treatment of Unmateched Records will be as follows
J.S - OK for the Primary and Secondary Records if the data is Unmatched records.
L.s - Ok for the Primary and we get warning if secondary records are unmatched.
Based on the Requirement we use these Stages which is good for the performance.
We need to see weather we get good performance by using any stage in datastage.
And the stage supports the required inputs are not.
Lets say Join Stage= J.S And Look Up = M.S
J.S - The input names of the Join Stage are Left tables , Right Tables and Intermediate Tables. That means we call the left one as a Left table and right one as a Right table and remaining tables between these tables are call it as Intermediate tables. ( That can be any number of tables in between )
L.S - The input names of the Look Up stage are Primary Tables and Reference Tables. That means First table will be considered as a Master tables and remaining any number of tables are considered as a Update tables.
J.S - We can perform four types of Joins in Join Stage. That means it supports all the four types of Joins. They are
Inner Join
Left Outer Join
Right Outer Join
Full Outer Join
L.S - We can perform only two types of Joins in Look Stage. That means it supports two types of Joins here. And they are
Inner Join
And Left Outer Join
J.S - The Input requirements of Join stage are
There will be a N-Inputs ( In the case of Left, Inner, Right Outer Joins)
There will be a 2 Inputs ( In the case of Full Outer Join)
And there will be a 1 Output link and
there will be no reject links in Join stage.
L.S - The input requirements of Look Up Stage are as follows
There will be a N-Inputs ( In the Case of Normal Stage)
2 Inputs (In the Case of Sparse Look Up )
1 Output
And 1 Reject Link.
J.S - And Coming to Memory type. This is light memory Usage
L.S - It is a Heavy Memory Usage
J.S - Key Column Names should be Same. That is Primary record should be same with Secondary Records
L.S - Key column names Optional.
It should be same in the case of Sparse Look U
.
The Inner Join Type are as follows
J.S - Primary Records Should match with all secondary
i.e
J.S - Primary Records should match with all secondary.
The Input requirements with respect to Sorting are as follows.
J.S - In Join Stage Primary records and Secondary records should be sorted when coming( i.e data sorting is mandatory).
L.S - In Look Up stage it is Optional. That is all the primary and secondary records no need to be sorted.
And Treatment of Unmateched Records will be as follows
J.S - OK for the Primary and Secondary Records if the data is Unmatched records.
L.s - Ok for the Primary and we get warning if secondary records are unmatched.
What To Choose Join Stage or Lookup stage in Datastage
We need to be careful when selecting the stages. We need to think about the
performance of the Job before selecting the stages. Time is more precious to the
clients. That's why we need to get the Job for very less time. We need to try
our best to get good performance to the Job.
Both the stages Join stage and Look up stage performs same thing. That is they
combine the tables we have. But why Lookup stage has been introduced.
Look Up Stage have some extra benefits which will not come with the Join stage.
Look up stage does not required the data to be sorted. Sorting is mandatory with
the Join stage. In Look Up stage the columns with different column names can be
joined as well where it is not possible in the Join stage. That means Join stage,
the column name must be similar.
A Look Up Stage supports reject links , if our required demands reject links we
can’t go with Join stage. Because Join stage doesn’t supports Reject Links. And
Lookup stage has an option to fail the Job if the look up fails. It will be useful
when the look up stage is expected to be successful.
text-align: left;">
Look up stage keeps the reference data into the memory which yields better
performance for smaller volume of data. If you have large amount of data, you
need to go with Join stage.
performance of the Job before selecting the stages. Time is more precious to the
clients. That's why we need to get the Job for very less time. We need to try
our best to get good performance to the Job.
Both the stages Join stage and Look up stage performs same thing. That is they
combine the tables we have. But why Lookup stage has been introduced.
Look Up Stage have some extra benefits which will not come with the Join stage.
Look up stage does not required the data to be sorted. Sorting is mandatory with
the Join stage. In Look Up stage the columns with different column names can be
joined as well where it is not possible in the Join stage. That means Join stage,
the column name must be similar.
A Look Up Stage supports reject links , if our required demands reject links we
can’t go with Join stage. Because Join stage doesn’t supports Reject Links. And
Lookup stage has an option to fail the Job if the look up fails. It will be useful
when the look up stage is expected to be successful.
text-align: left;">
Look up stage keeps the reference data into the memory which yields better
performance for smaller volume of data. If you have large amount of data, you
need to go with Join stage.
Hi. I am new to Datastage. I am using join stage to join two tables (Which have columns with same name CLient_ID and POstcode) from DRS connector. I am using full outer join because I need columns from both tables, but it's giving me only ClientD from both tables not the Postcode. Could you please suggest why this is happening when I am using full outer join
ReplyDelete