Sunday, 21 June 2015

Join Stage

Join Stage:
The Join stage is a processing stage. It performs join operations on two or more data sets input to the
stage and then outputs the resulting data set. The Join stage is one of three stages that join tables based
on the values of key columns. The other two are:

The three stages differ mainly in the memory they use, the treatment of rows with unmatched keys, and
their requirements for data being input.
If these take up a large amount of memory relative to the physical RAM memory size of the computer you are running on, then a lookup stage might thrash because the reference datasets might not fit in RAM along with everything else that has to be in RAM. This results in very slow performance since each lookup operation can, and typically
does, cause a page fault and an I/O operation.
So, if the reference datasets are big enough to cause trouble, use a join. A join does a high-speed sort on
the driving and reference datasets. This can involve I/O if the data is big enough, but the I/O is all
highly optimized and sequential. Once the sort is over the join processing is very fast and never involves
paging or other I/O.
In the below sections will show join stage configuration in datastage jobs:
Left Dataset:
CUSTOMER_ID CUSTOMER_NAME
1 UMA
2 POOJI
3 KATRU
Right Dataset or Intermediate Dataset:
CUSTOMER_ID CITY ZIP_CODE
1 CYPRESS 90630
2 CYPRESS 90630
Scenario 1)
InnerJoin:Inner transfers records from input data sets whose key columns contain equal values to the output
data set. Records whose key columns do not contain equal values are dropped.
Step1)Configure sequential file stage to read both inputs.

LeftInput

Right_Intermediate_Input
Step2)Configure Join stage
In order to specify join key columns both column names should be same from both the inputs.In case of lookup stage this is not required which is also a difference between join and lookup stages.
Click on properties tab in the stage page and select key columns to perform the join.In our case we are using customer_id as key column .

stg_Join_properties
Join stage expects both inputs need to be partitioned and sorted based on join key columns.Go to the partitioning tab of input stage page and select hash partitioning as shown in below figure.

Go to the Advanced tab in stage page and clear the preserve partition flag to avoid warning while writing into sequential file stage.If you are using dataset then you can use propagate becuase dataset is parallel stage it will keep the exact partitions.

Go to the Mapping tab of the output page and select the required columns in the output

Go to the Link Ordering tab in the stage page and check the input link order.

Compile and run the job:

I am adding new column Customer Name in the right link to explain a scenario where non-key column name should be different from both inputs otherwise will get warning from the datastage job.

Compile and run the job and check the job log.We got following warning in the job.
stg_InnerJoin: When checking operator: Dropping component “CUSTOMER_NAME” because of a prior component with the same name.
Note:Non Key column names should be different from both the inputs to the join stage to avoid the above warning.
2)Left Outer Join

Compile and run the job and check results:
If you observe the output you can see all 3 left rows came out from the join stage.

Join stage doesnot have reject link to capture unmatched rows.Add a transformer stage after join stage and check for unmacted rows and write it to a another link.
when a match does not occur, the Join stage inserts values into the unmatched non-key column(s) using the following rules:
a) If the non-key column is defined as nullable (on the Join input links) then Enterprise Edition
will insert NULL values in the unmatched columns
b) If the non-key column is defined as not-nullable, then Enterprise Edition inserts “default”
values based on the data type. For example, the default value for an Integer is zero, the default
value for a Varchar is an empty string (“”), and the default value for a Char is a string of
padchar characters equal to the length of the Char column.
For this reason, care must be taken to change the column properties to allow NULL values
before the Join. This is most easily done by inserting a Copy stage and mapping a column from
NON-NULLABLE to NULLABLE.
A Transformer stage can be used to test for NULL values in unmatched columns.
In most cases, it is best to use a Column Generator to add an ‘indicator’ column, with a constant value,
to each of the inner links and test that column for the constant after you have performed the join. This
isolates your match/no-match logic from any changes in the metadata. This is also handy with Lookups
that have multiple reference links.
Add a column generator stage and define MATCH_IND column and specify value = ‘Y’.In the transformer stage check If MATCH_IND=’Y’ then matched else unmatched.

Compile and run the job:

 

 

Get to know the facts of Join Stage in Datastage

The "Join" stage is a processing stage that performs a join operation on two or more  input information sets and afterward gives output as one resultant data set.
There are Four sorts of joins in DataStage:
1. Left Outer Join: It joins two or more tables on JOIN condition where result table preserves record from first (Left) table and puts NULL wherever unmatched record from second (Right) table.
2. Right Outer Join: It is precisely inverse to the Left external join. It joins two tables on JOIN condition where result table preserves record from first (Right) table and puts NULL wherever unmatched record from second (Right) table.
3. Inner Join: It joins two or more tables and returns just those records that fulfill join condition.
4. Full Outer Join: It joins two or more tables and returns both coordinated and unmatched records from all tables. 
 Design your job like below.
Step 2 : Configure Table 1 and Table 2 in Sequential record 1 and Sequential document 2.
Step 3 : While designing section name, verify that you ought to give right Datatype against every segment overall output won't be appropriately sort. Give "Number" Datatype for "ID" where we can utilize "varchar" too for it.
Step 4 : Double click on the JOIN stage. Under Properties tab notice 'Key = ID'. Beneath this tab you can notice Join Type as Inner, Left outer or Right Outer. Now, we should Practice for Left outer Join first.
Step 5 : Beneath 'link order' tab you can say or mention Left and right link on that Type of join is depends. i.e.Left or Right external. Order of link doesn't make a difference for Inner Join.
Step 6 : Map the obliged section under output stage.
Step 7 : Link order is depends on key section need to mapped. Those are not mapped will be shown by Red shading or it will indicate lapse while accumulating.

 

Difference between Join Stage and Look Up Stage in Datastage 

Join stage and Look Up stage have some different input requirements.

Based on the Requirement we use these Stages which is good for the performance.

We need to see weather we get good performance by using any stage in datastage.

And the stage supports the required inputs are not.


Lets say Join Stage= J.S And Look Up = M.S


J.S - The input names of the Join Stage are Left tables , Right Tables and Intermediate Tables. That means we call the left one as a Left table and right one as a Right table and remaining tables between these tables are call it as Intermediate tables. ( That can be any number of tables in between )


L.S - The input names of the Look Up stage are Primary Tables and Reference Tables. That means First table will be considered as a Master tables and remaining any number of tables are considered as a Update tables.

J.S - We can perform four types of Joins in Join Stage. That means it supports all the four types of Joins. They are
Inner Join
Left Outer Join
Right Outer Join
Full Outer Join

L.S - We can perform only two types of Joins in Look Stage. That means it supports two types of Joins here. And they are
Inner Join
And Left Outer Join

J.S - The Input requirements of Join stage are

There will be a N-Inputs ( In the case of Left, Inner, Right Outer Joins)
There will be a 2 Inputs ( In the case of Full Outer Join)

And there will be a 1 Output link and
there will be no reject links in Join stage.
L.S - The input requirements of Look Up Stage are as follows

There will be a N-Inputs ( In the Case of Normal Stage)
2 Inputs (In the Case of Sparse Look Up )
1 Output
And 1 Reject Link.

J.S - And Coming to Memory type. This is light memory Usage

L.S - It is a Heavy Memory Usage


J.S - Key Column Names should be Same. That is Primary record should be same with Secondary Records

L.S - Key column names Optional.
It should be same in the case of Sparse Look U
.

The Inner Join Type are as follows

J.S - Primary Records Should match with all secondary

i.e

J.S - Primary Records should match with all secondary.

The Input requirements with respect to Sorting are as follows.

J.S - In Join Stage Primary records and Secondary records should be sorted when coming( i.e data sorting is mandatory).

L.S - In Look Up stage it is Optional. That is all the primary and secondary records no need to be sorted.


And Treatment of Unmateched Records will be as follows

J.S - OK for the Primary and Secondary Records if the data is Unmatched records.

L.s - Ok for the Primary and we get warning if secondary records are unmatched.

What To Choose Join Stage or Lookup stage in Datastage

We need to be careful when selecting the stages. We need to think about the

performance of the Job before selecting the stages. Time is more precious to the

clients. That's why we need to get the Job for very less time. We need to try

our best to get good performance to the Job.

Both the stages Join stage and Look up stage performs same thing. That is they

combine the tables we have. But why Lookup stage has been introduced.

Look Up Stage have some extra benefits which will not come with the Join stage.

Look up stage does not required the data to be sorted. Sorting is mandatory with

the Join stage. In Look Up stage the columns with different column names can be

joined as well where it is not possible in the Join stage. That means Join stage,

the column name must be similar.

A Look Up Stage supports reject links , if our required demands reject links we

can’t go with Join stage. Because Join stage doesn’t supports Reject Links. And

Lookup stage has an option to fail the Job if the look up fails. It will be useful

when the look up stage is expected to be successful.
text-align: left;">
Look up stage keeps the reference data into the memory which yields better

performance for smaller volume of data. If you have large amount of data, you

need to go with Join stage.

1 comment:

  1. Hi. I am new to Datastage. I am using join stage to join two tables (Which have columns with same name CLient_ID and POstcode) from DRS connector. I am using full outer join because I need columns from both tables, but it's giving me only ClientD from both tables not the Postcode. Could you please suggest why this is happening when I am using full outer join

    ReplyDelete