Datastage for Learners

Tuesday, 21 February 2017

Linux Command

1. Delete last 23 lines from a file in unix.

Use sed, but let the shell do the math, with the goal being to use the d command by giving a range (to remove the last 23 lines):

sed -i "$(($(wc -l < file)-22)),\$d" file

To remove the last 3 lines, from inside out:

$(wc -l < file)

Gives the number of lines of the file: say 2196
We want to remove the last 23 lines, so for left side or range:

$((2196-22))

Gives: 2714 Thus the original sed after shell interpretation is:

sed -i 2174,$d file

With -i doing inplace edit, file is now 2173 lines!

OR

   awk '{buf[NR-1]=$0;}END{ for ( i=0; i < (NR-23); i++){ print buf[i];} }'

Wednesday, 1 February 2017

Through Unix:
'1.telnet to DS server machine as ds administrative account (dsadm or root for example);
2. cd to $DSHOME - to go to DSEngine;
3. source dsenv by “. ./dsenv”; (read as dot_space_dot_dsenv)
4. run “bin/uvsh” to enter TCL prompt
5. run “>LIST.READU EVERY”, are you able to find any lock for the project under Active RecordLocks? If yes, mark down the Userno for the command below;
6 run “>UNLOCK USER userno ALL”;
7. run “>LIST.READU EVERY” again to verify the lock for the user is gone;
8. repeat step 6 and 7 until all locks for the project is clear;
9. run “>LO” to exit TCL prompt.

Tuesday, 24 January 2017

Sql Tuning and Performance Improvement

1. We used bulk load to our stage stable from MV and then rebuild the index. In this process we used to do Analyze table and gather stat.[The optimizer mode must be set to 'cost' not 'rule' to run the analyze command]

Sometimes we have changes for the tables, for example if we have regular ETL process that constantly modifies the database structure or table contents. The statistics that Oracle collects from the last table analyzation might be out of sync with the current data dictionary. And this information is often used by the optimizer when querying. Therefore querys base against the old statistics would run slow and the database performance could be decreased because of the changed statistic not being collected accordingly.
Oracle will collect statistic on the number of rows, the number of empty data blocks, the number of blocks below the high water mark, the average data block free space, the average row length, and the number of chained rows in a table when the Oracle ANALYZE TABLE command is performed.Oracle ANALYZE TABLE can be used to collect statistics on a specific table.

select 'Analyze Table '||table_name||' compute statistics;' from user_tables;

select 'Analyze Index '||index_name||' compute statistics;' from user_indexes;

2. Alter session to set parallel.

A DML statement can be parallelized only if you have explicitly enabled parallel DML in the session or in the SQL statement. To enable this mode in a session, run the following SQL statement:

ALTER SESSION ENABLE PARALLEL DML;

To enable parallel DML mode in a specific SQL statement, include the ENABLE_PARALLEL_DML SQL hint. For example:

INSERT /*+ ENABLE_PARALLEL_DML */ …

This mode is required because parallel DML and serial DML have different locking, transaction, and disk space requirements and parallel DML is disabled for a session by default.

To check the parallel processing status of your session, use the following query:

SELECT username, pq_status, pdml_status, pddl_status
FROM v$session
WHERE sid = sys_context('userenv','sid');

With SQL hints:

SELECT /*+ PARALLEL (16) */ …

The degree specified in the hint overrides the degree defined for the table or set in an ALTER SESSION statement.

If a degree number is not specified, then the degree defined for the table is used unless it is 1 (noparallel). Then the default degree is used.

SELECT /*+ PARALLEL */ …

You can also specify which table to use parallel processing on.

SELECT /*+ PARALLEL( TABLE1, 16 ) */

The hint syntax also works without the comma. In this case, the blank character works as a separator.

SELECT /*+ PARALLEL( TABLE1 16 ) */ …

If you don’t specify a table name, then parallel processing is used to read all the tables in the query.

You can also specify parallel processing for indexes.

SELECT /*+ PARALLEL_INDEX( TABLE1, 16 ) */ …

SELECT /*+ PARALLEL_INDEX( TABLE1 16 ) */ …

In this case, parallel processes will be used to read any indexes on table1, but will not be used to read table1 itself. Including an index name in the hint further restricts parallel processing to just that index.

SELECT /*+ PARALLEL_INDEX( TABLE1, INDEX2, 16 ) */ …

SELECT /*+ PARALLEL_INDEX( TABLE1 INDEX2 16 ) */ …

Hints can be used with query blocks if the SQL statement has sub-queries. For more information on hints, see the Oracle SQL Language Reference, chapter 3 on “Basic Elements of Oracle SQL”, section on comments. In the hint, do not prefix the table or index name with the schema name as this will not work. The SQL statement will run successfully, no error message will be displayed, but the hint will not work and parallel processes will not be used. This is standard hint behavior.

3. Flash bufferring.

We are keeping the reference tables and typecode tables in the Flash Cache since they are the most frequently used tables for our joining. We can also use keep buffer pool but that will store it in Ram instead of Hard Disk.

-- Force EMP to remain in the Smart Flash Cache, space provided.
ALTER TABLE scott.emp STORAGE (FLASH_CACHE KEEP);

-- Reset EMP to default use of Smart Flash Cache.
ALTER TABLE scott.emp STORAGE (FLASH_CACHE);
ALTER TABLE scott.emp STORAGE (FLASH_CACHE DEFAULT);

alter table emp storage (buffer_pool keep)

--> script for displaying flash_cache waits:

SELECT
   name,
   value
FROM
   v$sysstat
WHERE
   name IN
   ('physical read flash cache hits',
   'physical reads',
   'consistent gets',
   'db block gets',
   'flash cache inserts');

--> see flash_cache entries in the data buffers with the v$bh view, where the stats starts with flash:

select
owner||'.'||object_name
from
v$bh, dba_objects
where
v$bh.status like ('flash%');

Factors that we have used to optimize query response:-

collection of table statistics

memory allocation

parallelism

Saturday, 21 January 2017

Complex SQL queries

1. Top N Query
--> Return 5 top and bottom salary holder from a table

select name,sal from (select name,sal from emp order by sal desc ) where rownum <=5;

select name,sal from (select name,sal from emp order by sal ) where rownum <=5;

--> Select Max salaried employee from department

select name,sal,dept from emp e,(select max(sal) max_sal,dept from emp group by dept) d
where e.dept=d.dept and e.sal=d.max_sal;

2. Ranking
--> Return 5 top salary holder from each department of a emp table

select name,sal,dept from (select name,sal,dept, rank() over (partition by dept order by sal desc) as sal_rank from emp) where sal_rank <=5;

3. Window function
--> Provide a summary of the current row’s salary value together with the salary values of all previous rows whose HIRE_DATE value falls within 90 days preceding the HIRE_DATE value of the current row

select last_name, first_name, department_id, hire_date, salary, SUM (salary) OVER (PARTITION BY department_id ORDER BY hire_date RANGE 90 PRECEDING) department_total from employee order by department_id, hire_date;

4. Hierarchical Query
--> How to find out one one employee's subordinate chain?

SELECT emp_id,emp_name,mgr_id from emp
connect by prior emp_id=mgr_id;

--> How to find out which employee at what lever in an Organization?

SELECT emp_id,emp_name,mgr_id,LEVEL from emp

connect by prior emp_id=mgr_id;

--> Display Each employee and their immediate manager in a department.

SELECT emp_name "Employee", CONNECT_BY_ROOT emp_name "Manager"
where LEVEL=2 and dept_id=101
connect by prior emp_id=mgr_id;

--> Display the all the superior name of a employee

SELECT emp_name "Employee", employee_id, CONNECT_BY_ROOT emp_name "Manager",manager_id, LEVEL
      FROM employees
      START WITH employee_id = 100
      CONNECT BY PRIOR employee_id = manager_id;

5. De duplication
--> How to find de duplicated rows based on come keys.

SELECT * from ABC where rowid in (select max(rowid) from ABC group by A1,B1,C1);

6. Join with self tables
--> How do you find all employees which are also manager?

SELECT e.name, m.name FROM Employee e, Employee m WHERE e.mgr_id = m.emp_id;

7. Sub query and joining

-->How to find all the employee who joined last one year?

select *,sysdate - to_date(hiredate,'dateformat') datediff from emp where datediff between 0 and 365;

Tuesday, 25 October 2016

Temp Tablespace

select distinct(a.tablespace_name), b.total_size ,b.used_size - c.free_space used_size,b.space_available + c.free_space SPACE_AVAILABLE
from dba_tables a,
(select sum(MAXBYTES)/1024/1024/1024 total_size, sum(BYTES)/1024/1024/1024 used_size, sum(MAXBYTES)/1024/1024/1024 - sum(BYTES)/1024/1024/1024 space_available, TABLESPACE_NAME
FROM DBA_DATA_FILES
group by TABLESPACE_NAME
) b,
(select
ddf.tablespace_name
,sum(dfs.bytes)/1024/1024/1024 free_space
from dba_data_files ddf, dba_free_space dfs
where ddf.file_id = dfs.file_id
group by ddf.tablespace_name ) c
where
a.tablespace_name=b.tablespace_name
and a.tablespace_name=c.tablespace_name
--and a.tablespace_name like '%SRCSTGP04%' –if tablespace is known then mention here.
order by 4;

Sunday, 11 September 2016

Difference between Informatica and Datastage

I have used both Datastage and Informatica... In my opinion, DataStage is way more powerful and scalable than Informatica. Informatica has more developer-friendly features, but when it comes to scalabality in performance, it is much inferior as compared to datastage.

Here are a few areas where Informatica is inferior -

1. Partitioning - Datastage PX provides many more robust partitioning options than informatica. You can also re-partition the data whichever way you want.

2. Parallelism - Informatica does not support full pipeline parallelism (although it claims).

3. File Lookup - Informatica supports flat file lookup, but the caching is horrible. DataStage supports hash files, lookup filesets, datasets for much more efficient lookup.

4. Merge/Funnel - Datastage has a very rich functionality of merging or funnelling the streams. In Informatica the only way is to do a Union, which by the way is always a Union-all.

Thursday, 1 September 2016

Datastage Interview Question

http://www.datawarehousing-praveen.com/2013/10/datastage-interview-questions-part-1_720.html