Saturday, 21 September 2013

DATASTAGE INTERVIEW QUESTIONS

  Suppose in the source system two customers, customer1 and customer2, are merged into a single customer99. Explain a) the impact to the customer dim (SCD1), b) the impact to the fact tables.
  • In the customer dim we update the customer1 row, changing it to customer99 (remember that it is SCD1). We soft-delete the customer2 row by updating the Is Active flag column (a hard delete is not recommended). On the fact table we find the surrogate keys for customer1 and customer2 and update them with customer99's SK. A SQL sketch follows this question.
  • Purpose: this is a common problem that everybody in data warehousing encounters. By asking this question we will know whether the candidate has enough experience in data warehousing. If they have not come across it (they are probably new to DW), we want to know whether they have the capability to deal with it.
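A minimal sketch of this merge in SQL, assuming a hypothetical customer_dim (customer_sk, customer_id, customer_name, is_active) and sales_fact (customer_sk plus measures); all table and column names are illustrative only:

    -- SCD1 overwrite: customer1's dim row becomes customer99 in place,
    -- so it keeps its existing surrogate key.
    UPDATE customer_dim
    SET    customer_id   = 'customer99',
           customer_name = 'Customer 99'
    WHERE  customer_id = 'customer1';

    -- Soft delete: flag the customer2 row inactive rather than deleting it.
    UPDATE customer_dim
    SET    is_active = 'N'
    WHERE  customer_id = 'customer2';

    -- Fact rows that carried customer1's SK already point at the row that
    -- is now customer99; only customer2's fact rows need repointing.
    UPDATE sales_fact
    SET    customer_sk = (SELECT customer_sk FROM customer_dim
                          WHERE  customer_id = 'customer99')
    WHERE  customer_sk = (SELECT customer_sk FROM customer_dim
                          WHERE  customer_id = 'customer2');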
  What are the differences between the Kimball approach and Inmon's? Which one is better, and why?
  • If you are looking for a junior role, e.g. a developer, then the expected answer is: in Kimball we do dimensional modelling, i.e. fact and dim tables, whereas in Inmon's we do CIF, i.e. an EDW in normalised form, from which we then create a DM/DDS. Junior candidates usually prefer Kimball, because of query performance and flexibility, or because that's the only one they know, which is fine. But if you are interviewing for a senior role, e.g. a senior data architect, then they need to say that the approach depends on the situation. Both Kimball's and Inmon's approaches have advantages and disadvantages.
  • Purpose: a) to see if the candidate understands the core principles of data warehousing or just "knows the skin", b) to find out if the candidate is open-minded, i.e. the solution depends on what we are trying to achieve (there's no right or wrong answer), or if they blindly use Kimball for every situation.
  Suppose a fact row has unknown dim keys. Do you load that row or not? Can you explain the advantages/disadvantages of each option?
  • We need to load that row so that the total of the measure/fact is correct. To enable us to load the row, we set the unknown dim key either to 0 or to the dim key of a newly created dim row. We can also choose not to load that row (so the total of the measure will differ from the source system) if the business requirements prefer it. In this case we load the fact row into a quarantine area, complete with error processing, a DQ indicator and an audit log. The next day, after we receive the dim row, we load the fact row. This is commonly known as the Late Arriving Dimension Rows problem, and there are many sources for further information. A sketch of the first option follows this question.
  • Purpose: again, this is a common problem that we encounter on a regular basis in data warehousing. With this question we want to see whether the candidate's experience level is up to expectation.
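A minimal sketch of the first option (load with an unknown-member key), assuming hypothetical product_dim, sales_fact and staging_sales tables; all names are illustrative:

    -- One-off seed row: surrogate key 0 represents "unknown" in the dim.
    INSERT INTO product_dim (product_sk, product_code, product_name)
    VALUES (0, 'UNKNOWN', 'Unknown product');

    -- Load the fact; fall back to SK 0 when the dim lookup finds no match,
    -- so the measure total still reconciles with the source system.
    INSERT INTO sales_fact (date_sk, product_sk, sales_amount)
    SELECT s.date_sk,
           COALESCE(d.product_sk, 0),
           s.sales_amount
    FROM   staging_sales s
           LEFT JOIN product_dim d
                  ON d.product_code = s.product_code;

The quarantine alternative simply redirects the unmatched rows (those WHERE d.product_sk IS NULL) into an error table and reloads them the next day, once the dim row has arrived.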
What are the advantages of having a normalized DW compared to a dimensional DW? What are the advantages of a dimensional DW compared to a normalized DW?
The main advantages of a normalized DW are minimal data redundancy and a single integrated model of the enterprise, which makes it easier to keep consistent as the source systems change.
The advantages of a dimensional DW are:
a) flexibility, e.g. we can accommodate changes in the requirements with minimal changes to the data model,
b) performance, e.g. you can query it faster than a normalized model,
c) it's quicker and simpler to develop than a normalized DW, and easier to maintain (a query sketch follows).
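To make the performance and simplicity points concrete, here is the kind of query a star schema needs for a typical summary; the table and column names are hypothetical:

    -- Star schema: one pass over the fact table, one join per dimension.
    SELECT d.calendar_year,
           p.product_name,
           SUM(f.sales_amount) AS total_sales
    FROM   sales_fact f
           JOIN date_dim    d ON d.date_sk    = f.date_sk
           JOIN product_dim p ON p.product_sk = f.product_sk
    GROUP  BY d.calendar_year, p.product_name;

The same question against a normalized model typically chains through several more tables (e.g. order to order_line to product to category), which is harder to write and slower to run.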
  What is 3rd normal form? Give me an example of a situation where the tables are not in 3rd NF.
No non-key column is transitively dependent on the PK. For example, column2 depends on column1 (the PK) and column3 depends on column2; in this case column3 is "transitively dependent" on column1. To bring the table into 3rd NF we need to split it into two tables: table1, which has column1 and column2, and table2, which has column2 and column3.
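For example (using a hypothetical employee/department pair in place of the generic column1..column3):

    -- Not in 3rd NF: dept_name depends on dept_id, which depends on
    -- emp_id (the PK), so dept_name is transitively dependent on the PK.
    CREATE TABLE employee_not_3nf (
        emp_id    INT PRIMARY KEY,
        dept_id   INT,
        dept_name VARCHAR(50)
    );

    -- In 3rd NF: the transitive dependency is split into its own table.
    CREATE TABLE department (
        dept_id   INT PRIMARY KEY,
        dept_name VARCHAR(50)
    );

    CREATE TABLE employee (
        emp_id  INT PRIMARY KEY,
        dept_id INT REFERENCES department (dept_id)
    );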
   Tell me how to design a data warehouse, i.e. what are the steps of doing dimensional modeling?
There are many ways, but it should not be too far from this order:
1. Understand the business process,
2. Declare the grain of the fact table,
3. Create the dimension tables including attributes,
4. Add the measures to the fact tables (from Kimball's Data Warehouse Toolkit book, chapter 2).
Steps 3 and 4 could be reversed (add the facts first, then create the dims), but steps 1 and 2 must be done in that order: understanding the business process must always come first, and declaring the grain must always come second (a minimal example follows).
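A minimal star schema illustrating the four steps, for a hypothetical retail sales process (step 1) with a grain of one row per product per day (step 2); all names are illustrative:

    -- Step 3: dimension tables, including their attributes.
    CREATE TABLE date_dim (
        date_sk       INT PRIMARY KEY,
        calendar_date DATE,
        calendar_year INT
    );

    CREATE TABLE product_dim (
        product_sk   INT PRIMARY KEY,
        product_code VARCHAR(20),
        product_name VARCHAR(100)
    );

    -- Step 4: measures on the fact table, at the declared grain.
    CREATE TABLE sales_fact (
        date_sk      INT REFERENCES date_dim (date_sk),
        product_sk   INT REFERENCES product_dim (product_sk),
        quantity     INT,
        sales_amount DECIMAL(12,2)
    );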

Wednesday, 31 July 2013

Datastage Course

 Datastage Course Curriculum
 
    Overview of Data Stage
  • Introduction to Data Warehousing
  • What is Data Warehousing?
  • Who needs Data Warehousing?
  • Why is a Data Warehouse required?
  • Types of Systems (OLTP, DSS, OLAP)
  • Maintenance of Data Warehouse
  • Data Warehousing Life Cycle
    Data Warehousing Architecture
  • Source
  • Integration Layer
  • Staging Area
  • Target
  • Analysis & Reporting
    IBM WebSphere Data Stage Version 8.1 Contents
  • Introduction about Data Stage
  • IBM Information Server architecture
  • Datastage within the IBM Information Server architecture
  • Difference between Server Jobs and Parallel Jobs
  • Difference between Pipeline and Partition Parallelism
  • Partition techniques (Round Robin, Random, Hash, Entire, Same, Modulus, Range, DB2, Auto)
  • Configuration file
  • Difference between SMP and MPP Architecture
  • Data stage components (Server / Client components)
  • Package installer
    Using Controls
  • Layouts
  • Phase and Checkpoint
  • Project and Sandbox
  • Sub graphs
  • Deadlocks
  • Continuous Flows
  • Interpreting the Log File
  • Monitoring the Graph
  • Wrapper Scripts
  • Brief intro about EME
  • Brief intro about Shop For Data (SFD)
  • Brief intro about Conduct>It
  • Brief intro about Data Profiler

Saturday, 22 June 2013

DATASTAGE INTRODUCTION

  DATA STAGE
Data Stage is a powerful Extraction, Transformation, and Loading (ETL) tool. Data Stage has the following features to aid design and processing:
  • Uses graphical design tools. With simple point-and-click techniques you can draw a scheme to represent your processing requirements
  • Extracts data from any number or type of database
  • Handles all the metadata definitions required to define your data warehouse or migration. You can view and modify the table definitions at any point during the design of your application
  • Aggregates data. You can modify the SQL SELECT statements used to extract data (see the example after this list)
  • Transforms data. Data Stage has a set of predefined transforms and functions you can use to convert your data. You can easily extend the functionality by defining your own transforms to use.
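For instance, the generated SELECT in a source stage can be edited to push the aggregation into the database instead of aggregating inside the job; the table and column names here are hypothetical:

    -- Aggregate at the source rather than extracting detail rows.
    SELECT account_id,
           SUM(amount) AS total_amount,
           COUNT(*)    AS txn_count
    FROM   transactions
    GROUP  BY account_id;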
IBM WebSphere Data Stage TX delivers the ability to easily and seamlessly automate high-volume, complex transactions without the need for additional coding, resulting in a quick return on investment.
IBM WebSphere Data Stage TX 6.7 delivers rapid ROI through a highly scalable, open architecture.

IBM WebSphere Data Stage TX allows you to:
  • Quickly and seamlessly integrate many of the most popular applications, databases, and messaging systems
  • Respond quickly and with flexibility to rapidly evolving, strategic business requirements
  • Leverage the value of your enterprise systems and applications
  • Meet your project-based or full integration solution needs

IBM WebSphere Data Stage Features & Benefits
IBM WebSphere Data Stage provides these unique capabilities:

The Most Powerful ETL Solution
Supports the collection, integration and transformation of high volumes of data, with data structures ranging from simple to highly complex. Data Stage manages data arriving in real-time as well as data received daily, weekly or monthly.

The Most Scalable Platform
Enables companies to solve large-scale business problems through high-performance processing of massive data volumes. By leveraging the parallel processing capabilities of multiprocessor hardware platforms, DataStage Enterprise Edition can scale to satisfy the demands of ever-growing data volumes, stringent real-time requirements and ever-shrinking batch windows.

The Most Comprehensive Source and Target Support
Supports a virtually unlimited number of heterogeneous data sources and targets in a single job, including: text files; complex data structures in XML; ERP systems such as SAP and PeopleSoft; almost any database (including partitioned databases); web services; and SAS.

Real-time Data Integration Support
Operates in real-time, capturing messages from Message-Oriented Middleware (MOM) queues using JMS or WebSphere MQ adapters to seamlessly combine data into conforming operational and historical analysis perspectives. IBM WebSphere RTI Services is service-oriented architecture (SOA) enabling middleware that brokers the benefits of the Ascential Enterprise Integration Suite enterprise-wide, across a continuum of time constraints, application suites, interface protocols and integration technologies.

Advanced Maintenance and Development
Gives developers maximum speed, flexibility and effectiveness in building, deploying, updating and managing their data integration infrastructure. Full data integration reduces the development and maintenance cycle for data integration projects by simplifying administration and maximizing development resources.
