DATA STAGE ONLINE TRAINING: May 2013

Explain a) impact to the customer dim (SCD1), b) impact to the fact tables.

In the customer dim we update the customer1 row, changing it to customer99 (remember that it is SCD1). We do soft delete on the customer2 row by updating the Is Active flag column (hard delete is not recommended). On the fact table we find the Surrogate Key for customer1 and 2 and update it with customer99's SK.

Purpose: This is a common problem that everybody in data warehousing encounters. By asking this question we will know if the candidate has enough experience in data warehousing. If they have not come across this (probably they are new in DW), we want to know if they have the capability to deal with it or not.

Question: What are the differences between Kimball approach and Inmon's? Which one is better and why?

if you are looking for a junior role e.g. a developer, then the expected answer is: in Kimball we do dimension modelling, i.e. fact and dim tables whereas in Inmon's we do CIF, i.e. EDW in normalised form and we then create a DM/DDS from the EDW. Junior candidates usually prefer Kimball, because of query performance and flexibility, or because that's the only one they know; which is fine. But if you are interviewing for a senior role e.g. senior data architect then they need to say that the approach depends on the situation. Both Kimball & Inmon's approaches have advantages and disadvantages.
Purpose: a) to see if the candidate understands the core principles of data warehousing or they just "know the skin", b) to find out if the candidate is open minded, i.e. the solution depends on what we are trying to achieve (there's right or wrong answer) or if they are blindly using Kimball for every situation.

Suppose a fact row has unknown dim keys, do you load that row or not? Can you explain the advantage/disadvantages ?

We need to load that row so that the total of the measure/fact is correct. To enable us to load the row, we need to either set the unknown dim key to 0 or the dim key of the newly created dim rows. We can also not load that row (so the total of the measure will be different from the source system) if the business requirement prefer it. In this case we load the fact row to a quarantine area complete with error processing, DQ indicator and audit log. On the next day, after we receive the dim row, we load the fact row. This is commonly known as Late Arriving Dimension Rows and there are many sources for further information; .

Purpose: again this is a common problem that we encounter in regular basis in data warehousing. With this question we want to see if the candidate's experience level is up to the expectation or not.

What are the advantages of having a normalized DW compared to dimensional DW ? What are the advantages of dimensional DW compared to normalized DW ?

The advantages of dimensional DW are:

flexibility, e.g. we can accommodate changes in the requirements with minimal changes on the data model
performance, e.g. you can query it faster than normalized model,
it's quicker and simpler to develop than normalized DW and easier to maintain.

What is 3rd normal form? Give me an example of a situation where the tables are not in 3 rd NF.

No column is transitively depended on the PK. For example, column1 is dependant on column2 and column2 is dependant on column3. In this case column3 is "transitively dependent" on column1. To make it 3rd NF we need to split it into 2 tables: table1 which has column1 & column2 and table2 which has column2 and column3.

Tell me how to design a data warehouse, i.e. what are the steps of doing dimensional modeling?

There are many ways, but it should not be too far from this order:
1. Understand the business process,
2. Declare the grain of the fact table,
3. Create the dimension tables including attributes,
4. Add the measures to the fact tables (from Kimball's Toolkit book chapter
2). Step 3 and 4 could be reversed (add the fact first, then create the dims), but step 1 & 2 must be done in that order. Understanding the business process must always be the first, and declaring the grain must always be the second.

Monday, 6 May 2013

DATA STAGE ONLINE TRAINING