in black and white
Main menu
Share a book About us Home
Biology Business Chemistry Computers Culture Economics Fiction Games Guide History Management Mathematical Medicine Mental Fitnes Physics Psychology Scince Sport Technics

The data worehouse ETL Toolkit - Kimpball R.

Kimpball R., Caserta J. The data worehouse ETL Toolkit - John Wiley & Sons, 2005. - 526 p.
SBN: 0-764-57923-1
Download (direct link): thedatawarehouse2005.pdf
Previous << 1 .. 135 136 137 138 139 140 < 141 > 142 143 144 145 146 147 .. 233 >> Next

Not all data marts contain bridge tables, but when they do, the tables must be loaded immediately after the dimensions but before the fact table load starts. If a fact table depends on a bridge table, the bridge table load must complete successfully before the fact table load can be executed. If you attempt to load the fact table with the bridge table partially loaded, groups will be missing from the table, and data from the fact table will become suppressed when it is joined to the bridge table.
Information on loading bridge tables can be found in Chapter 5; techniques for using a bridge table to find the groups while loading facts are found in Chapter 6.
Fact Tables
Fact tables are dependent on virtually all other tables in the dimensional data model and are usually loaded last. Once the subdimensions, dimensions, and bridge tables are loaded, the fact table has all of the look-ups it needs and is ready to be loaded. Remember, RI is enforced here, so you must ensure that every foreign key in the fact table has an associated primary key in its relative dimension or bridge table.
Fact tables typically take longest of all the different types of tables in the data warehouse to load; you should begin the fact table load process as soon as all of its related tables are loaded. Do not wait for all of the dimensions in the data warehouse to load before kicking off the fact load. Only the dimensions and bridge tables directly related to the fact table need to complete before the associated fact table load can begin.
Because of the extreme volume of data usually stored in fact tables, it's a good idea to process their loads in parallel. The scheduler should spawn the ETL process into multiple threads that can run concurrently and take advantage of parallel processing. The next chapter discusses more about optimizing your fact table loads.
286 Chapter 7
Hierarchy Mapping Tables
Hierarchy mapping tables are specially designed to traverse a hierarchy that lives within a dimension. See Chapter 5. Hierarchy mapping tables are not dependent on facts or bridge tables (unless, of course, the fact table itself contains the hierarchy). Technically, hierarchy tables can be loaded immediately following their relative dimension load, but we recommend loading them at the end of the data-mart process to enable the long-running fact table loads to begin, and finish, sooner.
Regardless of where the hierarchy is physically placed in a batch, its success or failure should have no bearing on the other processes in the batch. Don't kill the launch of a fact table process because of a failure in a hierarchy mapping table. The mapping table can be restarted independently of any fact table load.
The Effect of Aggregates and Group Bys on Performance
Aggregate functions and the Group By clause require databases to utilize a tremendous amount of temp space. Temp space is a special area managed by the DBMS to store working tables required to resolve certain queries that involve sorting. Most DBMSs attempt to perform all sorting in memory and then continue the process by writing the data to the temp space after the allocated memory is full. If you attempt to build aggregates for the data warehouse with SQL, you have a few issues to address.
SQL is processed on the server where it is executed. That means that if you attempt to aggregate data in your extract query, you will likely blowout the allocated temp space in the source transaction system. By design, transaction systems keep their temp space very small compared to the space allocated on data warehouses. When you need to build aggregate tables, it's good practice to utilize the ETL engine or a third-party tool specifically dedicated to sorting data at lightning-fast speeds.
You should adjust your aggregates incrementally with a dedicated tool that supports incremental updates to aggregates.
Do not attempt to execute aggregating SQL with a Group By clause in your data extraction query. The Group By clause creates an implicit sort on all of the columns in the clause. Transaction systems are typically not configured to handle large sort routines, and that type of query can crash the source database. Extract the necessary atomic-level data and aggregate later in the ETL pipeline utilizing the ETL engine or a dedicated sort program.
Development 287
Performance Impact of Using Scalar Functions
Scalar functions return a single value as output for a single input value. Scalar functions usually have one or more parameters. As a rule, functions add overhead to query performance, especially those that must evaluate values character by character. The following functions are known performance inhibitors:
This list is not exhaustive. It is served as an example to get you thinking about the different types of functions available in your database. For example, TO_CHAR() is a data-type conversion function. If TO_CHAR() inhibits performance, you can imagine that TO_DATE() and TO_ NUMBER() also do. Try to substitute database functions with operators. For example, in Oracle, the CONCAT() function can be replaced with the double pipe || to concatenate two strings.
Previous << 1 .. 135 136 137 138 139 140 < 141 > 142 143 144 145 146 147 .. 233 >> Next