Kimball R., Caserta J. The Data Warehouse ETL Toolkit. John Wiley & Sons, 2005. 526 p.
ISBN: 0-764-57923-1

Preparing for Bulk Load
Many of the ETL tools on the market today can stream data directly through the database bulk-load utility into the database table. But not all of the tools utilize the bulk-load utilities the same way. Some are more efficient than others, and some require extra plug-ins to make them compatible with bulk loaders. Regardless of how you pass data, as an ETL developer it is important that you understand how to prepare your data to be processed by a bulk-load utility.
Bulk loading is the most efficient way to get data into your data warehouse. A bulk loader is a utility program that sits outside the database and exists for the sole purpose of getting large amounts of data into the database very quickly. Each database management system has a different, proprietary bulk-load utility program. The popular ones are listed in Table 7.3.
Generally speaking, the various bulk-load utilities work in the same way. For the purpose of illustrating the functionality of a bulk-load utility, we'll discuss Oracle's SQL*Loader; at the time of this writing, we believe it is the common denominator of bulk loaders in the domain of data warehouses.
Table 7.3 Bulk Load Utilities

- Oracle: SQL*Loader. Requires a control file that describes the data-file layout. Two important parameters for optimal performance are DIRECT={TRUE | FALSE} and PARALLEL={TRUE | FALSE}.
- Microsoft SQL Server: Bulk Copy Program (BCP). Microsoft also offers BULK INSERT, which can be faster than BCP; it saves a significant amount of time because it doesn't need to utilize the Microsoft NetLib API.
- IBM DB2: DB2 Load Utility. DB2 accepts Oracle control and data files as input sources.
- Sybase: Bulk Copy Program (BCP). Also supports DBLOAD with the parameter BULKCOPY = 'Y'.
Once you understand the general concepts of bulk loading, the similarities among the loaders make learning each specific utility a breeze.
Bulk loaders typically need two files to function properly:
- Data file. The data file contains the actual data to be loaded into the data warehouse. Data can be in various file formats and layouts, including a variety of delimiters. All of these parameters are defined in the control file.
- Control file. The control file contains the metadata for the data file. The list of the various parameters is extensive. The basic elements of a SQL*Loader control file are:
  - The location of the source file
  - Column and field layout specifications
  - Data-type specifications
  - The data mapping from the source to the target
  - Any constraints on the source data
  - Default specifications for missing data
  - Instructions for trimming blanks and tabs
  - Names and locations of related files (for example, event log, reject, and discarded record files)
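To make these elements concrete, here is a minimal control-file sketch; the table name, file paths, and column names are hypothetical, not taken from any real system, and a production control file would typically carry more data-type and constraint clauses.

```
LOAD DATA
INFILE '/staging/orders.dat'          -- location of the source data file
BADFILE '/staging/orders.bad'         -- rejected records
DISCARDFILE '/staging/orders.dsc'     -- discarded records
APPEND
INTO TABLE orders_fact                -- hypothetical target table
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
TRAILING NULLCOLS                     -- missing trailing fields load as NULL
(
  order_key     INTEGER EXTERNAL,     -- data-type specifications
  order_date    DATE "YYYY-MM-DD",
  customer_key  INTEGER EXTERNAL,
  amount        DECIMAL EXTERNAL
)
```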
For a comprehensive guide to SQL*Loader command syntax and usage, refer to Oracle SQL*Loader: The Definitive Guide, by Jonathan Gennick and Sanjay Mishra (O'Reilly & Associates, 2001).
Even if you must pay the penalty for the I/O of writing data to a physical file before it is bulk loaded into the data warehouse, it is still likely to be faster than accessing the database directly and loading the data with SQL INSERT statements.
Many ETL tools can pipe data directly into the database via the bulk-load utility without having to place data on disk until it hits its final destination— the data warehouse fact table. Others can create the required control and data files on your file system. From there, you need to write a command-line script to invoke the bulk loader and load the data into the target data warehouse.
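As a sketch of that command-line step, the following Python snippet assembles a SQL*Loader invocation. The connect string, file names, and parameter choices are assumptions for illustration; in practice you would pass the resulting list to subprocess.run.

```python
import subprocess  # used when actually executing the loader

def sqlldr_command(control_file, data_file, log_file,
                   direct=True, parallel=False):
    """Assemble a SQL*Loader command line (credentials are placeholders)."""
    return [
        "sqlldr",
        "userid=dw_user/secret@DWPROD",               # hypothetical connect string
        f"control={control_file}",
        f"data={data_file}",
        f"log={log_file}",
        f"direct={'TRUE' if direct else 'FALSE'}",    # direct path bypasses the SQL engine
        f"parallel={'TRUE' if parallel else 'FALSE'}",
    ]

cmd = sqlldr_command("orders.ctl", "orders.dat", "orders.log")
# To execute for real: subprocess.run(cmd, check=True)
```

Building the command as a list (rather than one shell string) avoids quoting problems when file names contain spaces.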
The main purpose of purchasing an ETL tool is to minimize hand-coding any routines, whether extracting, transforming, or loading data. But no tool on the market can solve every technical situation completely. You'll find that seamlessly pipelining data through bulk loaders or any third-party load utilities will be a bit of a challenge. Experiment with tool plug-ins and other application extenders such as named pipes, and exhaust all options before you determine bulk loading is not feasible.
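One such extender can be sketched with a POSIX named pipe (FIFO): the ETL process writes rows into the pipe while the bulk loader reads the pipe as its data file, so the full data set never lands on disk. In the sketch below a plain file reader stands in for the bulk-load utility, and all paths are illustrative.

```python
import os
import tempfile
import threading

def stream_rows(rows, fifo_path):
    """Feed rows through a named pipe; a bulk loader would read fifo_path
    as its data file instead of the simple reader used here."""
    def writer():
        with open(fifo_path, "w") as pipe:   # blocks until a reader attaches
            for row in rows:
                pipe.write(row + "\n")

    t = threading.Thread(target=writer)
    t.start()
    with open(fifo_path) as pipe:            # stand-in for the bulk loader
        loaded = pipe.read()
    t.join()
    return loaded

workdir = tempfile.mkdtemp()
fifo = os.path.join(workdir, "orders.pipe")
os.mkfifo(fifo)                              # POSIX-only named pipe
result = stream_rows(["1,2024-01-05,42", "2,2024-01-06,17"], fifo)
```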
If you have not yet purchased your ETL tool, make sure to test potential products for their compatibility with your DBMS bulk-load utility during your proof-of-concept. If you already own an ETL tool and it cannot prepare your data for bulk loading, do not throw it away just yet. You need to prepare the data manually. Configure your ETL tool to output your data to a flat file, preferably comma delimited. Then, create a control file based on the specifications of the output file and required load parameters. The control file should need to be changed only when physical attributes change within the source or target, like when new columns are added or data types are modified.
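That manual preparation step can be sketched in Python as a small generator for a basic control file. The table and column names are placeholders, and a real control file would usually need data-type and date-format clauses as well.

```python
def make_control_file(table, data_file, columns, delimiter=","):
    """Render a minimal SQL*Loader control file for a delimited flat file."""
    column_list = ",\n  ".join(columns)
    return (
        "LOAD DATA\n"
        f"INFILE '{data_file}'\n"
        "APPEND\n"
        f"INTO TABLE {table}\n"
        f"FIELDS TERMINATED BY '{delimiter}' OPTIONALLY ENCLOSED BY '\"'\n"
        "TRAILING NULLCOLS\n"
        f"(\n  {column_list}\n)\n"
    )

ctl = make_control_file("orders_fact", "orders.dat",
                        ["order_key", "order_date", "customer_key", "amount"])
```

Regenerating the control file from the output specification this way keeps it in sync when columns are added or data types change.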