in black and white
Main menu
Share a book About us Home
Biology Business Chemistry Computers Culture Economics Fiction Games Guide History Management Mathematical Medicine Mental Fitnes Physics Psychology Scince Sport Technics

The data worehouse ETL Toolkit - Kimpball R.

Kimpball R., Caserta J. The data worehouse ETL Toolkit - John Wiley & Sons, 2005. - 526 p.
SBN: 0-764-57923-1
Download (direct link): thedatawarehouse2005.pdf
Previous << 1 .. 52 53 54 55 56 57 < 58 > 59 60 61 62 63 64 .. 233 >> Next

Unfortunately, telling business users that data could not be derived and would not be available in the data warehouse was not an option. The team's solution was to redirect the report containing the required data to output to a flat file and use the prederived data from the report as a data source. Because of the nature of the report—fixed columns—it was simply treated as a positional flat file and processed with the rest of the ETL processes.
92 Chapter 3
Field Name Length Start End Type Comments
Record Type 2 1 2 AlphaNumeric Type of record can be either 'H' for header or 'D' for detail record.
SSN 9 3 11 Numeric Employee's Social Security Number.
First Name 20 12 31 AlphaNumeric First name of employee
Middle Initial 1 32 32 AlphaNumeric Middle initial of employee
Last Name 20 33 52 AlphaNumeric Last name of employee
Name Suffix 5 53 57 AlphaNumeric Jr, Sr, III, etc.
Birth Date 8 58 65 Numeric Employee's data of birth "YYYYMMDD".
Status Code 6 66 71 Numeric Employee's Status ('A','R', 'T', etc).
Office Code 2 72 73 Numeric The code of the employees branch office.
Department Code 2 74 75 Numeric The code of the employees department within his/her office.
Position Code 2 76 77 Numeric The code of the employees position in the organization.
Filler 1 78 78 AlphaNumeric Filler space.
Add Date 8 79 86 Numeric The date the record was add to the system.
Modified Date 8 87 94 Numeric The date the record last modified.
Figure 3.12 Fixed length flat file layout.
Processing a fixed length flat file requires a file layout to illustrate the exact fields within the file, as illustrated in Figure 3.12. A fixed length file layout should include the file name, where the field begins; its length; and its data type (usually text or number). Sometimes, the end position is supplied. If it is not, you have to calculate the end position of each field based on its beginning position and length if it is required by your ETL tool.
In most ETL tools, you most likely have to manually input the file layout of the flat file once. After the layout is entered, the tool remembers the layout and expects that same layout each time it interacts with the actual flat file. If the file layout changes or the data shifts off of its assigned positions, the ETL process must be programmed to fail. Unfortunately, unlike XML, no implicit validation of the file layout occurs when you process fixed length flat files—an explicit preprocess test must be successful before the data is processed.
¿ When processing fixed length flat files, try to validate that the positions of the data in the file are accurate. A quick check to validate the positions is to test any date (or time) field to make sure it is a valid date. If the positions are shifted, the date field most likely contains alpha characters or illogical numbers. Other fields with very specific domains can be tested in the same way. XML offers more concrete validation abilities. If data validation or consistency is an issue, try to convince the data provider to deliver the data in XML format.
Positional flat files are often indicated on the file system by a .TXT extension. However, positional flat files can have virtually any file extension—or none at all—and be processed just the same.
Extracting 93
Processing Delimited Flat Files
Flat files often come with a set of delimiters that separate the data fields within the file. Delimiters are used as an alternative to using positions to describe where fields begin and end. Delimited files can use any symbol or group of symbols to separate the fields in the flat file. The most common delimiter is the comma. Comma-delimited files can usually be identified by the .CSV extension on the file name. Obviously, however, other application-specific delimited flat files may simply have a .TXT extension or no extension.
Most ETL tools have a delimited file wizard that, once the developer indicates the actual delimiter characters, scans the flat file, or a sample of it, to detect the delimiters within the file and specify the file layout. Most often, the first row of delimited files contains its column names. The ETL tool should be intelligent enough to recognize the column names supplied in the first row to assign logical column names in the metadata layer and then ignore the row during all subsequent data processing.
Previous << 1 .. 52 53 54 55 56 57 < 58 > 59 60 61 62 63 64 .. 233 >> Next