Download (direct link):
In most cases, the ETL process needs to normalize any data contained in a OCCURS section of a mainframe file. Even though it is possible to manually program the ETL process to manage the repeating data, it is strongly recommended that you use a robust ETL tool that allows you to use the COBOL copybooks to define inputs or at least allows you to manually define input file arrays in some other way. If your tools do not support input arrays, you are stuck with the toil of writing code to deal with repeating groups within records sourced from your legacy mainframe systems.
Sometimes programmers use OCCURS to store different facts in an array, rather than storing the same fact N times. For example, suppose O-DATE occurs four times. The first date is CREATE, the second is SHIP, the third is ACKNOWLEDGE, and the fourth is PAYMENT. So in this case you don't normalize this OCCURS data but rather create discrete fields for each position in the array.
To ensure data integrity, model data that results from a COBOL OCCURS clause in a normalized fashion—a master table and child table—in the staging area of the data warehouse. It's good practice to stage this data in separate tables because the result of the process most likely loads data
into a fact table and a dimension, two separate dimensions, or two separate fact tables. We find that in these situations it makes sense to set data down to settle before integrating it with the data warehouse.
Managing Multiple Mainframe Record Type Files
The concept of multiple record type files is touched upon in the section that discusses REDEFINES. The main difference between REDEFINES as discussed earlier and what we're introducing now is that instead of having just a small portion of a record contain multiple definitions, the entire record has multiple definitions. Multiple record types are often used to span a single logical record across two or more physical records. Figure 3.10 contains an extract of a COBOL copybook that illustrates the concept of redefining an entire record.
In Figure 3.10, the REDEFINES clause applies to the entire record. So now, instead of the file carrying only an employee's basic information, it also carries an employee's job history with the company as well. In this file, every employee has at least two records: one EMP-RECORD and one JOB-RECORD. When an employee transfers to a new job, a new JOB-RECORD is added to the file. So an employee's total job history is contained in two or more records on the file: one EMP-RECORD and one or more JOB-RECORD(s).
In this file, the physical order of the records is critically important because the JOB-RECORDs do not have any information to link them to their corresponding EMP-RECORDs. The JOB-RECORDs for an employee follow immediately after his or her EMP-RECORD. So to accurately process the job history of an employee, you must treat two or more physically adjacent records as one logical record.
The benefit of using multiple record types is—once again—to save space. The alternative, without using relational theory, is to have extremely wide, space-wasting records to carry all data in a single record. If, for example, you want to track job history for up to five prior positions, you have to add 255 bytes to each employee record (the base EMP-RECORD, plus five occurrences of JOB-RECORD fields (5 x 51 bytes). But the number of job history field segments is situational—it depends on how many jobs an employee has held.
By using multiple record types, the mainframe system can store job history records only as needed, so employees with only one job require only one JOB-RECORD (70 bytes including FILLER), saving 185 bytes on the file. Furthermore, you are no longer limited to a fixed number of jobs in file. An unlimited number of 70-byte JOB-RECORDs can be added for each employee.
88 Chapter 3
05 REC-TYPE PIC 1
05 FIRST-NAME PIC X(
05 MIDDLE-INITIAL PIC X.
05 LAST-NAME PIC 5
05 SSN PIC cn
10 DOB-YYYY PIC 9 (4) .
10 DOB-MM PIC 9 (2) .
10 DOB-DD PIC 9 (2) .
05 EMP-ID PIC ñî
10 HIRE-YYYY PIC 9 (4) .
10 HIRE-MM PIC 9 (2) .
10 HIRE-DD PIC 9 (2) .
10 TERM-YYYY PIC 9 (4) .
10 TERM-MM PIC 9 (2) .
10 TERM-DD PIC 9 (2) .