TS-642

Reading EBCDIC Files on ASCII Systems

NOTE: A Glossary of terms is included at the end of this document.

Introduction

This document focuses on reading EBCDIC files with SAS running on a PC, which is one of the questions most commonly asked of Technical Support.

When reading an external file into a SAS data set, SAS expects to find the data in the normal file structure for the operating system (OS) on which SAS is running. For instance, if you are running SAS on a PC with a Windows OS, SAS will expect ASCII encoded data with each line ending with a carriage return (CR) and line feed (LF). On a mainframe running MVS, SAS will not expect a CR & LF because the file structure does not use End-of-Record (EOR) markers. More on file structures will be discussed later in this paper.

EBCDIC is the character encoding system of mainframes, and ASCII is the encoding system on other machines such as VAX, PC, UNIX, and Macintosh. These two character sets represent the same data differently. For instance, on an EBCDIC system the character 4 has a value of 'F4'x, while its ASCII value is '34'x. Conversely, the same hex value has entirely different meanings on the two systems: the value '50'x is a '&' in EBCDIC but a 'P' in ASCII. Hexadecimal values are written in the form 'hl'x, where 'h' is the high order nibble (HON) and 'l' is the low order nibble (LON). Taken together, these two nibbles form a single byte of character or numeric data. Hexadecimal values are enclosed in quotes and followed by an x after the closing quote.
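The difference between the two encodings can be checked outside of SAS. The following is a minimal Python sketch, not part of the original note. It assumes that code page 037 (Python's "cp037" codec) is an acceptable stand-in for "EBCDIC"; your mainframe may use a different EBCDIC code page. The helper names hex_of and nibbles are hypothetical.

```python
# Assumption: code page 037 ("cp037") stands in for EBCDIC in this sketch.
EBCDIC = "cp037"


def hex_of(text, encoding):
    """Return the uppercase hexadecimal bytes of text in the given encoding."""
    return text.encode(encoding).hex().upper()


def nibbles(byte_value):
    """Split one byte into its high order nibble and low order nibble."""
    return byte_value >> 4, byte_value & 0x0F


# The same character has different byte values on the two systems.
four_ebcdic = hex_of("4", EBCDIC)
four_ascii = hex_of("4", "ascii")

# The same byte value is a different character on the two systems.
same_byte_ebcdic = bytes([0x50]).decode(EBCDIC)
same_byte_ascii = bytes([0x50]).decode("ascii")

# 'F4'x is made of a HON of F (15) and a LON of 4.
hon, lon = nibbles(0xF4)

print(four_ebcdic, four_ascii)
print(same_byte_ebcdic, same_byte_ascii)
print(hon, lon)
```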

As long as EBCDIC files stay on mainframes and ASCII files stay on PCs, there is no difficulty reading or writing data or understanding what's in the file. However, when data is generated on a mainframe but needs to be used on a PC, it generally, but not always, must be moved from the EBCDIC system to the PC system. There is more than one way to accomplish this. The most common method is to use a file transfer protocol (FTP) program to move the data between systems. The other method is to read it in place. Both methods will be covered later in this document.

Mainframe File Structure

On mainframes there are two different file structures, fixed and variable. Each file has a header that contains information about the file. In a fixed length file (RECFM=F), every record is exactly the same length. In the header, an attribute called Logical Record Length (LRECL) contains the length of each record in the file. The LRECL is available to SAS when the file is opened with an INFILE statement, so it doesn't need to be specified on the INFILE statement.

For the variable length file (RECFM=V), the LRECL is the maximum record length in the file. The actual length of each record, in bytes, is stored at the beginning of that record. This information is called the Record Descriptor Word (RDW). The RDW is a 4-byte Integer Binary (IB) field. The length of the record is stored in the first two of those bytes. The last two bytes are reserved by the OS. The number stored is the physical length of the record + 4 for the RDW.

The file also contains a Block Descriptor Word (BDW). If the block size is not specified when the file is created, it will be set by the operating system to the value of the Logical Record Length (LRECL) + 4. If the block size is defined when the file is created, it will have the specified value. Each block can contain one or more records. The BDW contains the number of bytes stored in the block, which is the sum of the lengths stored in the RDWs in that block + 4. Like the RDW, the BDW is a 4-byte IB field in which the length of the block is stored in the first two bytes and the last two bytes are reserved by the operating system.
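The length arithmetic for the RDW and BDW can be illustrated with a small Python sketch. It is not taken from the original note. The function names are hypothetical, the reserved bytes are simply written as zeros, and the payloads are arbitrary filler bytes.

```python
import struct


def make_rdw_record(payload):
    """Prefix a record with its RDW: 2-byte big-endian length (data + 4), then 2 reserved bytes."""
    return struct.pack(">HH", len(payload) + 4, 0) + payload


def make_block(payloads):
    """Prefix a group of RDW records with a BDW: 2-byte length (records + 4), then 2 reserved bytes."""
    records = b"".join(make_rdw_record(p) for p in payloads)
    return struct.pack(">HH", len(records) + 4, 0) + records


def read_descriptor(buf, offset):
    """Return the length stored in the first two bytes of an RDW or BDW."""
    length, _reserved = struct.unpack_from(">HH", buf, offset)
    return length


# One block holding a 10-byte record and a 25-byte record.
block = make_block([b"A" * 10, b"B" * 25])

bdw_length = read_descriptor(block, 0)
first_rdw = read_descriptor(block, 4)
second_rdw = read_descriptor(block, 4 + first_rdw)

print(bdw_length, first_rdw, second_rdw)
```

With these two records, the RDWs hold 14 and 29 (each data length + 4), and the BDW holds 47 (14 + 29 + 4), which is also the total size of the block.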

When SAS reads a file on the mainframe, the OS tells SAS everything it needs to know about the file. The RECFM and LRECL are part of the header for that file. If the RECFM=F, SAS knows that every record will be exactly the same length and the LRECL tells us the length. If the RECFM=V, SAS will read the RDW to find out how many bytes are in that record and pull those bytes into the input buffer for processing by the INPUT statement.

PC File Structure

On the PC, no information is stored in the header regarding the record format or length of records because none is needed. As far as the OS is concerned, they are all the same. The RECFM is variable and the LRECL is unlimited. This is because the standard file structure uses an End-Of-Record (EOR) marker to flag the end of each record. On the PC the EOR is a carriage return and line feed (CR & LF). Other systems use different combinations. For instance, UNIX systems use just the LF and Macintosh uses just the CR. A CR is '0D'x and a LF is '0A'x.

When SAS reads a file from disk on a PC, we have to establish defaults because no specifics about the file exist. The default RECFM is V and the default LRECL is 256. This means that SAS will scan the input record looking for the CR & LF that mark the EOR. If the marker isn't found within the limit of the LRECL, SAS discards the data from the LRECL value to the end of the record and adds a message to the Log that "One or more lines have been truncated". For instance, if you accept the default LRECL of 256 but have 300 bytes of data on a record, SAS will parse the first 256 according to the INPUT statement, discard the last 44, and write the "One or more lines have been truncated" message to the SAS log. Using the LRECL= option on the INFILE statement overrides the default.
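The truncation behavior described above can be imitated with a short Python sketch. This is only a model of the description in this section, not of SAS internals. The function name read_records and the sample data are hypothetical.

```python
def read_records(data, lrecl=256, eor=b"\r\n"):
    """Split data on the EOR marker and keep at most lrecl bytes of each record.

    Returns the kept records and the number of bytes discarded from each one.
    """
    lines = data.split(eor)
    if lines and lines[-1] == b"":
        lines.pop()  # the final EOR marker leaves an empty trailing piece
    kept = [line[:lrecl] for line in lines]
    discarded = [len(line) - len(k) for line, k in zip(lines, kept)]
    return kept, discarded


# One 300-byte record and one 100-byte record, each ended by CR & LF.
data = b"x" * 300 + b"\r\n" + b"y" * 100 + b"\r\n"

kept, discarded = read_records(data)            # default LRECL of 256
if any(discarded):
    print("One or more lines have been truncated")

kept_wide, discarded_wide = read_records(data, lrecl=300)  # like LRECL=300
```

With the default of 256, the first record keeps 256 bytes and loses 44. Raising the limit to 300 keeps every byte.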

Standard and Non-Standard Numeric Data

In standard numeric data on both types of systems, the HON of every digit has a fixed value. In EBCDIC the HON has a hexadecimal value of F. In ASCII it is 3. This means that the digits 0 - 9 in EBCDIC have a hexadecimal representation of 'F0'x - 'F9'x. In ASCII the hexadecimal representation is '30'x - '39'x. Non-standard numeric data makes use of the normally unused HON as well as the LON to store numeric values in a smaller number of bytes. Here's an example:

On a mainframe, if you store the number 505 as a plain numeric in a file, the number will be written as 'F5F0F5'x. If you write the same number as a packed decimal (PD) value, it will be written as '505C'x. The same number written as plain numeric on an ASCII system will be stored as '353035'x. When stored as an ASCII PD field, the value is '000505'x.
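The mainframe representations in this example can be reproduced with a minimal Python sketch. It is not part of the original note. It assumes the usual IBM packed decimal layout (one digit per nibble, with a final sign nibble of C for positive and D for negative) and uses code page 037 as a stand-in for EBCDIC. The function names are hypothetical.

```python
def standard_numeric(number, encoding):
    """Standard numeric: one character digit per byte."""
    return str(number).encode(encoding)


def pack_decimal(number):
    """Mainframe packed decimal: one digit per nibble, last nibble is the sign."""
    nibbles = str(abs(number)) + ("D" if number < 0 else "C")
    if len(nibbles) % 2:
        nibbles = "0" + nibbles  # pad to a whole number of bytes
    return bytes.fromhex(nibbles)


def unpack_decimal(field):
    """Reverse of pack_decimal."""
    nibbles = field.hex().upper()
    value = int(nibbles[:-1])
    return -value if nibbles[-1] == "D" else value


ebcdic_plain = standard_numeric(505, "cp037").hex().upper()
ascii_plain = standard_numeric(505, "ascii").hex().upper()
ebcdic_packed = pack_decimal(505).hex().upper()

print(ebcdic_plain, ascii_plain, ebcdic_packed)
print(unpack_decimal(bytes.fromhex("505C")))
```

The packed form needs two bytes where the plain EBCDIC form needs three, which is the space saving that non-standard numeric data is used for.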

Moving Data Between Platforms

There are multiple ways to move data files from a mainframe to a PC. For example, PCs may have devices available that can read 3480 and/or 3490 cartridge tapes created on a mainframe. The device can be used to read the data directly from the tape into the PC application or to copy the data from tape to the PC's hard disk. The more common method is to move the data with an FTP program.

By default, most FTPs will convert EBCDIC data to ASCII when going from the mainframe to the PC. When dealing only with standard numeric and character data this is the best approach. In this situation the file that you have on the PC has been properly converted from EBCDIC to ASCII and the FTP also creates the proper carriage control characters. You need only issue an INFILE statement pointing to the file created by the FTP and read it with an INPUT statement.

However, when the file contains non-standard numeric data, problems arise. This is because the FTP is designed to convert an EBCDIC character to the matching ASCII character. Non-standard numeric data, in some cases, will look like standard character data to the FTP, so the FTP will change it to the corresponding ASCII character. The problem here lies in the way the two different systems pack the data, as described above in Standard and Non-Standard Numeric Data. Here's an example of how things can go wrong.

Expanding on the example from above, suppose you have an EBCDIC data file that contains a numeric value of 505 stored as a packed decimal (PD) field. The data stored there is '505C'x. If you were to look at that data on the mainframe with a file browser or editor, you would see the characters &*. This is because the '50'x corresponds to a & and the '5C'x to a * in the EBCDIC code table. The browser or editor displays the characters associated with the hexadecimal values stored in the file. This file is fed through FTP and sent to a PC with the translation option in effect.

When this data goes through the FTP doing character conversion, the '50'x is changed to '26'x which is the hexadecimal representation for a & in ASCII. Likewise the '5C'x is changed to '2A'x to match the * in the ASCII code table. When you combine the hex values for those characters, you now have '262A'x. Looking again at the sample above, the numeric value of 505 should be stored as '000505'x on ASCII systems. These two hexadecimal values are not the same, so SAS cannot read the value 505 that was originally stored in the file. Instead it finds data that does not conform to PD characteristics for the PC. Since the data doesn't look the way a PD value should on the PC, SAS determines that this is invalid data and raises an error condition. An "Invalid data" message is generated in the SAS log for each nonconforming variable. SAS then writes the contents of the input buffer and DATA Step variables to the SAS log.
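The corruption described above can be reproduced with a few lines of Python. This is a sketch, not the behavior of any particular FTP program. It assumes code page 037 for the EBCDIC side, and the function name translate_ebcdic_to_ascii is hypothetical.

```python
def translate_ebcdic_to_ascii(data):
    """Imitate a character-mode transfer: map every EBCDIC byte to its matching character."""
    return data.decode("cp037").encode("latin-1")


packed_505 = bytes.fromhex("505C")            # 505 as mainframe packed decimal
translated = translate_ebcdic_to_ascii(packed_505)

print(packed_505.decode("cp037"))             # what a mainframe editor shows
print(translated.hex().upper())               # the bytes that arrive on the PC
```

The characters survive the trip (both sides show &*), but the bytes behind them change from '505C'x to '262A'x, so the packed value is lost.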

In some situations, a byte of data can be changed into '1A'x or '0D'x or '0A'x during translation. If SAS is reading the file with RECFM=V, when SAS encounters one of these bytes, unexpected and undesirable things will happen. Situations that can occur will range from "Invalid Data" messages to the premature termination of the DATA Step. The reason for this is that SAS honors all carriage control (CC) characters when using the default RECFM=V. If SAS encounters a '1A'x, SAS will stop reading the file when it reaches that byte. The value '1A'x is the End of File(EOF) marker in PC files.
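As an illustration, the following Python sketch shows a packed decimal field that arrives on the PC containing a carriage return. It is not part of the original note. It assumes code page 037 for the EBCDIC side and the usual mainframe packed decimal layout in which a sign nibble of D marks a negative value. The names are hypothetical.

```python
# Bytes that the text above says are treated specially on the PC.
CONTROL_BYTES = {0x0D: "CR", 0x0A: "LF", 0x1A: "EOF"}


def translate_ebcdic_to_ascii(data):
    """Imitate a character-mode transfer (assumes code page 037)."""
    return data.decode("cp037").encode("latin-1")


def find_control_bytes(data):
    """Return (offset, name) for every CR, LF, or EOF byte found in data."""
    return [(i, CONTROL_BYTES[b]) for i, b in enumerate(data) if b in CONTROL_BYTES]


# -10 as a 2-byte mainframe packed decimal: digits 0, 1, 0 and sign nibble D.
packed_minus_ten = bytes.fromhex("010D")
translated = translate_ebcdic_to_ascii(packed_minus_ten)
hits = find_control_bytes(translated)

print(hits)
```

The second byte of the field is '0D'x after the transfer, so a program that honors carriage control characters sees a CR in the middle of the numeric field.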

Solutions

The only way to overcome the problem of non-standard numeric data being corrupted by the FTP is to move the data without translating it. This will necessitate making some significant changes in your program. It may also require preprocessing the data file on the mainframe. The sections below list the different types of files and situations, a recommended approach to read in the file, and a sample program to accomplish the task.

If Your File Has RECFM=F

When the mainframe file is fixed length, the solution is simply to download the file in binary. On the FILENAME or INFILE statement, specify RECFM=F and the same LRECL that the file had on the mainframe. Use formatted input style with the $EBCDICw. informat for character data and the S370Fxxxw.d informats for numeric data.

NOTE: There are many S370Fxxxw.d informats. You will need to select those that match the type of data that you have. A complete list is available in the SAS Version 8 documentation, either in the online or printed version. For Version 6 refer to the SAS Language Reference, Version 6 First Edition and Technical Report P-242.

When reading an external file and specifying RECFM=F, SAS reads exactly the number of bytes specified in the LRECL without any regard to the data contained in the record. This means that the LRECL that you specify must be exact. When the file is transferred in binary there are no EOR markers at all. If there are hexadecimal values that would be considered control characters in a default read, SAS will ignore them.

RECFM=F Example:

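 /* Binary copy of the mainframe file; LRECL must match the original exactly */
 filename test1 'c:\fixed.txt' recfm=f lrecl=60;
 data one;
 infile test1;
 /* $EBCDICw. reads EBCDIC character data; S370FPDw.d reads packed decimal */
 input @1 name $ebcdic20. 
       @21 addr $ebcdic20. 
       @41 city $ebcdic15. 
       @56 state $ebcdic2. 
       @58 zip s370fpd3.;
 run;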
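
To make the fixed-length logic concrete, here is a rough Python equivalent of the layout used in the example above. It is only a sketch and not how SAS implements RECFM=F. It assumes code page 037 for the character fields and the usual mainframe packed decimal layout for the ZIP field. The sample data and all function names are invented for illustration.

```python
LRECL = 60  # must match the record length of the original file exactly


def unpack_decimal(field):
    """Decode mainframe packed decimal: digit nibbles followed by a sign nibble."""
    nibbles = field.hex().upper()
    value = int(nibbles[:-1])
    return -value if nibbles[-1] == "D" else value


def parse_fixed_record(record):
    """Slice one 60-byte record using the column layout of the SAS example."""
    if len(record) != LRECL:
        raise ValueError("record is not exactly LRECL bytes long")

    def text(start, length):
        return record[start:start + length].decode("cp037").rstrip()

    return {
        "name": text(0, 20),
        "addr": text(20, 20),
        "city": text(40, 15),
        "state": text(55, 2),
        "zip": unpack_decimal(record[57:60]),
    }


def read_fixed(data):
    """Cut the byte stream into LRECL-sized pieces; there are no EOR markers."""
    return [parse_fixed_record(data[i:i + LRECL]) for i in range(0, len(data), LRECL)]


def make_record(name, addr, city, state, zip_hex):
    """Build one invented sample record in the same layout."""
    chars = name.ljust(20) + addr.ljust(20) + city.ljust(15) + state.ljust(2)
    return chars.encode("cp037") + bytes.fromhex(zip_hex)


data = (make_record("JOHN SMITH", "12 OAK ST", "CARY", "NC", "27513C")
        + make_record("MARY JONES", "9 ELM AVE", "AUSTIN", "TX", "78701C"))
rows = read_fixed(data)
print(rows)
```

If the stream length is not a multiple of 60, the last slice is short and the sketch raises an error, which mirrors the requirement that the LRECL be exact.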

If Your File Has RECFM=V or VB

NOTE: In this section there are many references to RDW. In all cases variable length mainframe files are also blocked and therefore will also have a BDW. Except where noted, all references to the RDW also apply to the BDW.

If you have a 34xx cartridge tape reader attached to your PC then you will not need to preprocess the file on the mainframe. You will be able to read it directly from the tape with the RECFM=V example program below.

When an FTP program is used on variable length data files going from an EBCDIC system to an ASCII system, the RDW is stripped off and an EOR marker is inserted in the proper location. The data in the file is converted from EBCDIC to ASCII. This is fine if the file contains only standard character and numeric data. However, when dealing with non-standard numeric data, converting the file from EBCDIC to ASCII corrupts the non-standard numeric data, as described above. To prevent this corruption, the file needs to be left unchanged when it is moved. You can download the file in binary transfer mode, which will move the data without converting it. The problem with binary transfer mode is that, due to the design of FTP programs, the RDW is stripped off even if the file is moved without translation. This causes a problem because SAS depends on the RDW information to read the EBCDIC file even when that file is being read on a PC.

There are two ways to overcome this problem when using FTP. The first is to read the file directly off the mainframe if you have direct access between the PC and the mainframe. This is the recommended method. The second is to reformat the file on the mainframe with IEBGENER prior to downloading it to the PC. Changing the format of the file will cause the FTP program to move it without stripping the RDW and BDW. Then download the reformatted file in binary mode to make this file readable on the PC.

FILENAME FTP Access Method

The preferred method of reading an EBCDIC file on a PC is to access the file directly on its mainframe host. This can be done through the FTP access method in a FILENAME statement. There are some advantages to using this method.

  • You do not need to preprocess the file.
  • You do not need to put a copy of the file on your PC.
  • You can use the FTP access method to read fixed or variable length files.
  • The logic of the DATA Step will remain the same.
One disadvantage is that it will require more time to process the file because it is remotely accessed.

The FTP access method uses the FTP program that you have available to open a connection from your PC to your mainframe. The SAS System connects to and logs onto the mainframe under the user account you provide. The FTP program on the PC then downloads the file.

In the FILENAME statement, specify the access method (FTP), the name of the file to be read, and the options HOST=, USER=, and either PASS= or PROMPT. HOST= is the name of the mainframe system that you're going to read the file from. USER= is the userid that you will log in with. You'll need to provide the password for that account, using either PASS='your_pass_word' or PROMPT. If you use PASS=, your password will be visible in your SAS program, but it will be removed when the program is reprinted in the SAS log. If you specify PROMPT, SAS will prompt you for your password at execution time.

Most importantly for variable length files, you need to specify the options S370V and RCMD="site rdw". The S370V option indicates that the file being read is a variable length EBCDIC file. The RCMD="site rdw" option tells the FTP server to include the RDW as the file is downloaded. If you have trouble with the connection to your mainframe, you can add the DEBUG option to obtain informational messages that are sent to and received from the FTP server.

In the following example, a variable length file is being read from an MVS host system. The user will be prompted for their MVS logon password. The comments field varies in length, up to 200 characters.

FTP Access Method Example:

filename test1 ftp "'SASXXX.VB.TEST1'" HOST='MVS' USER='SASXXX' PROMPT
           s370v RCMD='site rdw';
data one;
infile test1;
input @1 name $ebcdic20. 
      @21 addr $ebcdic20. 
      @41 city $ebcdic10.
      @51 st $ebcdic2. 
      @54 zip s370ff5. 
      @60 comments :$ebcdic200.;
run;

proc print;
run;

Preprocessing The File With IEBGENER

If you're not familiar with IEBGENER and how to use it, you'll need to find someone at your site who can assist you. IEBGENER is going to make an exact copy of the file, but alter the header information so that your FTP program won't modify the file by removing the RDW. IEBGENER will change the header information for the file from RECFM=V to RECFM=U. When the FTP sees RECFM=U, it will not attempt to remove the RDW because the FTP will consider the RDW as part of the data.

In addition to the required arguments for IEBGENER, specify the following overrides:

NOTE: Do not use the original values of RECFM and BLKSIZE for SYSUT1.

SYSUT1 DCB(RECFM=U,BLKSIZE=32760)
SYSUT2 DCB(RECFM=U,BLKSIZE=32760) DISP=(NEW,CATLG)

Download the new version of the file in binary. On the FILENAME or INFILE statement, specify the RECFM as S370V if the original file was RECFM=V, or S370VB if the original file was RECFM=VB. Also specify on the FILENAME or INFILE statement the same value for LRECL that the original file had. For instance, if the original file had RECFM=VB and LRECL=600, then on the FILENAME or INFILE statement you should specify RECFM=S370VB LRECL=600. RECFM=S370V or S370VB tells SAS to look for the RDW on each record, read that number of bytes, and process it as an individual record. Even though we have the RDW available, the default LRECL of 256 is still in effect when SAS is running on the PC. This is because the LRECL is set at compile time when the file is opened, not as it is read. Therefore, if the LRECL of the original file was greater than 256, you will need to specify LRECL on the FILENAME or INFILE statement. And because this file contains EBCDIC data, you will need to use $EBCDICw. for character data and S370Fxxxw.d informats to read the numeric data.

In the following example, the TRUNCOVER option is included in the INFILE statement because the COMMENT variable can be up to 60 characters, but likely will be shorter. The standard rules for reading data with an INPUT statement still apply. Without the TRUNCOVER option, the INPUT statement could attempt to read past the end of the record. This would cause SAS to go to the next input record and read data until the COMMENT variable had been filled. The LRECL is not specified because the default value of 256 is sufficient to handle the longest record in the file.

RECFM=V Example:

filename test1 'c:\vbtest.xfr' recfm=s370vb;
data one;
infile test1 truncover;
input @1 name $ebcdic14.
      @15 addr $ebcdic18.
      @33 zip s370ff5.
      @38 comment $ebcdic60. ;
run;
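
The same idea can be sketched in Python to show what reading by RDW involves. This is an illustration only, not SAS's implementation. It assumes the downloaded file still carries its BDWs and RDWs, uses code page 037 for the character data, and treats the ZIP field as unsigned EBCDIC digits. The sample data and all names are invented.

```python
import struct


def iter_vb_records(data):
    """Yield the data portion of each record, using the BDW and RDW lengths."""
    pos = 0
    while pos < len(data):
        block_len = struct.unpack_from(">H", data, pos)[0]
        if block_len < 4:
            raise ValueError("invalid BDW")
        block_end = pos + block_len
        pos += 4                                   # step over the BDW
        while pos < block_end:
            rec_len = struct.unpack_from(">H", data, pos)[0]
            if rec_len < 4:
                raise ValueError("invalid RDW")
            yield data[pos + 4:pos + rec_len]      # record data without its RDW
            pos += rec_len


def parse_record(payload):
    """Slice one record with the layout of the SAS example above."""
    def text(start, length):
        # A slice that runs past the end of the record simply comes back
        # shorter, which is similar in spirit to TRUNCOVER.
        return payload[start:start + length].decode("cp037").rstrip()

    return {
        "name": text(0, 14),
        "addr": text(14, 18),
        "zip": int(text(32, 5)),
        "comment": text(37, 60),
    }


def with_rdw(payload):
    """Add an RDW to an invented sample record."""
    return struct.pack(">HH", len(payload) + 4, 0) + payload


def as_block(records):
    """Add a BDW in front of a group of RDW records."""
    body = b"".join(records)
    return struct.pack(">HH", len(body) + 4, 0) + body


rec1 = ("ANN LEE".ljust(14) + "12 OAK ST".ljust(18) + "27513" + "CALL AFTER 5").encode("cp037")
rec2 = ("BOB RAY".ljust(14) + "9 ELM AVE".ljust(18) + "78701" + "OK").encode("cp037")
data = as_block([with_rdw(rec1), with_rdw(rec2)])

rows = [parse_record(p) for p in iter_vb_records(data)]
print(rows)
```

Each record is delimited by its own RDW, so a short comment on one record never pulls in bytes from the next record.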

Other File Issues

As stated above, if a variable record length file is transferred to the PC without being processed through IEBGENER, the RDW and BDW will be stripped off whether transferred in binary or with translation. Without this information SAS will not be able to read an unstructured file. There is one type of variable record length file that SAS will be able to read without the RDW and BDW. This file is commonly referred to as an occurs file.

Occurs Files

In an occurs file, each record is made of three parts: a header section, an index variable, and one or more occurrences of a data segment. The header section is a fixed length record segment containing information that pertains to all of the data segments that follow. The index variable tells how many data segments follow. The third part is one or more data segments.

This file structure can be read without an RDW and BDW because even though the record length is variable, it is predictable. The documentation for the file will provide the length of the record header, the index variable, and the data segments. With this information you can calculate the length of each record. To find the length of a given record, add the length of the header, the length of the index variable, and the product of the index variable's value and the length of each data segment. The logic is to read the header portion and index variable, then enter an iterative DO loop that goes from 1 to the value of the index variable. Inside the DO loop, an INPUT statement reads each data segment, and an OUTPUT statement writes each segment to the SAS data set as an observation, including the header section.
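The length calculation and the read loop can be sketched in Python as follows. This is an illustration of the logic only. The layout is hypothetical (a 20-byte header, a 2-byte index variable stored as EBCDIC digits, and 12-byte data segments) and is deliberately simpler than the SAS example later in this section. Code page 037 is assumed for the EBCDIC data.

```python
# Hypothetical layout, normally taken from the documentation for the file.
HEADER_LEN = 20    # fixed header section
INDEX_LEN = 2      # index variable, stored as EBCDIC digits
SEGMENT_LEN = 12   # each data segment: 10-byte name + 2-digit count


def record_length(count):
    """Header + index variable + (index value x segment length)."""
    return HEADER_LEN + INDEX_LEN + count * SEGMENT_LEN


def read_occurs(stream):
    """Walk a byte stream with no record markers and emit one row per data segment."""
    rows = []
    pos = 0
    while pos < len(stream):
        header = stream[pos:pos + HEADER_LEN].decode("cp037").rstrip()
        index_start = pos + HEADER_LEN
        count = int(stream[index_start:index_start + INDEX_LEN].decode("cp037"))
        seg_start = index_start + INDEX_LEN
        for i in range(count):
            seg = stream[seg_start + i * SEGMENT_LEN:seg_start + (i + 1) * SEGMENT_LEN]
            item = seg[:10].decode("cp037").rstrip()
            years = int(seg[10:12].decode("cp037"))
            rows.append((header, item, years))     # header repeated on every row
        pos += record_length(count)                # jump to the next record
    return rows


rec1 = "SMITH".ljust(20) + "02" + "FORD".ljust(10) + "05" + "HONDA".ljust(10) + "11"
rec2 = "JONES".ljust(20) + "01" + "VOLVO".ljust(10) + "03"
stream = (rec1 + rec2).encode("cp037")

rows = read_occurs(stream)
print(record_length(2), record_length(1))
print(rows)
```

Because each record's length follows from its own index variable, the position of the next record is always known even though nothing in the stream marks where a record ends.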

To read the file you must use RECFM=N which tells SAS that you are reading a stream of data that will not conform to a typical file structure. SAS will treat the file as a very long single record. Logical record length restrictions are lifted when reading a data stream because SAS does not attempt to buffer the input. You will get a note in the SAS log saying "UNBUFFERED is the default with RECFM=N". This is normal. SAS looks at each INPUT statement and reads only the number of bytes required to satisfy that INPUT statement.

Because SAS sees this file as a single record, you must use relative column pointers in your INPUT statement. Line holders have no effect because there is only one line. Your INPUT statements should not contain any @column pointer control. For instance, if you use an "INPUT @column variable $10.;" structure in your INPUT statement, your program will go into an infinite loop and continue reading and outputting the same data over and over until your hard drive is full or you halt the program. You must use relative pointer control.

The following example reads a simple occurs file. The fixed portion of the data is 62 bytes in length and contains a mixture of standard character and numeric data. The repeating portion of the data is 13 bytes in length and a combination of standard character and numeric data.

Occurs Example:

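 /* RECFM=N reads the file as one continuous stream of bytes */
 filename test1 'c:\VB.TEST' recfm=n;

 data one;
 infile test1;
 /* Fixed 62-byte portion, ending with the index variable IDX */
 input name $ebcdic20. addr $ebcdic20. city $ebcdic10.
       st $ebcdic2. +1 zip s370ff5. +1 idx s370ff2. +1;
 do i = 1 to idx;
   /* Read one 13-byte data segment and write it out with the header values */
   input cars $ebcdic10. +1 years s370ff2. ;
   output;
   /* Move past one byte before the next data segment */
   if i lt idx then input +1 ;
 end;
 run;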

Conclusion

In most circumstances you can successfully read an EBCDIC file into a SAS data set on an ASCII system. As stated in the introduction, the techniques described here were developed primarily for PC SAS users, but with modifications to the INFILE syntax they will work on any ASCII platform that runs the SAS System Release 6.11 and later. The examples in this paper were developed on a Windows NT system running SAS Version 8.0 TSM0.

Abbreviations and Definitions.

  • EBCDIC - Extended Binary Coded Decimal Interchange Code
  • ASCII - American Standard Code for Information Interchange
  • FTP - File Transfer Protocol
  • Hexadecimal values will be capitalized and quoted with a lower case x following the closing quote.
  • Packed Decimal data is abbreviated PD.
  • Integer Binary data is abbreviated IB.
  • RDW Record Descriptor Word. A 4-byte IB value that describes the length of a record in a variable record length mainframe file. The data is stored in the first two bytes. The last two bytes are reserved by the operating system.
  • BDW Block Descriptor Word. A 4-byte IB value that describes the length of a data block in a blocked mainframe file. The data is stored in the first two bytes. The last two bytes are reserved by the operating system.
  • EOR End-Of-Record marker. Normally some combination of '0D'x, a Carriage Return (CR) and '0A'x, a Line Feed (LF).
  • EOF End Of File marker. '1A'x on PC's and some other ASCII platforms.
  • HON High Order Nibble
  • LON Low Order Nibble
  • RECFM= RECord ForMat
  • RECFM=F is Fixed record length, all records are the same size.
  • RECFM=V is Variable length records. The lengths of the records are not necessarily the same.
  • RECFM=N Reads data as a data stream without any record structure. LRECL has no meaning in this type of read.
  • RECFM=U (EBCDIC systems): Undefined record length. Unblocked. Basically the same as N above.
  • LRECL= Logical RECord Length. Number of bytes on each record for RECFM=F files or maximum record length for Variable length files on ASCII machines. Maximum record length +4 on mainframes.
  • S370F Informats - These are Informats that were created to read special numeric values like PD or IB from EBCDIC files on ASCII machines. They all begin with S370F.
  • Standard Character Set - For the purpose of this paper, the Standard Character Set will mean those characters that can be typed in from a standard keyboard.