|
DATA Step Programming Using the MODIFY Statement AbstractThis article introduces the MODIFY statement and provides information that will enable DATA step programmers to incorporate the MODIFY statement into new and existing applications. Topics presented include the MODIFY statement's access methods, statements used in conjunction with the MODIFY statement, and examples of specific applications. Readers should be experienced DATA step programmers who are familiar with SET, MERGE, and UPDATE processing. The article assumes a knowledge of the terms sequential, direct, and transaction as they apply to the processing of SAS data sets. In addition, readers should be familiar with the two types of SAS data sets, SAS data files and SAS data views, and how they differ. A knowledge of SAS indexes is assumed as well. Introduction The MODIFY statement, new in Release 6.07 of the SAS System for MVS, CMS, and VMS operating systems, adds a new dimension to DATA step programming. Prior to Release 6.07, all DATA step processing of SAS data sets was performed by executing the SET, MERGE, and UPDATE statements. With the introduction of the MODIFY statement, many limitations imposed by the SET, MERGE, and UPDATE statements are lifted. For example, the MODIFY statement
Probably the most powerful feature of the MODIFY statement is the ability to update SAS data files in place: a new concept for Release 6.07 under MVS, CMS, and VMS. Release 6.06 and Version 5 DATA step applications performing update operations on existing data files require that an existing data file be replaced by the new 'updated' copy. This type of processing increases storage requirements for the duration of the DATA step execution. Applications can fail when insufficient disk space is available to hold both the existing data file and the new copy as it is being written. The MODIFY statement resolves the space problem for applications that delete or modify observations by allowing the existing data file to be updated in place; no copy of the file is produced. Availability of disk space can still be a concern in applications using the MODIFY statement to add new observations. Note, however, that even in the append case, observations are added to the end of the existing data set and not to a copy. Some applications may still require that a copy of the data file be created. For example, DATA step applications that take an existing data file and either drop or add new variables or otherwise alter the structure of the descriptor portion of the data file require creation of a new data file. Applications of this type cannot use the MODIFY statement, but must use the SET, MERGE, or UPDATE statement. The MODIFY statement can only be used to update values of existing data set variables; no changes can be made with respect to the structure of the program data vector that is created at execution time for the data set being modified. Use of the MODIFY statement is dependent on the type of processing to be performed: sequential, matching, or direct. Examples of each type of processing are presented using a sample data file. In addition to gaining an understanding of how the MODIFY processing takes place, the examples illustrate the required syntax for each mode of access. Sample Data The sample SAS data file INVTY.STOCK is used in each example and is described below. This is the same permanent data file described and used in SAS Technical Report P-222, "Changes and Enhancements to BASE SAS Software". The technical report can be referenced for a complete discussion of MODIFY statement syntax. The SAS data file contains five variables and a simple index defined for variable PARTNO. Variables: PARTNO - unique numeric value for part number DESC - character text description of each part INSTOCK- numeric value representing number of units in stock RECDATE- SAS date value representing when data set values are current PRICE - numeric value representing part unit price Indexes: PARTNO - simple index composed of PARTNO variable values Sequential Access The most straightforward application of the MODIFY statement uses sequential access to update a SAS data set. In this example, values of the variable RECDATE need to be modified to contain the current date.
libname invty 'SAS-data-library';
data invty.stock;
modify invty.stock;
recdate=today();
run;
In this application, note that the MODIFY statement and the DATA statement reference the same data file. MODIFY processing requires that the first data set referenced in the MODIFY statement match at least one of the data sets named in the DATA statement. In the example, the SAS System reads an observation from the STOCK data file and replaces the existing value of RECDATE with the value returned by the TODAY function. The DATA step performs this process for each observation in the data file STOCK using a sequential access method. Note that had the SET statement been used to perform this same task, a new copy of data file STOCK would exist in the library referenced by INVTY at DATA step termination. This new copy would replace the existing copy of data file STOCK. Matching Access The matching access method takes information from a transaction data set and uses the information to locate and update observations in a master data set. The term 'master data set' is defined as the data set being modified. Illustrating the matching access method requires a transaction data set. The transaction data set can be a temporary or permanent SAS data set that is generated from an interactive data entry application or some other SAS application. In the example, transaction data set ADDINV is a temporary SAS data file containing two variables. The variable PARTNO in the transaction data set corresponds to the PARTNO variable in the master data file, INVTY.STOCK. Variables: PARTNO - values correspond to part numbers in INVTY.STOCK NWSTOCK- numeric value representing quantities of newly received stock The matching access method requires that a BY statement be used to match observations from the transaction data set to observations in the master data set. The example below uses the matching access method to apply updates to the master data set INVTY.STOCK using information from transaction data set ADDINV.
data invty.stock;
modify invty.stock addinv;
by partno;
instock=instock + nwstock;
recdate=today();
run;
The code above shows that the MODIFY statement syntax for the matching access method is almost identical to that of the UPDATE statement. Both statements require that a master and transaction data set be specified with the master data set being listed first. Both statements also require a BY statement to list the variables whose values from the transaction data set are used to locate observations in the master data set. However, the processing that takes place during execution differs. An important concept to stress is how the matching access method retrieves observations for update. During execution,
The current observation is processed by the DATA step code and is either replaced, deleted from, or appended to the master data set. Note that the matching access method is the only method that allows more than one data set to be specified on the MODIFY statement. Specifying two data sets implies matching access and requires that a BY statement be used to name the variable(s) to be used in the matching process. Remember that the use of MODIFY in this case does not require that the master and transaction data sets be sorted in order of the BY variable(s). Sorting is not required because BY variable values from the transaction data set are used to create dynamic WHERE expressions. These expressions are applied to the master data set for search and retrieval operations. While sorting master and transaction data sets is not required, access to the master data set is faster if the data sets are sorted. The explanation centers around the fact that a sequential access method is used to retrieve observations from the master data set using the generated WHERE expressions. Sorting both the master and transaction data sets can dramatically reduce the I/O required to retrieve the target observation. In addition, quicker access to the master data set is possible if an index is defined for the BY variable on the master data set. Using the index structure to locate and retrieve master data set observations can be much faster than sequential access to the data set itself, especially in cases where the master data set is very large and not sorted in order of the BY variable(s). As mentioned earlier, one of the restrictions of the UPDATE statement is that processing does not continue if duplicate values of the BY variable are detected in the master data set. This restriction does not apply to the MODIFY statement. Duplicate values of the BY variable can exist in both the master and transaction data set. While duplicate values are permitted, specific rules dictate how the updates are applied from the transaction data set. The following rules always apply when duplicate values exist:
For the matching access example above, the term consecutively means if ADDINV contains more than one observation with a specific value for PARTNO, each value of NWSTOCK associated with the duplicate values of PARTNO is added to INSTOCK in succession. The new value for INSTOCK at DATA step termination is the sum of all NWSTOCK values from ADDINV for the specific PARTNO plus the original value of INSTOCK from INVTY.STOCK. If both INVTY.STOCK and ADDINV contain more than one observation with a specific value for PARTNO, then the result just described is applied to the first occurrence of INSTOCK from the master data set INVTY.STOCK. The remaining occurrences of INSTOCK associated with the duplicate values of PARTNO in the master data set remain unchanged. This behavior is in accordance with the rules stated above. The term, 'first occurrence', is defined as the first observation retrieved by the method being used (sequential, sorted sequential, or indexed) to retrieve observations from the master data set. If the matching access example had explicitly replaced existing values of INSTOCK rather than sum or accumulate values then the term consecutively takes on a slightly different meaning. If ADDINV contains more than one observation with a specific value for PARTNO, replacing the statement instock=instock + nwstock; with the statement instock=nwstock; causes the value of INSTOCK at DATA step termination to be equal to the value of NWSTOCK from the last observation in transaction data set ADDINV for the specific PARTNO value. A final note on the matching access method is that updates can be applied in an 'automatic' manner if the master and transaction data sets contain variables other than the BY variables that are common to both data sets. To illustrate, if NEWP is modified to contain variables PARTNO and INSTOCK, and the application objective is to replace values in the master data set with values from the transaction data set, then the following code accomplishes the task.
data invty.stock;
modify invty.stock addinv;
by partno;
run;
In the example above no assignment statement is necessary because non-missing values for INSTOCK from transaction data set ADDINV overlay INVTY.STOCK values of INSTOCK in the program data vector. The modified observations are then re-written to data set INVTY.STOCK. In this example which uses the MODIFY statement, the rules that apply to updates of the master data set from transaction data set observations with missing values are the same as for UPDATE processing. Direct Access by Observation Number The direct access method supports two types of random or direct access. One type uses observation numbers to directly access the data file being modified. The other type takes advantage of an existing index to provide direct access capability. The first type, direct access by observation number, uses the POINT= option, and the optional NOBS= option, in the MODIFY statement. This syntax is similar to that which can be used with the SET statement, so you may already be familiar with this type of access. The POINT= and NOBS= parameters provide the same information during MODIFY execution as they do during execution of a SET statement. To illustrate this form of direct access, assume you have another SAS data file named NEWP that contains the following variables: TOOL OBS - contains the number of the observation to be modified NEWPRICE - contains the new price for the part defined by TOOL_OBS The following example uses the information in data file NEWP and illustrates modifying data file INVTY.STOCK using direct access by observation number.
data invty.stock;
set newp;
modify invty.stock point=tool_obs;
price=newprice;
recdate=today();
run;
Data file NEWP is referenced in the SET statement which reads in values to be supplied to the TOOL_OBS and NEWPRICE variables. Although SAS data file NEWP supplies information to the MODIFY statement, NEWP is not a transaction data set as it does not appear as the second data set in the MODIFY statement. A transaction data set would be processed during the MODIFY statement execution; SAS data file NEWP is processed during SET statement execution. Any additional data sets processed during the DATA step execution for the direct access methods must appear on a separate SET, MERGE, or UPDATE statement. Variable TOOL_OBS is listed as the value of the POINT= option. Variable NEWPRICE is referenced in the assignment statement to replace existing values of PRICE in INVTY.STOCK. As the SET statement executes, values of TOOL_OBS are read from NEWP and supplied to the execution of MODIFY which uses the values to retrieve observations directly from INVTY.STOCK. The observation number is actually used as a key to allow direct retrieval of observations. Note that the concept of an observation number applies only to non-compressed SAS data files. When using POINT= to perform direct access you should include a method to either check for invalid values of the POINT= variable or check the value of _ERROR_ after the execution of the MODIFY statement. The example above can be modified to incorporate a method for checking the status of automatic variable _ERROR_ after the MODIFY statement execution. The modified code appears below.
data invty.stock;
set newp;
modify invty.stock point=tool_obs nobs=max_obs;
if _error_ = 1 then do;
put 'ERROR occured for TOOL OBS= ' tool_obs /
'during DATA step iteration ' _n_ /
'TOOL_OBS value may be out of range for data set.';
_error_=0;
stop;
end;
price=newprice;
recdate=today();
run;
The above example could have included code that validated the TOOL_OBS value prior to execution of the MODIFY statement. The ability to validate TOOL_OBS is made possible by the fact that the TOOL_OBS value must be in the range of 1 to MAX_OBS. MAX_OBS is assigned a value at compile time equal to the total number of observations in INVTY.STOCK. The value is then available at execution time to incorporate into a method to validate TOOL_OBS values. Implementing methods such as the ones presented here ensures that results of the DATA step execution are not unpredictable. Direct Access Using Indexes The second type of direct access uses an existing index to access observations in the master data file. The example below illustrates the direct access method using the index which has been defined for variable PARTNO. Data file ADDINV is used again in this example to supply PARTNO values to use for index read operations and to supply NWSTOCK values to use in updating the INSTOCK variable. ADDINV appears in the SET statement to supply information to the keyed, direct access method being used during the execution of the MODIFY statement.
data invty.stock;
set addinv;
modify invty.stock key=partno;
instock=instock+nwstock;
recdate=today();
run;
The KEY= option on the MODIFY statement informs the SAS System that an index should be used with a specific name for the data set to be modified. In this case, the SET statement executes and reads observations from data file ADDINV to supply index values via the PARTNO variable. The PARTNO values are used to access the appropriate observations from the STOCK data set. Note that the STOCK data set does not have to be sorted and also that no dynamic WHERE processing takes place for this access method; the index structure is used to locate observations for update. For this type of access, it is appropriate to discuss the situation where multiple observations in either the data file being modified or the data set supplying the index values, or both, contain the same values for the index. Discussed first are the cases where either the master data set or the data set supplying key values, not both, contain observations with duplicate values for the PARTNO index.
Discussed now are the situations where duplicate index values occur in multiple observations of both data sets. Update behavior is dependent upon whether the observations containing the duplicate index values in data file ADDINV are all grouped together or whether they occur as non-contiguous observations.
Based on the results discussed above, it is recommended that the UNIQUE option be used in cases where observations from the data set supplying index values are group together unless the application intends for a one-to-one retrieval and update operation to take place. Statements Supported By the MODIFY Statement The unique features of the MODIFY statement are the ability to rewrite, delete and append observations to a SAS data set. The REPLACE, REMOVE and OUTPUT statements can be used in conjunction with the MODIFY statement to support these features respectively. The syntax of the REPLACE and REMOVE statements is similar to that of the OUTPUT statement. The difference is that only one data set reference can be made on the REMOVE or REPLACE statements whereas OUTPUT supports multiple data set references. Remember, only one data set can be the target of a MODIFY execution. The OUTPUT, REPLACE and REMOVE statements perform the following function:
OUTPUT the current observation is written to the end of the data set
REPLACE the current observation is rewritten and contains any
information modified up to the point of execution of the
REPLACE statement
REMOVE the current observation is deleted from the data file being
modified
When using the OUTPUT, REPLACE, and REMOVE statements together in the same DATA step, there are interactions that can affect the outcome of the DATA step execution. Some restrictions and rules must apply in order to maintain proper position within the data set and index structure and to control the behavior of the DATA step execution. Listed below are descriptions of restrictions and behavior that apply when the OUTPUT, REMOVE and REPLACE statements are used in conjunction with one another. Use of the OUTPUT, REPLACE and REMOVE Statements
Below are the examples from a paper presented at SUGI 17 to illustrate the rules governing support statement interaction. The paper entitled "DATA Step Changes and Enhancements", by Jeff Polzin, is an excellent reference for the MODIFY statement. Example 1: No OUTPUT, REMOVE or REPLACE statements specified.
data a b;
modify b;
<< Included here would be additional DATA step code >>
<< that modifies the data in B and generates the >>
<< data to be written to new or replacement data file A >>
run;
In the above example, a default OUTPUT statement for data file A and a default REPLACE statement for data set B are generated for each iteration of the DATA step. These statements are executed implicitly at the bottom of the DATA step program. Example 2: REMOVE statement is specified explicitly.
data a b;
modify b;
<< Additional code to change data in B >>
if <some condition> then
REMOVE b;
run;
By explicitly specifying the REMOVE statement, no default OUTPUT statement is generated for data set A and no default REPLACE statement is generated for B. In this case, a REMOVE is executed only when the condition is true. The net result is that A would contain no observations and no observations in B would be rewritten. Observations in B would only be deleted when the condition is true. To force execution of the two default statements OUTPUT and REPLACE, they must be specified explicitly as in Example 3 below. Example 3: OUTPUT and REPLACE statements are specified explicitly
data a b;
modify b;
<< Change data in data set B >>
if <some condition> then
remove b;
else if <some other condition> then
output a;
else replace b;
run;
Automatic Variable _IORC_ and Macro %SYSRC _IORC_ is an automatic DATA step variable that is created whenever the MODIFY statement appears in a DATA step. NOTE: The _IORC_ variable is also generated whenever the KEY= option is specified on the SET statement and during DATA step VSAM processing. However, these two occurrences of _IORC_ are beyond the scope of this article and therefore will not be discussed. _IORC_ provides return codes to the executing DATA step that describe abnormal I/O conditions. Checking for the occurence of abnormal I/O conditions allows the application to control them and direct execution down specific code paths versus having the application terminate abnormally. Below is a table of abnormal conditions for MODIFY, REPLACE, and REMOVE statement execution that provide non-zero return codes for _ IORC_. Included in the table are the mnemonics that represent the specific abnormal condition. DATA step code Condition Mnemonic
MODIFY with POINT= no matching observation _DSENOM
MODIFY with KEY= no matching observation _DSENOM
MODIFY with BY statement no matching master observation _DSENMR
MODIFY with BY statement multiple transaction data set
observations do not exist on
the master data set _DSEMTR
REPLACE statement executed on non-current observation _SENOTRD
REMOVE statement executed on non-current observation _SENOTRD
A non-current observation condition is defined as a situation where the SAS System determines that either a MODIFY statement has not been executed prior to encountering a REPLACE or REMOVE statement for the current observation, or a MODIFY statement executed but the matching specifications resulted in a read failure and therefore no current observation exists to update or delete. _IORC_ should be used in conjunction with the autocall macro %SYSRC. Since _IORC_ values are subject to change in future releases of the SAS System, %SYSRC allows you to specify a mnemonic name that describes a potential outcome of an I/O operation. Each mnemonic describes exactly one specific abnormal condition. The mnemonics provide an easy method for DATA step programs to test for potential problems involving MODIFY statement execution. The macro expands the specified mnemonic into the corresponding numeric value for _IORC_ . The %SYSRC macro and associated mnemonic values for MODIFY statement execution are documented in Chapter 42 of SAS Technical Report P-222, "Changes and Enhancements to BASE SAS Software". A complete list of all mnemonics, their correspondinq numeric values for _IORC_ , and the I/O conditions they describe are contained in the SYSRC member of the AUTOCALL macro library (prefix.AUTOLIB) shipped to all Release 6.07 MVS, CMS, and VMS sites by SAS Institute. The example below uses several mnemonics to check for specific I/O conditions.
data invty.stock;
set receive;
modify invty.stock key=partno;
select (_iorc_);
when (%sysrc(_sok)) do; /* Match located, rewrite obs. */
<< Code to change some data in STOCK >>
REPLACE;
end;
when (%sysrc(_dsenom)) do; /* No match, append obs. */
OUTPUT;
_error_=0;
end;
otherwise do; /* An unexpected I/O condition */
<< Code to print error message >>
_error_=0
stop;
end;
end;
run;
Mnemonic _SOK returns a zero return code when used with %SYSRC and defines a condition indicating successful execution of the MODIFY statement, here using direct access via an index, while mnemonic _DSENOM describes the condition where no matching observation was located in the master data set. Once the RECEIVE data set is read to obtain values for the PARTNO index, the MODIFY execution uses the values to locate observations for update within the STOCK data set. The SELECT structure allows the application to take alternate code paths based on the specific I/O condition that exists following the execution of the MODIFY statement. If the observation was successfully retrieved for update ( _SOK implies _IORC_=0), then changes are made and REPLACE is executed. If there is no matching observation for the specified index value of PARTNO, _DSENOM, then the current information is appended as a new observation to data set STOCK. For any other I/O condition that we have not explicitly checked for, an error message is printed and execution is halted. The ability to check for specific I/O conditions and control them provides a powerful tool for DATA step applications. Additional Applications Using MODIFY The MODIFY statement is a powerful tool which can be used to provide new capabilities and flexibility to DATA step applications. This section illustrates several of the capabilities. The following example shows how to use the MODIFY statement to mimic PROC APPEND.
data a;
if 0 then modify a;
set b;
output;
run;
This DATA step reads observations from B and appends them to A without reading any observations from A. The key here is that MODIFY is never executed because the condition zero is never true. At compile time, the MODIFY statement is recognized and therefore support for all of its features is provided. The overhead of the read operation that would normally precede the append operation is eliminated because MODIFY is never actually executed. As presented, the results are identical to PROC APPEND. This next example uses information from a non-SAS file to update an existing SAS data file.
data a; /* A contains two numeric variables, X and Y */
infile 'external.data.file';
input fieldl field2;
modify a;
x = fieldl;
y = field2;
run;
The point to emphasize here is that any external source of data that can be accessed by the SAS System can be used to supply update information to the MODIFY statement execution. FIELDl and FIELD2 are not added to data set A, their values are used to update the existing values for variables X and Y. MODIFY Statement Usage Notes Advanced use of the MODIFY statement requires a thorough knowledge of how DATA step execution behaves in specific instances. This section attempts to present some cases where execution results may differ from what is expected by the programmer. The following example illustrates that while any variable can be used during the DATA step execution, new variables cannot be added and existing variables cannot be dropped from data sets that are operated on by MODIFY. Consider the results.
data a; /* A has only one variable, X */
modify a;
y=x;
run;
The result of the above execution is that nothing changes with respect to the contents of data set A. The DATA step executes and reports that observations are rewritten but the replaced observations are identical to the observations read initially by the MODIFY statement. The variable Y does not exist on data set A and it cannot be added without creating a new data set. To change the data set's descriptor information and hence alter the program data vector that is created at execution time for the data set, use the SET statement. Potential Problems Associated with the MODIFY Statement A potential problem that is of most concern applies to DATA step executions using the MODIFY statement that terminate abnormally due to system failure or some other unrecoverable error condition. In cases of abnormal termination, the data set being modified can be permanently damaged. The damage occurs because the MODIFY statement opens the original data set for update versus writing out a new, updated copy of the data set being processed. Presented next is an example to help programmers avoid a looping condition that can occur in applications which use the MODIFY statement. The described looping condition does not produce an error message and is therefore difficult to detect.
data a; /* A has only one variable, Y, and one observation */
modify a;
y='new value';
output;
run;
The result in the above case is a looping DATA step that continues to execute until either the WORK library fills up or until a time out condition arises. The loop occurs because after the first and only observation is read, that observation is then appended to the end of the data set. This, now the second, observation is read during the next DATA step iteration and it is subsequently appended as the third observation. The process continues, resulting in a loop, as the DATA step iterates until one of the above mentioned conditions arises. If the explicit OUTPUT statement were removed, the DATA step would iterate only once. A default REPLACE statement would be generated and executed at the bottom of the DATA step to replace the original value of Y with the new value. If a specific number of observations are to be appended to data set A, say ten, then the following statement placed after the OUTPUT statement avoids the loop. if _n_ eq 10 then stop; The last example stresses how a MODIFY statement execution affects the concept of the most recently created SAS data set. Examine the following code.
data a;
x=l;
strl='text 1';
run;
data b;
x=1;
str2='text 2';
run;
data a;
modify a b;
by x;
strl=str2:
run;
proc print; /* Which data set is printed? */
run;
The example above concludes with the execution of PROC PRINT. Since the DATA= option is not specified, the PRINT procedure prints the contents of the most recently created data set - data set B in this case, not A. This occurs because data set A is modified and not created anew. If a SET statement is used in place of the MODIFY statement, then the contents of data set A are printed. Consider also, if another data set, say C, is specified on the DATA statement along with A then default OUTPUT statements are generated for data set C. In this case, the PRINT procedure prints the contents of data set C which contains three variables (X, STRl, and STR2) and a number of observations equal to that of data set B. Conclusion The MODIFY statement is indeed a powerful tool available to programmers using Release 6.07 of the SAS System under MVS, CMS, and VMS. Several of the restrictions inherent to processing data sets with SET, MERGE, and UPDATE do not apply to the MODIFY statement. In addition, new functionality is introduced into the DATA step with the ability to update existing SAS data files in place. The goal of this article is to provide a fundamental understanding of the different access methods associated with MODIFY statement processing, and the statements supported by the MODIFY statement. With this information, applications programmers can begin to incorporate new methods for updating SAS data files and the data defined by SAS access view descriptors into new and existing applications. To supplement the information contained in the article, specific examples that use the MODIFY statement and target problem areas are presented. It is intended that programmers use the contents of this article to begin to take advantage of the MODIFY statement in their DATA step programs. |