When programming in SAS, there is almost always more than one way to accomplish a task. Beginning programmers may think that there is no difference between using the WHERE statement and the IF statement to subset your data set. Knowledgeable programmers know that depending on the situation, sometimes one statement is more appropriate than the other. For example, if your subset condition includes automatic variables or new variables created within the DATA step, then you must use the IF statement instead of the WHERE statement. This tip shows you how and when to apply the WHERE and IF statements to get correct and reliable results. It also reviews the similarities as well as the differences between these two SAS programming approaches. Detail differences in program efficiency between the two approaches will not be covered in this tip.
The following code creates the sample data set that is used in the examples. The data set contains test scores of three classes from three students.
data exam; input name $ class $ score ; cards; Tim math 9 Tim history 8 Tim science 7 Sally math 10 Sally science 7 Sally history 10 John math 8 John history 8 John science 9 ; run;
Key Differences between WHERE and IF Conditions to Subset Data Sets
Below is a table that summarizes the key differences between WHERE and IF Conditions. The examples following this table show some of these differences. The source for this table is Sharpening Your SAS Skills, CRC Press (www.crcpress.com), April 2005.
|Subset Data set||WHERE||IF|
|No Difference between WHERE and IF Conditions|
|Using variables in data set||X||X|
|Using SET, MERGE, or UPDATE statement if within the DATA step*||X||X|
|Must use IF Condition|
|Accessing raw data file using INPUT statement||X|
|Using automatic variables such as _N_, FIRST.BY, LAST.BY||X|
|Using newly created variables in data set||X|
|In combination with data set options such as OBS =, POINT = , FIRSTOBS =||X|
|To conditionally execute statement||X|
|Must use WHERE Condition|
|Using special operators such as LIKE or CONTAINS||X|
|Directly using any SAS Procedure||X|
|Using index, if available||X|
|When subsetting as a data set option||X|
|When subsetting using PROC SQL||X|
|Be careful which you use!|
|When merging data sets***||SUBSET BEFORE MERGING||SUBSET AFTER MERGING|
* WHERE condition requires one of these statements if used within the DATA step. In addition, the variable specified in the WHERE condition must exist in all data sets.
** OBS = data set option is compatible with the WHERE statement in SAS version 8.1 and higher. When OBS = is used with the IF statement, SAS first subsets the data set based on the number of observations in the OBS = option and then applies the IF subset condition. When OBS = is used with the WHERE statement, SAS first applies the WHERE subset condition and then restricts the output data set to contain the maximum of observations as specified in the OBS = option.
*** The Colon Modifier (:) works with the IF statement to compare shorter text with longer text.
**** WHERE condition may be more efficient because SAS is not required to read all observations from the input data set.
***** Results may be different depending on the data sets being merged. In general, use the IF condition to subset the data set after merging the data sets.
In the first example, the data set contains both a WHERE and IF statement to subset the data set. The records in the new data set, student1, will contain those records that meet both WHERE and IF conditions.
data student1; set exam; * Can use WHERE condition because NAME variable is a data set variable; * WHERE condition requires all data set variables; where name = 'Tim' or name = 'Sally'; * Create CLASSNUM variable; if class = 'math' then classnum = 1; else if class = 'science' then classnum = 2; else if class = 'history' then classnum = 3; * Use IF condition because CLASSNUM variable was created within the DATA step; if classnum = 2; run; proc print data = student1; run;
As you can see below, both records meet both conditions. Subset condition 1 required Tim's or Sally's scores. Since either the WHERE or IF statement could have been used in condition 1, using the WHERE statement would be more efficient. Subset condition 2 required the CLASSNUM variable to equal 2 or any science score. The IF statement must be used in condition 2 because the CLASSNUM variable is created in the DATA step.
Obs name class score classnum 1 Tim science 7 2 2 Sally science 7 2
In the second example, the exam data set is sorted by the NAME variable. The DATA step uses the IF statement to keep the FIRST.NAME record, because the WHERE statement can not be applied to FIRST.BY or LAST.BY variables. Variables in the WHERE statement must exist in the data set and can not be temporary variables.
proc sort data = exam out=student2; by name; run; data student2; set student2; by name; * Use IF condition because NAME is the BY variable; if first.name; run; proc print data = student2; run;
As you can see below, there is only one record for each student. This meets the subset condition to keep only the first record for each unique value of the student's name. Since the data set stored the math test scores as the first record for each student, this is the only subject in the student2 data set.
Obs name class score 1 John math 8 2 Sally math 7 3 Tim math 9
The third example uses the two sample data sets below to show the difference in using WHERE and IF statements when merging data sets:
data school; data school_data; input name $ class $ score ; input name $ class $ score ; cards; cards; A math 10 A math 10 B history 10 B history 8 C science 10 C science 7 ; ; run; run;
Below are two DATA steps using WHERE or IF statement:
data school_where; data school_if; merge school school_data; merge school school_data; by name; by name; * subsets BEFORE merging; * subsets AFTER merging; where score = 10; if score = 10; run; run;
Since the WHERE statement applies the subset condition before merging the data sets, all records from the school data set are selected and only the first record from the school_data data set is selected. Because of this, all records from the first data set are kept in the school_where data set. In general, you will want to use the IF statement to apply the subset condition after merging the data sets. This approach will first merge the two data sets and then apply the subset condition to result in the first record of both school and school_data sets.
Below are the two different data sets from using WHERE and IF statements. Be careful to use the correct subset method since the results could be very different.
school_where data set school_if data set A math 10 A math 10 B history 10 C science 10
Make sure you apply the following rules when determining which approach to take when subsetting your data set using the DATA step. If your subset condition does not meet the requirements below, then the WHERE and IF statements should produce identical results. For cases such as this, use the WHERE statement since it is more efficient. Note that having both WHERE and IF statements within the same DATA step has a cumulative effect.
About the Author
Sunil Gupta is the principal consultant and trainer at Gupta Programming (www.GuptaProgramming.com). You can find this tip along with other useful techniques in Sunil's latest book, Sharpening Your SAS Skills (available from SAS Press and from www.crcpress.com).
|Product Family||Product||System||SAS Release|
|SAS System||Base SAS||All||n/a|
|Topic:||SAS Reference ==> Statements ==> File-handling ==> MERGE|
SAS Reference ==> Statements ==> File-handling ==> SET
SAS Reference ==> Statements ==> File-handling ==> UPDATE
Common Programming Tasks ==> Conditional Processing
Common Programming Tasks ==> Improving Performance
Data Management ==> Manipulation and Transformation ==> Combining and Modifying Data Sets
SAS Reference ==> DATA Step
SAS Reference ==> Statements ==> Action ==> IF, subsetting
SAS Reference ==> Statements ==> Action ==> WHERE
|Date Modified:||2005-08-25 16:51:47|
|Date Created:||2005-06-01 17:22:13|