Previous Page | Next Page

Creating Subsets of Observations

Selecting Observations for a New SAS Data Set


Deleting Observations Based on a Condition

There are two ways to select specific observations in a SAS data set when creating a new SAS data set:

  1. Delete the observations that do not meet a condition, keeping only the ones that you want.

  2. Accept only the observations that meet a condition.

To delete an observation, first identify it with an IF condition; then use a DELETE statement in the THEN clause:

IF condition THEN DELETE

Processing the DELETE statement for an observation causes SAS to return immediately to the beginning of the DATA step for a new observation without writing the current observation to the output DATA set. The DELETE statement does not include the observation in the output data set, but it does not delete the observation from the input data set. For example, the following statement deletes observations that contain a missing value for LandCost:

if LandCost = . then delete;

The following DATA step includes this statement:

options pagesize=60 linesize=80 pageno=1 nodate;
data remove;
   set mylib.arts;
   if LandCost = . then delete;
;

proc print data=remove;
   title 'Tours With Complete Land Costs';
run;

The following output displays the results:

Deleting Observations That Have a Particular Value

                         Tours With Complete Land Costs                        1

                                          Land               Tour
            Obs    City         Nights    Cost    Budget     Guide

             1     Rome            3       750    Medium    D'Amico
             2     Paris           8      1680    High      Lucas  
             3     London          6      1230    High      Wilson 
             4     Madrid          3       370    Low       Torres 
             5     Amsterdam       4       580    Low              
New York, the observation that is missing a value for LandCost, is not included in the resulting data set, REMOVE.

You can also delete observations as you enter data from an external file. The following DATA step produces the same SAS data set as the REMOVE data set:

options pagesize=60 linesize=80 pageno=1 nodate;
data remove2;
   infile 'input-file' truncover;
   input City $ 1-9 Nights 11 LandCost 13-16 Budget $ 18-23
         TourGuide $ 25-32;
   if LandCost = . then delete;
;

proc print data=remove2;
   title 'Tours With Complete Land Costs';
run;

The following output displays the results:

Deleting Observations While Reading from an External File

                         Tours With Complete Land Costs                        1

                                          Land               Tour
            Obs    City         Nights    Cost    Budget     Guide

             1     Rome            3       750    Medium    D'Amico
             2     Paris           8      1680    High      Lucas  
             3     London          6      1230    High      Wilson 
             4     Madrid          3       370    Low       Torres 
             5     Amsterdam       4       580    Low              

Accepting Observations Based on a Condition

One data set that is needed by the travel agency contains observations for tours that last only six nights. One way to make the selection is to delete observations in which the value of Nights is not equal to 6:

if Nights ne 6 then delete;

A more straightforward way is to select only observations meeting the criterion. The subsetting IF statement selects the observations that you specify. It contains only a condition:

IF condition;

The implicit action in a subsetting IF statement is always the same: if the condition is true, then continue processing the observation; if it is false, then stop processing the observation and return to the top of the DATA step for a new observation. The statement is called subsetting because the result is a subset of the original observations. For example, if you want to select only observations in which the value of Nights is equal to 6, then you specify the following statement:

if Nights = 6;

The following DATA step includes the subsetting IF:

options pagesize=60 linesize=80 pageno=1 nodate;
data subset6;
   set mylib.arts;
   if nights=6;
; 

proc print data=subset6;
   title 'Six-Night Tours';
run;

The following output displays the results:

Selecting Observations with a Subsetting IF Statement

                                Six-Night Tours                                1

                                          Land               Tour
             Obs      City      Nights    Cost    Budget    Guide

              1     London         6      1230     High     Wilson
              2     New York       6         .              Lucas 
Two observations met the criteria for a six-night tour.

Comparing the DELETE and Subsetting IF Statements

The main reasons for choosing between a DELETE statement and a subsetting IF statement are that

One additional situation favors the subsetting IF: it is the safer method if your data has missing or misspelled values. Consider the following situation.

Tradewinds Travel needs a SAS data set of low- to medium-priced tours. Knowing that the values of Budget are Low , Medium , and High , a first thought would be to delete observations with a value of High . The following program creates a SAS data set by deleting observations that have a Budget value of HIGH:

   /* first attempt */
options pagesize=60 linesize=80 pageno=1 nodate;
data lowmed;
   set mylib.arts;
   if upcase(Budget) = 'HIGH' then delete;
; 

proc print data=lowmed;
   title 'Medium and Low Priced Tours';
run;

The following output displays the results:

Producing a Subset by Deletion

                          Medium and Low Priced Tours                          1

                                          Land               Tour
            Obs    City         Nights    Cost    Budget     Guide

             1     Rome            3       750    Medium    D'Amico
             2     New York        6         .              Lucas  
             3     Madrid          3       370    Low       Torres 
             4     Amsterdam       4       580    Low              

The data set LOWMED contains both the tours that you want and the tour to New York. The inclusion of the tour to New York is erroneous because the value of Budget for the New York observation is missing. Using a subsetting IF statement ensures that the data set contains exactly the observations you want. This DATA step creates the subset with a subsetting IF statement:

   /* a safer method */
options pagesize=60 linesize=80 pageno=1 nodate;
data lowmed2;
   set mylib.arts;
   if upcase(Budget) = 'MEDIUM' or upcase(Budget) = 'LOW';
;

proc print data=lowmed2;
   title 'Medium and Low Priced Tours';
run;

The following output displays the results:

Producing an Exact Subset with Subsetting IF

                          Medium and Low Priced Tours                          1

                                          Land               Tour
            Obs    City         Nights    Cost    Budget     Guide

             1     Rome            3       750    Medium    D'Amico
             2     Madrid          3       370    Low       Torres 
             3     Amsterdam       4       580    Low              
The result is a SAS data set with no missing values for Budget.

Previous Page | Next Page | Top of Page