Acting on Selected Observations

Comparing Characters

Some special situations occur when you make character comparisons. You might need to do the following:

Compare uppercase and lowercase characters.
Select all values beginning with a particular group of characters.
Select all values beginning with a particular range of characters.
Find a particular value anywhere within another character value.

Comparing Uppercase and Lowercase Characters

SAS distinguishes between uppercase and lowercase letters in comparisons. For example, the values Madrid and MADRID are not equivalent. To compare values that may occur in different cases, use the UPCASE function to produce an uppercase value; then make the comparison between two uppercase values, as shown here:

options pagesize=60 linesize=80 pageno=1 nodate;
data newguide;
   set mylib.arttours;
   if upcase(City) = 'MADRID' then TourGuide = 'Balarezo';
run;

proc print data=newguide;
   var City TourGuide;
   title 'Tour Guides';
run;

Within the comparison, SAS produces an uppercase version of the value of City and compares it to the uppercase constant MADRID. The value of City in the observation remains in its original case. The following output displays the results:

Data Set Produced by an Uppercase Comparison

                                  Tour Guides                                  1

                                              Tour
                          Obs    City         Guide

                           1     Rome         D'Amico 
                           2     Paris        Lucas   
                           3     London       Wilson  
                           4     New York     Lucas   
                           5     Madrid       Balarezo
                           6     Amsterdam

Now Balarezo is assigned as the tour guide for Madrid because the UPCASE function compares the uppercase value of Madrid with the value MADRID. The UPCASE function enables SAS to read the two values as equal.

Selecting All Values That Begin with the Same Group of Characters

Sometimes you need to select a group of character values, such as all tour guides whose names begin with the letter D.

By default, SAS compares values of different lengths by adding blanks to the end of the shorter value and testing the result against the longer value. In this example,

   /* first attempt */
if Tourguide = 'D' then Chosen = 'Yes';
else Chosen = 'No';

SAS interprets the comparison as

TourGuide = 'D       '

where D is followed by seven blanks (because TourGuide, a character variable created by column input, has a length of eight bytes). Because the value of TourGuide never consists of the single letter D, the comparison is never true.

To compare a long value to a shorter standard, put a colon (:) after the operator, as in this example:

   /* correct method */
if TourGuide =: 'D' then Chosen = 'Yes';
else Chosen = 'No';

The colon causes SAS to compare the same number of characters in the shorter value and the longer value. In this case, the shorter string contains one character; therefore, SAS tests only the first character from the longer value. All names beginning with a D make the comparison true. (If you are not sure that all the values of TourGuide begin with a capital letter, then use the UPCASE function.) The following DATA step selects names beginning with D:

options pagesize=60 linesize=80 pageno=1 nodate;
data dguide;
   set mylib.arttours;
   if TourGuide =: 'D' then Chosen = 'Yes';
   else Chosen = 'No';
run;

proc print data=dguide;
   var City TourGuide Chosen;
   title 'Guides Whose Names Begin with D';
run;

The following output displays the results:

Selecting All Values That Begin with a Particular String

                        Guides Whose Names Begin with D                        1

                                          Tour
                     Obs    City          Guide     Chosen

                      1     Rome         D'Amico     Yes  
                      2     Paris        Lucas       No   
                      3     London       Wilson      No   
                      4     New York     Lucas       No   
                      5     Madrid       Torres      No   
                      6     Amsterdam                No

Selecting a Range of Character Values

You may want to select values beginning with a range of characters, such as all names beginning with A through L or M through Z. To select a range of character values, you need to understand the following points:

In computer processing, letters have magnitude. A is the smallest letter in the alphabet and Z is the largest. Therefore, the comparison A<B is true; so is the comparison D>C.(footnote 1)
A blank is smaller than any letter.

The following statements divide the names of the guides into two groups beginning with A-L and M-Z by combining the comparison operator with the colon:

if TourGuide <=: 'L' then TourGuideGroup = 'A-L';
   else TourGuideGroup = 'M-Z';

The following DATA step creates the groups:

options pagesize=60 linesize=80 pageno=1 nodate;
data guidegrp;
   set mylib.arttours;
   if TourGuide <=: 'L' then TourGuideGroup = 'A-L';
   else TourGuideGroup = 'M-Z';
run;

proc print data=guidegrp;
   var City TourGuide TourGuideGroup;
   title 'Tour Guide Groups';
run;

The following output displays the results:

Selecting All Values Beginning with a Range of Characters

                               Tour Guide Groups                               1

                                                     Tour
                                           Tour      Guide
                      Obs    City          Guide     Group

                       1     Rome         D'Amico     A-L 
                       2     Paris        Lucas       A-L 
                       3     London       Wilson      M-Z 
                       4     New York     Lucas       A-L 
                       5     Madrid       Torres      M-Z 
                       6     Amsterdam                A-L

All names beginning with A through L, as well as the missing value, go into group A-L. The missing value goes into that group because a blank is smaller than any letter.

Finding a Value Anywhere within Another Character Value

A data set is needed that lists tours that visit other attractions in addition to museums and galleries. In the data set MYLIB.ARTTOURS, the variable EventDescription refers to those events as other . However, the position of the word other varies in different observations. How can it be determined that other exists anywhere in the value of EventDescription for a given observation?

The INDEX function determines whether a specified character string (the excerpt) is present within a particular character value (the source):

INDEX (source,excerpt)

Both source and excerpt can be any kind of character expression, including character strings enclosed in quotation marks, character variables, and other character functions. If excerpt does occur within source, then the function returns the position of the first character of excerpt, which is a positive number. If it does not, then the function returns a 0. By testing for a value greater than 0, you can determine whether a particular character string is present in another character value.

The following statements select observations containing the string other :

if index(EventDescription,'other') > 0 then OtherEvents = 'Yes';
else OtherEvents = 'No';

You can also write the condition as

if index(EventDescription,'other') then OtherEvents = 'Yes';
else OtherEvents = 'No';

The second example uses the fact that any value other than 0 or missing makes the condition true. This statement is included in the following DATA step:

options pagesize=60 linesize=80 pageno=1 nodate;
data otherevent;
   set mylib.arttours;
   if index(EventDescription,'other') then OtherEvents = 'Yes';
   else OtherEvents = 'No';
run;

proc print data=otherevent;
   var City EventDescription OtherEvents;
   title 'Tour Events';
run;

The following output displays the results:

Finding a Character String within Another Value

                                  Tour Events                                  1

                                                         Other
                Obs    City         EventDescription     Events

                 1     Rome         4 M, 3 G              No   
                 2     Paris        5 M, 1 other          Yes  
                 3     London       3 M, 2 G              No   
                 4     New York     5 M, 1 G, 2 other     Yes  
                 5     Madrid       3 M, 2 other          Yes  
                 6     Amsterdam    3 M, 3 G              No

In the observations for Paris and Madrid, the INDEX function returns the value 8 because the string other is found in the eighth field of the variable (5 M, 1 other for Paris and 3 M, 2 other for Madrid). For New York, it returns the value 13 because the string other is found in the thirteenth field of the variable (5 M, 1 G, 2 other). In the remaining observations, the function does not find the string other and returns a 0.

FOOTNOTE 1: The magnitude of letters in the alphabet is true for all operating environments under which SAS runs. Other points, such as whether uppercase or lowercase letters are larger and how to treat numbers in character values, depend on your operating system. For more information about how character values are sorted under various operating environments, see Working with Grouped or Sorted Observations.

Top of Page