Acting on Selected Observations |
Types of Character Comparisons |
Some special situations occur when you make character comparisons. You might need to do the following:
Select all values beginning with a particular group of characters.
Select all values beginning with a particular range of characters.
Find a particular value anywhere within another character value.
Comparing Uppercase and Lowercase Characters |
SAS distinguishes between uppercase and lowercase letters in comparisons. For example, the values Madrid and MADRID are not equivalent. To compare values that may occur in different cases, use the UPCASE function to produce an uppercase value; then make the comparison between two uppercase values, as shown here:
options pagesize=60 linesize=80 pageno=1 nodate; data newguide; set mylib.arttours; if upcase(City) = 'MADRID' then TourGuide = 'Balarezo'; run; proc print data=newguide; var City TourGuide; title 'Tour Guides'; run;
Within the comparison, SAS produces an uppercase version of the value of City and compares it to the uppercase constant MADRID. The value of City in the observation remains in its original case. The following output displays the results:
Data Set Produced by an Uppercase Comparison
Tour Guides 1 Tour Obs City Guide 1 Rome D'Amico 2 Paris Lucas 3 London Wilson 4 New York Lucas 5 Madrid Balarezo 6 Amsterdam
Now Balarezo is assigned as the tour guide for Madrid because the UPCASE function compares the uppercase value of Madrid with the value MADRID. The UPCASE function enables SAS to read the two values as equal.
Selecting All Values That Begin with the Same Group of Characters |
Sometimes you need to select a group of character values, such as all tour guides whose names begin with the letter D.
By default, SAS compares values of different lengths by adding blanks to the end of the shorter value and testing the result against the longer value. In this example,
/* first attempt */ if Tourguide = 'D' then Chosen = 'Yes'; else Chosen = 'No';
SAS interprets the comparison as
TourGuide = 'D '
where D is followed by seven blanks (because TourGuide, a character variable created by column input, has a length of eight bytes). Because the value of TourGuide never consists of the single letter D, the comparison is never true.
To compare a long value to a shorter standard, put a colon (:) after the operator, as in this example:
/* correct method */ if TourGuide =: 'D' then Chosen = 'Yes'; else Chosen = 'No';
The colon causes SAS to compare the same number of characters in the shorter value and the longer value. In this case, the shorter string contains one character; therefore, SAS tests only the first character from the longer value. All names beginning with a D make the comparison true. (If you are not sure that all the values of TourGuide begin with a capital letter, then use the UPCASE function.) The following DATA step selects names beginning with D:
options pagesize=60 linesize=80 pageno=1 nodate; data dguide; set mylib.arttours; if TourGuide =: 'D' then Chosen = 'Yes'; else Chosen = 'No'; run; proc print data=dguide; var City TourGuide Chosen; title 'Guides Whose Names Begin with D'; run;
The following output displays the results:
Selecting All Values That Begin with a Particular String
Guides Whose Names Begin with D 1 Tour Obs City Guide Chosen 1 Rome D'Amico Yes 2 Paris Lucas No 3 London Wilson No 4 New York Lucas No 5 Madrid Torres No 6 Amsterdam No
Selecting a Range of Character Values |
You may want to select values beginning with a range of characters, such as all names beginning with A through L or M through Z. To select a range of character values, you need to understand the following points:
In computer processing, letters have magnitude. A is the smallest letter in the alphabet and Z is the largest. Therefore, the comparison A<B is true; so is the comparison D>C.(footnote 1)
The following statements divide the names of the guides into two groups beginning with A-L and M-Z by combining the comparison operator with the colon:
if TourGuide <=: 'L' then TourGuideGroup = 'A-L'; else TourGuideGroup = 'M-Z';
The following DATA step creates the groups:
options pagesize=60 linesize=80 pageno=1 nodate; data guidegrp; set mylib.arttours; if TourGuide <=: 'L' then TourGuideGroup = 'A-L'; else TourGuideGroup = 'M-Z'; run; proc print data=guidegrp; var City TourGuide TourGuideGroup; title 'Tour Guide Groups'; run;
The following output displays the results:
Selecting All Values Beginning with a Range of Characters
Tour Guide Groups 1 Tour Tour Guide Obs City Guide Group 1 Rome D'Amico A-L 2 Paris Lucas A-L 3 London Wilson M-Z 4 New York Lucas A-L 5 Madrid Torres M-Z 6 Amsterdam A-L
All names beginning with A through L, as well as the missing value, go into group A-L. The missing value goes into that group because a blank is smaller than any letter.
Finding a Value Anywhere within Another Character Value |
A data set is needed that lists tours that visit other attractions in addition to museums and galleries. In the data set MYLIB.ARTTOURS, the variable EventDescription refers to those events as other . However, the position of the word other varies in different observations. How can it be determined that other exists anywhere in the value of EventDescription for a given observation?
The INDEX function determines whether a specified character string (the excerpt) is present within a particular character value (the source):
INDEX (source,excerpt) |
The following statements select observations containing the string other :
if index(EventDescription,'other') > 0 then OtherEvents = 'Yes'; else OtherEvents = 'No';
You can also write the condition as
if index(EventDescription,'other') then OtherEvents = 'Yes'; else OtherEvents = 'No';
The second example uses the fact that any value other than 0 or missing makes the condition true. This statement is included in the following DATA step:
options pagesize=60 linesize=80 pageno=1 nodate; data otherevent; set mylib.arttours; if index(EventDescription,'other') then OtherEvents = 'Yes'; else OtherEvents = 'No'; run; proc print data=otherevent; var City EventDescription OtherEvents; title 'Tour Events'; run;
The following output displays the results:
Finding a Character String within Another Value
Tour Events 1 Other Obs City EventDescription Events 1 Rome 4 M, 3 G No 2 Paris 5 M, 1 other Yes 3 London 3 M, 2 G No 4 New York 5 M, 1 G, 2 other Yes 5 Madrid 3 M, 2 other Yes 6 Amsterdam 3 M, 3 G No
In the observations for Paris and Madrid, the INDEX function returns the value 8 because the string other is found in the eighth field of the variable (5 M, 1 other for Paris and 3 M, 2 other for Madrid). For New York, it returns the value 13 because the string other is found in the thirteenth field of the variable (5 M, 1 G, 2 other). In the remaining observations, the function does not find the string other and returns a 0.
FOOTNOTE 1: The magnitude of letters in the alphabet is true for all operating environments under which SAS runs. Other points, such as whether uppercase or lowercase letters are larger and how to treat numbers in character values, depend on your operating system. For more information about how character values are sorted under various operating environments, see Working with Grouped or Sorted Observations.
Copyright © 2012 by SAS Institute Inc., Cary, NC, USA. All rights reserved.