Working with Character Variables |
Extracting a Portion of a Character Value |
Some character values may contain multiple pieces of information that need to be isolated and assigned to separate character variables. For example, the value of ArrivalDepartureGates contains two cities: the city of arrival and the city of departure. How can the individual values be isolated so that separate variables can be created for the two cities?
The SCAN function returns a character string when it is given the source string, the position of the desired character string, and a character delimiter:
SCAN (source,n<,list-of-delimiters>) |
The source is the value that you want to examine. It can be any kind of character expression, including character variables, character constants, and so on. The n is the position of the term to be selected from the source. The list-of-delimiters can list one, multiple, or no delimiters. If you specify more than one delimiter, then SAS uses any of them; if you omit the delimiter, then SAS divides words according to a default list of delimiters (including the blank and some special characters).
For example, to select the first term in the value of ArrivalDepartureGates and assign it to a new variable named ArrivalGate, write
ArrivalGate = scan(ArrivalDepartureGates,1,',');
The SCAN function examines the value of ArrivalDepartureGates and selects the first string as identified by a comma.
Although default values can be used for the delimiter, it is a good idea to specify the delimiter to be used. If the default delimiter is used in the SCAN function when the observation for Brazil is processed, then SAS recognizes a blank space as the delimiter and selects Rio rather than Rio de Janeiro as the first term. Specifying the delimiter enables you to control where the division of the term occurs.
To select the second term from ArrivalDepartureGates and assign it to a new variable term named DEPARTUREGATE, specify the following:
DepartureGate = scan(ArrivalDepartureGates,2,',');
Note: The default length of a target variable where the expression contains the SCAN function is 200 bytes.
Remember that SAS maintains the existing alignment of a character value used in an expression; it does not perform any automatic realignment. This example creates the values for a new variable DepartureGate from the values of ArrivalDepartureGates. The value of ArrivalDepartureGates contains a comma and a blank between the two city names as shown in the following output:
Dividing Values into Separate Words Using the SCAN Function
Data Set AIR.DEPARTURES 1 Cities Obs Country InTour USGate ArrivalDepartureGates 1 Japan 5 San Francisco Tokyo, Osaka 2 Italy 8 New York Rome, Naples 3 Australia 12 Honolulu Sydney, Brisbane 4 Venezuela 4 Miami Caracas, Maracaibo 5 Brazil 4 Rio de Janeiro, BelemWhen the SCAN function divides the names at the comma, the second term begins with a blank; therefore, all the values that are assigned to DepartureGate begin with a blank.
To left-align the values, use the LEFT function:
LEFT (source) |
The LEFT function produces a value that has all leading blanks in the source moved to the right side of the value; therefore, the result is left aligned. The source can be any kind of character expression, including a character variable, a character constant enclosed in quotation marks, or another character function.
This example uses the LEFT function in the second assignment statement:
DepartureGate = scan(ArrivalDepartureGates,2,','); DepartureGate = left(DepartureGate);
You can also nest the two functions:
DepartureGate = left(scan(ArrivalDepartureGates,2,','));
When you nest functions, SAS performs the action in the innermost function first. It uses the result of that function as the argument of the next function, and so on.
The following DATA step creates separate variables for the arrival gates and the departure gates:
options pagesize=60 linesize=80 pageno=1 nodate; data gates; set mylib.departures; ArrivalGate = scan(ArrivalDepartureGates,1,','); DepartureGate = left(scan(ArrivalDepartureGates,2,',')); run; proc print data=gates; var Country ArrivalDepartureGates ArrivalGate DepartureGate; title 'Arrival and Departure Gates'; run;
The following output displays the results:
Dividing Values into Separate Words with the SCAN Function
Arrival and Departure Gates 1 Departure Obs Country ArrivalDepartureGates ArrivalGate Gate 1 Japan Tokyo, Osaka Tokyo Osaka 2 Italy Rome, Naples Rome Naples 3 Australia Sydney, Brisbane Sydney Brisbane 4 Venezuela Caracas, Maracaibo Caracas Maracaibo 5 Brazil Rio de Janeiro, Belem Rio de Janeiro Belem
The SCAN function causes SAS to assign a length of 200 bytes to the target variable in an assignment statement. Most of the other character functions cause the target to have the same length as the original value. In the data set GATELENGTH, the variable ArrivalGate has a length of 200 because the SCAN function creates it. The variable DepartureGate also has a length of 200 because the argument of the LEFT function contains the SCAN function.
Setting the lengths of ArrivalGate and DepartureGate to the needed values rather than to the default length saves a lot of storage space. Because SAS sets the length of a character variable the first time SAS encounters it, the LENGTH statement must appear before the assignment statements that create values for the variables:
data gatelength; length ArrivalGate $ 14 DepartureGate $ 9; set mylib.departures; ArrivalGate = scan(ArrivalDepartureGate,1,','); DepartureGate = left(scan(ArrivalDepartureGate,2,',')); run;
Combining Character Values: Using Concatenation |
SAS enables you to combine character values into longer ones using an operation known as concatenation. Concatenation combines character values by placing them one after the other and assigning them to a variable. In SAS programming, the concatenation operator is a pair of vertical bars (||). If your keyboard does not have a solid vertical bar, use two broken vertical bars (¦¦) or two exclamation points (!!). The length of the new variable is the sum of the lengths of the pieces or number of characters that is specified in a LENGTH statement for the new variable. Concatenation is illustrated in the following figure:
The following statement combines all the cities named as gateways into a single variable named AllGates:
AllGates = USGate || ArrivalDepartureGates;
SAS attaches the beginning of each value of ArrivalDepartureGates to the end of each value of USGate and assigns the results to AllGates. The following DATA step includes this statement:
/* first try */ options pagesize=60 linesize=80 pageno=1 nodate; data all; set mylib.departures; AllGates = USGate || ArrivalDepartureGates; run; proc print data=all; var Country USGate ArrivalDepartureGates AllGates; title 'All Tour Gates'; run;
The following output displays the results:
Simple Concatenation: Interior Blanks Not Removed
All Tour Gates 1 Obs Country USGate ArrivalDepartureGates 1 Japan San Francisco Tokyo, Osaka 2 Italy New York Rome, Naples 3 Australia Honolulu Sydney, Brisbane 4 Venezuela Miami Caracas, Maracaibo 5 Brazil Rio de Janeiro, Belem Obs AllGates 1 San FranciscoTokyo, Osaka 2 New York Rome, Naples 3 Honolulu 1 Sydney, Brisbane 4 Miami Caracas, Maracaibo 5 2 Rio de Janeiro, Belem
Why, in the previous output, does
the beginning of AllGates in the Brazil observation contain blanks? |
When a character value is shorter than the length of the variable to which it belongs, SAS pads the value with trailing blanks. The length of USGate is 13 bytes, but only San Francisco uses all of them. Therefore, the other values contain blanks at the end, and the value for Brazil is entirely blank. SAS concatenates USGate and ArrivalDepartureGates without change; therefore, the middle of AllGates contains blanks for most observations. Most of the values of ArrivalDepartureGates also contain trailing blanks. If you concatenate another variable such as Country to ArrivalDepartureGates, you will see the trailing blanks in ArrivalDepartureGates.To eliminate trailing blanks, use the TRIM function:
TRIM (source) |
Note: Other rules about trailing blanks in SAS still apply. If the trimmed result is shorter than the length of the variable to which the result is assigned, SAS pads the result with new blanks as it makes the assignment.
To eliminate the trailing blanks in USGate from AllGates, add the TRIM function to the expression:
AllGate2 = trim(USGate) || ArrivalDepartureGates;
The following program adds this statement to the DATA step:
/* removing interior blanks */ options pagesize=60 linesize=80 pageno=1 nodate; data all2; set mylib.departures; AllGate2 = trim(USGate) || ArrivalDepartureGates; run; proc print data=all2; var Country USGate ArrivalDepartureGates AllGate2; title 'All Tour Gates'; run;
The following output displays the results:
Removing Blanks with the TRIM Function
All Tour Gates 1
Obs Country USGate ArrivalDepartureGates AllGate2
1 Japan San Francisco Tokyo, Osaka San FranciscoTokyo, Osaka
2 Italy New York Rome, Naples New YorkRome, Naples
3 Australia Honolulu Sydney, Brisbane HonoluluSydney, Brisbane
4 Venezuela Miami Caracas, Maracaibo MiamiCaracas, Maracaibo
5 Brazil Rio de Janeiro, Belem Rio de Janeiro, Belem
1
Notice at that the AllGate2 value for Brazil has a blank space before Rio de Janeiro, Belem. When the TRIM function encounters a missing value in the argument, one blank space is returned. In this observation, USGate has a missing value; therefore, one blank space is concatenated with Rio de Janeiro, Belem.
Data set ALL2 shows that removing the trailing blanks from USGate causes all the values of ArrivalDepartureGates to appear immediately after the corresponding values of USGate. To make the result easier to read, you can concatenate a comma and blank between the trimmed value of USGate and the value of ArrivalDepartureGates. Also, to align the AllGate3 value for Brazil with all other values of AllGate3, use an IF-THEN statement to equate the value of AllGate3 with the value of ArrivalDepartureGates in that observation.
AllGate3 = trim(USGate)||', '||ArrivalDepartureGates; if Country = 'Brazil' then AllGate3 = ArrivalDepartureGates;
This DATA step includes these statements:
/* final version */ options pagesize=60 linesize=80 pageno=1 nodate; data all3; set mylib.departures; AllGate3 = trim(USGate)||', '||ArrivalDepartureGates; if Country = 'Brazil' then AllGate3 = ArrivalDepartureGates; run; proc print data=all3; var Country USGate ArrivalDepartureGates AllGate3; title 'All Tour Gates'; run;
The following output displays the results:
Concatenating Additional Characters for Readability
All Tour Gates 1 Obs Country USGate ArrivalDepartureGates AllGate3 1 Japan San Francisco Tokyo, Osaka San Francisco, Tokyo, Osaka 2 Italy New York Rome, Naples New York, Rome, Naples 3 Australia Honolulu Sydney, Brisbane Honolulu, Sydney, Brisbane 4 Venezuela Miami Caracas, Maracaibo Miami, Caracas, Maracaibo 5 Brazil Rio de Janeiro, Belem Rio de Janeiro, Belem
When you concatenate variables, you might see the apparent loss of part of a concatenated value. Earlier in this section, ArrivalDepartureGates was divided into two new variables, ArrivalGate and DepartureGate, each with a default length of 200 bytes. (Remember that when a variable is created by an expression that uses the SCAN function, the variable length is 200 bytes.) For reference, this example re-creates the DATA step:
options pagesize=60 linesize=80 pageno=1 nodate; data gates; set mylib.departures; ArrivalGate = scan(ArrivalDepartureGates,1,','); DepartureGate = left(scan(ArrivalDepartureGates,2,',')); run;
If the variables ArrivalGate and DepartureGate are concatenated, as they are in the next DATA step, then the length of the resulting concatenation is 402 bytes: 200 bytes for each variable and 1 byte each for the comma and the blank space. This example uses the VLENGTH function to show the length of ADGates.
/* accidentally omitting the TRIM function */ options pagesize=60 linesize=80 pageno=1 nodate; data gates2; set gates; ADGates = ArrivalGate||', '||DepartureGate;; ADLength = vlength(ADGates); run; proc print data=gates2; var Country ArrivalDepartureGates ADGates ADLength; title 'All Tour Gates'; run;
The following output displays the results:
Losing Part of a Concatenated Value
All Tour Gates 1 Obs Country ArrivalDepartureGates 1 Japan Tokyo, Osaka 2 Italy Rome, Naples 3 Australia Sydney, Brisbane 4 Venezuela Caracas, Maracaibo 5 Brazil Rio de Janeiro, Belem Obs ADGates 1 Tokyo 2 Rome 3 Sydney 4 Caracas 5 Rio de Janeiro Obs ADLength 1 402 2 402 3 402 4 402 5 402
The concatenated value from DepartureGate appears to be truncated in the output. It has been concatenated after the trailing blanks of ArrivalGate, and it does not appear because the output does not display 402 bytes.
There is a two-step solution to the problem:
The TRIM function can trim the trailing blanks from ArrivalGate, as shown in the preceding section. The significant characters from all three pieces that are assigned to ADGates can then fit in the output.
The length of ADGates remains 402 bytes. The LENGTH statement can assign to the variable a length that is shorter but large enough to contain the significant pieces.
The following DATA step uses the TRIM function and the LENGTH statement to remove interior blanks from the concatenation:
options pagesize=60 linesize=80 pageno=1 nodate; data gates3; length ADGates $ 30; set gates; ADGates = trim(ArrivalGate)||', '||DepartureGate; run; proc print data=gates3; var country ArrivalDepartureGates ADGates; title 'All Tour Gates'; run;
The following output displays the results:
Showing All of a Newly Concatenated Value
All Tour Gates 1 Obs Country ArrivalDepartureGates ADGates 1 Japan Tokyo, Osaka Tokyo, Osaka 2 Italy Rome, Naples Rome, Naples 3 Australia Sydney, Brisbane Sydney, Brisbane 4 Venezuela Caracas, Maracaibo Caracas, Maracaibo 5 Brazil Rio de Janeiro, Belem Rio de Janeiro, Belem
Copyright © 2012 by SAS Institute Inc., Cary, NC, USA. All rights reserved.