Previous Page | Next Page

Working with Character Variables

Creating New Character Values


Extracting a Portion of a Character Value


Understanding the SCAN Function

Some character values may contain multiple pieces of information that need to be isolated and assigned to separate character variables. For example, the value of ArrivalDepartureGates contains two cities: the city of arrival and the city of departure. How can the individual values be isolated so that separate variables can be created for the two cities?

The SCAN function returns a character string when it is given the source string, the position of the desired character string, and a character delimiter:

SCAN (source,n<,list-of-delimiters>)

The source is the value that you want to examine. It can be any kind of character expression, including character variables, character constants, and so on. The n is the position of the term to be selected from the source. The list-of-delimiters can list one, multiple, or no delimiters. If you specify more than one delimiter, then SAS uses any of them; if you omit the delimiter, then SAS divides words according to a default list of delimiters (including the blank and some special characters).

For example, to select the first term in the value of ArrivalDepartureGates and assign it to a new variable named ArrivalGate, write

ArrivalGate = scan(ArrivalDepartureGates,1,',');

The SCAN function examines the value of ArrivalDepartureGates and selects the first string as identified by a comma.

Although default values can be used for the delimiter, it is a good idea to specify the delimiter to be used. If the default delimiter is used in the SCAN function when the observation for Brazil is processed, then SAS recognizes a blank space as the delimiter and selects Rio rather than Rio de Janeiro as the first term. Specifying the delimiter enables you to control where the division of the term occurs.

To select the second term from ArrivalDepartureGates and assign it to a new variable term named DEPARTUREGATE, specify the following:

DepartureGate = scan(ArrivalDepartureGates,2,',');

Note:   The default length of a target variable where the expression contains the SCAN function is 200 bytes.  [cautionend]


Aligning New Values

Remember that SAS maintains the existing alignment of a character value used in an expression; it does not perform any automatic realignment. This example creates the values for a new variable DepartureGate from the values of ArrivalDepartureGates. The value of ArrivalDepartureGates contains a comma and a blank between the two city names as shown in the following output:

Dividing Values into Separate Words Using the SCAN Function

                            Data Set AIR.DEPARTURES                            1

                          Cities
      Obs    Country      InTour    USGate           ArrivalDepartureGates

       1     Japan           5      San Francisco    Tokyo, Osaka         
       2     Italy           8      New York         Rome, Naples         
       3     Australia      12      Honolulu         Sydney, Brisbane     
       4     Venezuela       4      Miami            Caracas, Maracaibo   
       5     Brazil          4                       Rio de Janeiro, Belem
When the SCAN function divides the names at the comma, the second term begins with a blank; therefore, all the values that are assigned to DepartureGate begin with a blank.

To left-align the values, use the LEFT function:

LEFT (source)

The LEFT function produces a value that has all leading blanks in the source moved to the right side of the value; therefore, the result is left aligned. The source can be any kind of character expression, including a character variable, a character constant enclosed in quotation marks, or another character function.

This example uses the LEFT function in the second assignment statement:

DepartureGate = scan(ArrivalDepartureGates,2,',');
DepartureGate = left(DepartureGate);

You can also nest the two functions:

DepartureGate = left(scan(ArrivalDepartureGates,2,','));

When you nest functions, SAS performs the action in the innermost function first. It uses the result of that function as the argument of the next function, and so on.

The following DATA step creates separate variables for the arrival gates and the departure gates:

options pagesize=60 linesize=80 pageno=1 nodate;

data gates;
   set mylib.departures;
   ArrivalGate = scan(ArrivalDepartureGates,1,',');
   DepartureGate = left(scan(ArrivalDepartureGates,2,','));
run;

proc print data=gates;
   var Country ArrivalDepartureGates ArrivalGate DepartureGate;
   title 'Arrival and Departure Gates';
run;

The following output displays the results:

Dividing Values into Separate Words with the SCAN Function

                          Arrival and Departure Gates                          1

                                                                   Departure
    Obs    Country      ArrivalDepartureGates    ArrivalGate         Gate

     1     Japan        Tokyo, Osaka             Tokyo             Osaka    
     2     Italy        Rome, Naples             Rome              Naples   
     3     Australia    Sydney, Brisbane         Sydney            Brisbane 
     4     Venezuela    Caracas, Maracaibo       Caracas           Maracaibo
     5     Brazil       Rio de Janeiro, Belem    Rio de Janeiro    Belem    

Saving Storage Space When Using the SCAN Function

The SCAN function causes SAS to assign a length of 200 bytes to the target variable in an assignment statement. Most of the other character functions cause the target to have the same length as the original value. In the data set GATELENGTH, the variable ArrivalGate has a length of 200 because the SCAN function creates it. The variable DepartureGate also has a length of 200 because the argument of the LEFT function contains the SCAN function.

Setting the lengths of ArrivalGate and DepartureGate to the needed values rather than to the default length saves a lot of storage space. Because SAS sets the length of a character variable the first time SAS encounters it, the LENGTH statement must appear before the assignment statements that create values for the variables:

data gatelength;
   length ArrivalGate $ 14 DepartureGate $ 9;
   set mylib.departures;
   ArrivalGate = scan(ArrivalDepartureGate,1,',');
   DepartureGate = left(scan(ArrivalDepartureGate,2,','));
run;


Combining Character Values: Using Concatenation


Understanding Concatenation of Variable Values

SAS enables you to combine character values into longer ones using an operation known as concatenation. Concatenation combines character values by placing them one after the other and assigning them to a variable. In SAS programming, the concatenation operator is a pair of vertical bars (||). If your keyboard does not have a solid vertical bar, use two broken vertical bars (¦¦) or two exclamation points (!!). The length of the new variable is the sum of the lengths of the pieces or number of characters that is specified in a LENGTH statement for the new variable. Concatenation is illustrated in the following figure:

Concatenation of Two Values

[Concatenation of Two Values]


Performing a Simple Concatenation

The following statement combines all the cities named as gateways into a single variable named AllGates:

AllGates = USGate || ArrivalDepartureGates;

SAS attaches the beginning of each value of ArrivalDepartureGates to the end of each value of USGate and assigns the results to AllGates. The following DATA step includes this statement:

   /* first try */
options pagesize=60 linesize=80 pageno=1 nodate;
data all;
   set mylib.departures;
   AllGates = USGate || ArrivalDepartureGates;
run;

proc print data=all;
   var Country USGate ArrivalDepartureGates AllGates;
   title 'All Tour Gates';
run;

The following output displays the results:

Simple Concatenation: Interior Blanks Not Removed

                             All Tour Gates                                  1

           Obs    Country      USGate           ArrivalDepartureGates

            1     Japan        San Francisco    Tokyo, Osaka         
            2     Italy        New York         Rome, Naples         
            3     Australia    Honolulu         Sydney, Brisbane     
            4     Venezuela    Miami            Caracas, Maracaibo   
            5     Brazil                        Rio de Janeiro, Belem

           Obs                 AllGates

            1     San FranciscoTokyo, Osaka         
            2     New York     Rome, Naples         
            3     Honolulu  1   Sydney, Brisbane     
            4     Miami        Caracas, Maracaibo   
            5     2             Rio de Janeiro, Belem

Removing Interior Blanks

Why, in the previous output, does

[1] the middle of AllGates contain blanks?

[2] the beginning of AllGates in the Brazil observation contain blanks?

When a character value is shorter than the length of the variable to which it belongs, SAS pads the value with trailing blanks. The length of USGate is 13 bytes, but only San Francisco uses all of them. Therefore, the other values contain blanks at the end, and the value for Brazil is entirely blank. SAS concatenates USGate and ArrivalDepartureGates without change; therefore, the middle of AllGates contains blanks for most observations. Most of the values of ArrivalDepartureGates also contain trailing blanks. If you concatenate another variable such as Country to ArrivalDepartureGates, you will see the trailing blanks in ArrivalDepartureGates.To eliminate trailing blanks, use the TRIM function:

TRIM (source)
The TRIM function produces a value without the trailing blanks in the source.

Note:   Other rules about trailing blanks in SAS still apply. If the trimmed result is shorter than the length of the variable to which the result is assigned, SAS pads the result with new blanks as it makes the assignment.  [cautionend]

To eliminate the trailing blanks in USGate from AllGates, add the TRIM function to the expression:

AllGate2 = trim(USGate) || ArrivalDepartureGates;

The following program adds this statement to the DATA step:

   /* removing interior blanks */
options pagesize=60 linesize=80 pageno=1 nodate;
data all2;
   set mylib.departures;
   AllGate2 = trim(USGate) || ArrivalDepartureGates;
run;

proc print data=all2;
   var Country USGate ArrivalDepartureGates AllGate2;
   title 'All Tour Gates';
run;

The following output displays the results:

Removing Blanks with the TRIM Function

                                 All Tour Gates                                1

Obs  Country    USGate         ArrivalDepartureGates          AllGate2

 1   Japan      San Francisco  Tokyo, Osaka           San FranciscoTokyo, Osaka
 2   Italy      New York       Rome, Naples           New YorkRome, Naples     
 3   Australia  Honolulu       Sydney, Brisbane       HonoluluSydney, Brisbane 
 4   Venezuela  Miami          Caracas, Maracaibo     MiamiCaracas, Maracaibo  
 5   Brazil                    Rio de Janeiro, Belem   Rio de Janeiro, Belem   
                                                       1 

Notice at [1] that the AllGate2 value for Brazil has a blank space before Rio de Janeiro, Belem. When the TRIM function encounters a missing value in the argument, one blank space is returned. In this observation, USGate has a missing value; therefore, one blank space is concatenated with Rio de Janeiro, Belem.


Adding Additional Characters

Data set ALL2 shows that removing the trailing blanks from USGate causes all the values of ArrivalDepartureGates to appear immediately after the corresponding values of USGate. To make the result easier to read, you can concatenate a comma and blank between the trimmed value of USGate and the value of ArrivalDepartureGates. Also, to align the AllGate3 value for Brazil with all other values of AllGate3, use an IF-THEN statement to equate the value of AllGate3 with the value of ArrivalDepartureGates in that observation.

AllGate3 = trim(USGate)||', '||ArrivalDepartureGates;
if Country = 'Brazil' then AllGate3 = ArrivalDepartureGates;

This DATA step includes these statements:

   /* final version */
options pagesize=60 linesize=80 pageno=1 nodate;
data all3;
   set mylib.departures;
   AllGate3 = trim(USGate)||', '||ArrivalDepartureGates;
   if Country = 'Brazil' then AllGate3 = ArrivalDepartureGates;
run;

proc print data=all3;
   var Country USGate ArrivalDepartureGates AllGate3;
   title 'All Tour Gates';
run;

The following output displays the results:

Concatenating Additional Characters for Readability

                                 All Tour Gates                                1

 Obs Country   USGate        ArrivalDepartureGates          AllGate3

  1  Japan     San Francisco Tokyo, Osaka          San Francisco, Tokyo, Osaka
  2  Italy     New York      Rome, Naples          New York, Rome, Naples     
  3  Australia Honolulu      Sydney, Brisbane      Honolulu, Sydney, Brisbane 
  4  Venezuela Miami         Caracas, Maracaibo    Miami, Caracas, Maracaibo  
  5  Brazil                  Rio de Janeiro, Belem Rio de Janeiro, Belem      

Troubleshooting: When New Variables Appear Truncated

When you concatenate variables, you might see the apparent loss of part of a concatenated value. Earlier in this section, ArrivalDepartureGates was divided into two new variables, ArrivalGate and DepartureGate, each with a default length of 200 bytes. (Remember that when a variable is created by an expression that uses the SCAN function, the variable length is 200 bytes.) For reference, this example re-creates the DATA step:

options pagesize=60 linesize=80 pageno=1 nodate;
data gates;
   set mylib.departures;
   ArrivalGate = scan(ArrivalDepartureGates,1,',');
   DepartureGate = left(scan(ArrivalDepartureGates,2,','));
run;

If the variables ArrivalGate and DepartureGate are concatenated, as they are in the next DATA step, then the length of the resulting concatenation is 402 bytes: 200 bytes for each variable and 1 byte each for the comma and the blank space. This example uses the VLENGTH function to show the length of ADGates.

   /* accidentally omitting the TRIM function */
options pagesize=60 linesize=80 pageno=1 nodate;
data gates2;
   set gates;
   ADGates = ArrivalGate||', '||DepartureGate;;
   ADLength = vlength(ADGates);
run;

proc print data=gates2;
   var Country ArrivalDepartureGates ADGates ADLength;
   title 'All Tour Gates';
run;

The following output displays the results:

Losing Part of a Concatenated Value

          All Tour Gates         1

Obs Country      ArrivalDepartureGates

 1  Japan        Tokyo, Osaka         
 2  Italy        Rome, Naples         
 3  Australia    Sydney, Brisbane     
 4  Venezuela    Caracas, Maracaibo   
 5  Brazil       Rio de Janeiro, Belem

Obs                                   ADGates

 1  Tokyo                                                                       
 2  Rome                                                                        
 3  Sydney                                                                      
 4  Caracas                                                                     
 5  Rio de Janeiro                                                              

Obs ADLength

 1     402  
 2     402  
 3     402  
 4     402  
 5     402  

The concatenated value from DepartureGate appears to be truncated in the output. It has been concatenated after the trailing blanks of ArrivalGate, and it does not appear because the output does not display 402 bytes.

There is a two-step solution to the problem:

  1. The TRIM function can trim the trailing blanks from ArrivalGate, as shown in the preceding section. The significant characters from all three pieces that are assigned to ADGates can then fit in the output.

  2. The length of ADGates remains 402 bytes. The LENGTH statement can assign to the variable a length that is shorter but large enough to contain the significant pieces.

The following DATA step uses the TRIM function and the LENGTH statement to remove interior blanks from the concatenation:

options pagesize=60 linesize=80 pageno=1 nodate;
data gates3;
   length ADGates $ 30; 
   set gates;
   ADGates = trim(ArrivalGate)||', '||DepartureGate;
run;

proc print data=gates3;
   var country ArrivalDepartureGates ADGates;
   title 'All Tour Gates';
run;

The following output displays the results:

Showing All of a Newly Concatenated Value

                                 All Tour Gates                                1

       Obs    Country      ArrivalDepartureGates    ADGates

        1     Japan        Tokyo, Osaka             Tokyo, Osaka         
        2     Italy        Rome, Naples             Rome, Naples         
        3     Australia    Sydney, Brisbane         Sydney, Brisbane     
        4     Venezuela    Caracas, Maracaibo       Caracas, Maracaibo   
        5     Brazil       Rio de Janeiro, Belem    Rio de Janeiro, Belem

Previous Page | Next Page | Top of Page