• Print  |
  • Feedback  |

Knowledge Base


TS-278

How to guarentee that all possible values of a class variable will be displayed in your table regardless of whether or not the value exists.

INPUT:

I'm creating a table to show responses to a questionnaire where hotel guests rate the hotel in various categories. The responses range from 1 (Great) to 4 (Poor). I want the structure of the table to look like this:

   +---------+---------------------------------------+
   |         |           Response          |/////////|
   |Category |  Great  |  Good   |   Fair  |//Poor///|
   |Room     |         |         |         |/////////|
   |Staff    |         |         |         |/////////|
   |Food     |         |         |         |/////////|
   |Cost     |         |         |         |/////////|
   +---------+---------+---------+---------+---------+

How can I get PROC TABULATE to produce a column for Poor, even though no responses of 4 were given?

OUTPUT:

You can ensure the column for Poor will be in the output by appending the needed data values to your data set in a DATA step. If you have a number of values that need to be added or if you are using a BY statement, this task can become tricky. However, you can save yourself a lot of time by following a few simple guidelines.

Try Options First

The first thing to consider is whether you can specify an option to add the missing values you desire. The PRINTMISS TABLE option in PROC TABULATE enables you to display all the possible combinations of CLASS variable values that exist in the data set. Thus, if the CLASS variable value is present in one crossing but not another, all you may need to add is the PRINTMISS option to the TABLE statement.

For example, consider the following data and code:

A B

                  1    1
                  3    1
                  3    3

proc tabulate;

      class a b;
      table a*b, n ;
      table a*b, n / printmiss;

run;

The PROC TABULATE step generates two tables. The first table does not have a row for A=1 and B=3 because the crossing does not exist in the data set. In the second table, the PRINTMISS option creates a row for A=1 and B=3 even though the crossing does not exist. Output 1 shows the results:

Output 1: Tables with and without the PRINTMISS Option


             |                 |     N      |
             |-----------------+------------|
             |A       |B       |            |
             |--------+--------|            |
             |1       |1       |        1.00|
             |--------+--------+------------|
             |3       |1       |        1.00|
             |        |--------+------------|
             |        |3       |        1.00|
             --------------------------------

             --------------------------------
             |                 |     N      |
             |-----------------+------------|
             |A       |B       |            |
             |--------+--------|            |
             |1       |1       |        1.00|
             |        |--------+------------|
             |        |3       |           .|
             |--------+--------+------------|
             |3       |1       |        1.00|
             |        |--------+------------|
             |        |3       |        1.00|
             --------------------------------

If the variable value does not exist in your data set, the PRINTMISS option does not help. At this point, you must add the data value to the data set. This is accomplished by creating a dummy data set and then combining it with the original data set. However, the contents of the dummy data set and the way you combine it depend on whether you plan to use a BY statement in the PROC TABULATE step.

Creating a Dummy Data Set

If the value that you want to add to your output does not exist in the data, you need to add it to the data set. For example, the following is a partial listing of a data set named SURVEY, to which you want to add a response value:

MONTH CATEGORY RESPONSE

               SEP         1           3
               SEP         2           2
               SEP         3           1
               SEP         4           2
               AUG         1           1
               AUG         2           2
               AUG         3           3
               AUG         4           1

You want to add data for the value 4 (Poor) in the variable RESPONSE:

MONTH CATEGORY RESPONSE

SEP 1 4

You can easily create this observation with the following DATA step:

data dummy;

      month='SEP';
      category=1;
      response=4;

run;

The DUMMY data set created by this DATA step is used again in "Concatenating without BY Groups," later in this article.

The values used for MONTH and CATEGORY can be any of the values already existing for these variables in the SURVEY data set.

If you need to add multiple values for the same variable or if you want to ensure that all values will be present, you must create a separate observation for each desired value.

If you plan to use a BY statement in the PROC TABULATE step, the dummy data set should not contain any of the variables that you list in the BY statement. The BY variables need to be omitted from the dummy data set to avoid overwriting a true BY variable value when the dummy data set is combined with the valid data set.

These points are demonstrated in the following example code, which creates a dummy data set to ensure that all values of RESPONSE are present and that MONTH is used as a BY variable:

data dummy;

      category=1;
      input response;
      cards;
      1
      2
      3
      4
      ;

This DATA step creates the following data set, which is used later in the section "Interleaving with BY Groups":

CATEGORY RESPONSE

                   1           1
                   1           2
                   1           3
                   1           4

Now that you have successfully created the necessary DUMMY data set, you need to combine it with the original data set, SURVEY. The way you combine the data depends on whether you plan to use a BY statement in the PROC TABULATE step. The following sections describe the two methods you can use.

Concatenating without BY Groups

If you do not plan to use a BY statement in the PROC TABULATE step, you can use the following DATA step to concatenate DUMMY and SURVEY:

data survey2;

      set survey(in=is_valid) dummy;
      if is_valid then valid=1;

run;

Notice that this DATA step creates a new variable named VALID that distinguishes valid observations in the SURVEY data set from "dummy" observations in the DUMMY data set. This is necessary if you plan to base any calculations on frequency counts. A partial listing of observations from the SURVEY2 data set follows. The last observation is from the DUMMY data set.

MONTH CATEGORY RESPONSE VALID

           SEP         1           3         1
           SEP         2           2         1
           SEP         3           1         1
           SEP         4           2         1
           AUG         1           1         1
           AUG         2           2         1
           AUG         3           3         1
           AUG         4           1         1
           SEP         1           4         .

Once you have the final data set, you can use the following PROC TABULATE code to generate the desired table. (SAS system options and user-defined format definitions used in this code appear at the end of the article.)

proc tabulate data=survey2 format=8.;

      class month category response;
      var valid;
      table (month all='All')*category, response*valid=' '*n=' '
            / rts=22 misstext='0';

      title 'Hotel Survey Results for the Fall';
      format month $monf. response respf. category catf.;
      label month='Month' response='Rating' category='Category';

run;

Notice the VALID variable generates the frequency counts. Because the VALID variable is nonmissing on all the original observations and missing on all the dummy observations, the frequency of VALID represents the frequency counts for the original SURVEY data.

The resulting output appears in Output 2:

Output 2: Final Table without BY-Group Processing

Hotel Survey Results for the Fall


|                    |              Rating               |
|                    |-----------------------------------|
|                    | Great  |  Good  |  Fair  |  Poor  |
|--------------------+--------+--------+--------+--------|
|Month    |Category  |        |        |        |        |
|---------+----------|        |        |        |        |
|August   |Room      |       4|       2|       2|       0|
|         |----------+--------+--------+--------+--------|
|         |Staff     |       2|       3|       3|       0|
|         |----------+--------+--------+--------+--------|
|         |Food      |       2|       4|       2|       0|
|         |----------+--------+--------+--------+--------|
|         |Cost      |       1|       4|       3|       0|
|---------+----------+--------+--------+--------+--------|
|September|Room      |       4|       6|       5|       0|
|         |----------+--------+--------+--------+--------|
|         |Staff     |       2|      10|       3|       0|
|         |----------+--------+--------+--------+--------|
|         |Food      |       4|       6|       5|       0|
|         |----------+--------+--------+--------+--------|
|         |Cost      |       5|       6|       4|       0|
|---------+----------+--------+--------+--------+--------|
|All      |Room      |       8|       8|       7|       0|
|         |----------+--------+--------+--------+--------|
|         |Staff     |       4|      13|       6|       0|
|         |----------+--------+--------+--------+--------|
|         |Food      |       6|      10|       7|       0|
|         |----------+--------+--------+--------+--------|
|         |Cost      |       6|      10|       7|       0|

Interleaving with BY Groups

If you plan to use a BY statement in your PROC TABULATE step, you need to append the DUMMY data set to each BY group, because a BY statement forces the procedure to process each BY group separately. Thus, if you append the DUMMY data set to only one BY group, only that group will contain all of the values.

In creating the DUMMY data set, you only need to produce values for the CLASS variables, not the BY variables. Following the guidelines in the earlier section "Creating a DUMMY Data Set", create the following observations to add to SURVEY:

CATEGORY RESPONSE

                   1           1
                   1           2
                   1           3
                   1           4

If the SURVEY data set is not sorted, you need to use PROC SORT before interleaving the DUMMY data set. The following steps sort the SURVEY data set and interleave DUMMY to SURVEY:

proc sort data=survey out=survey;

by month;
run;

data survey2;

      set survey;
      by month;
      valid=1;
      output;
      if last.month then do;
         valid=.;
         do i=1 to maxob;
            set dummy point=i nobs=maxob;
            output;
         end;
      end;

run;

The data set uses LAST.\il variable\rg processing to detect a change in the value of the BY variable. If you have more than one BY variable, use the last variable in the BY list in the LAST.\il variable\rg syntax.

The first OUTPUT statement writes the current observation from SURVEY to SURVEY2. At the end of each BY group, a DO loop writes each observation in the DUMMY data set to SURVEY2. If the BY value does not change, the DO loop is bypassed.

The new variable named VALID distinguishes observations from the SURVEY and DUMMY data sets. VALID must exist in order to compute accurate frequency counts on the values of the CLASS variables.

The SET statement options POINT= and NOBS= make this DATA step dynamic; NOBS= provides the number of observations in DUMMY, and POINT= accesses each observation in DUMMY within the DO loop.

The following example shows the last few observations for each BY group in the SURVEY2 data set. The last observation in each group is from the DUMMY data set.

MONTH CATEGORY RESPONSE VALID

           AUG         1           1         1
           AUG         2           2         1
           AUG         3           3         1
           AUG         4           1         1
           AUG         1           4         .

           SEP         1           3         1
           SEP         2           2         1
           SEP         3           1         1
           SEP         4           2         1
           SEP         1           4         .

Notice that values for VALID are missing in the observations from the DUMMY data set.

The following PROC TABULATE code generates Output 3. Notice that the frequency counts are computed from the analysis variable, VALID. Also, the variable MONTH is now a BY variable rather than another CLASS variable. (All user-written format definitions and SAS system option values used to produce the output appear at the end of this article.)

proc tabulate data=survey2 format=8.;

      by month;
      class category response;
      var valid;
      table category, response*valid=' '*n=' '
            / rts=22 misstext='0';

      title 'Hotel Survey Results For The Fall';
      format month $monf. response respf. category catf.;
      label month='Month' response='Rating'
         category='Category';

run;

Output 3: Final Tables Using BY-Group Processing

Hotel Survey Results for the Fall

  • Month=August ----------------------

|                    |              Rating               |
|                    |-----------------------------------|
|                    | Great  |  Good  |  Fair  |  Poor  |
|--------------------+--------+--------+--------+--------|
|Category            |        |        |        |        |
|--------------------|        |        |        |        |
|Room                |       4|       2|       2|       0|
|--------------------+--------+--------+--------+--------|
|Staff               |       2|       3|       3|       0|
|--------------------+--------+--------+--------+--------|
|Food                |       2|       4|       2|       0|
|--------------------+--------+--------+--------+--------|
|Cost                |       1|       4|       3|       0|

Hotel Survey Results for the Fall

  • Month=September ---------------------

|                    |              Rating               |
|                    |-----------------------------------|
|                    | Great  |  Good  |  Fair  |  Poor  |
|--------------------+--------+--------+--------+--------|
|Category            |        |        |        |        |
|--------------------|        |        |        |        |
|Room                |       4|       6|       5|       0|
|--------------------+--------+--------+--------+--------|
|Staff               |       2|      10|       3|       0|
|--------------------+--------+--------+--------+--------|
|Food                |       4|       6|       5|       0|
|--------------------+--------+--------+--------+--------|
|Cost                |       5|       6|       4|       0|

Options and Formats Used

The following OPTIONS statement and user-defined formats were used in generating the output in this article:

options pagesize=60 linesize=64 nodate nonumber center;

proc format;

      value $monf 'AUG'='August' 'SEP'='September';
      value respf 1='Great' 2='Good'  3='Fair' 4='Poor';
      value catf  1='Room'  2='Staff' 3='Food' 4='Cost';

run;

For more information on PROC TABULATE, refer to the SAS Guide to TABULATE Processing, Second Edition. For more information on the DATA step, refer to SAS Language: Reference, Version 6, First Edition.

Answer provided by Jason Sharpe and Jeff Lopes. Jason is a Technical Support Analyst at SAS Institute, specializing in base procedures and the macro facility. Jeff is an Editor in the Documentation Development Department at SAS Institute.