The HPSUMMARY Procedure

Results

Missing Values

PROC HPSUMMARY excludes missing values for the analysis variables before calculating statistics. Each analysis variable is treated individually; a missing value for an observation in one variable does not affect the calculations for other variables. PROC HPSUMMARY handles missing values as follows:

  • If a classification variable has a missing value for an observation, then PROC HPSUMMARY excludes that observation from the analysis unless you use the MISSING option in the PROC statement or CLASS statement.

  • If a FREQ variable value is missing or nonpositive, then PROC HPSUMMARY excludes the observation from the analysis.

  • If a WEIGHT variable value is missing, then PROC HPSUMMARY excludes the observation from the analysis.

PROC HPSUMMARY tabulates the number of missing values. Before the number of missing values is tabulated, PROC HPSUMMARY excludes observations with nonpositive frequencies when you use the FREQ statement, and observations with missing weights (or with nonpositive weights, if you also specify the EXCLNPWGT option) when you use the WEIGHT statement. To report this information in the procedure output, use the NMISS statistic-keyword in the PROC HPSUMMARY statement.
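The exclusion rules above can be sketched outside SAS. The following Python fragment is an illustration only, not the procedure's implementation: the function names, the dictionary-based observations, and the use of None or NaN as a stand-in for a SAS missing value are all assumptions. It counts rows and does not model how FREQ values multiply counts.

```python
import math

def is_missing(value):
    """Treat None and NaN as missing (a stand-in for a SAS missing value)."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def keep_observation(obs, class_vars=(), freq_var=None, weight_var=None,
                     missing_option=False, exclnpwgt=False):
    """Return True if the observation survives the exclusion rules."""
    # Missing classification value: excluded unless MISSING is in effect.
    if not missing_option and any(is_missing(obs[c]) for c in class_vars):
        return False
    # FREQ value missing or nonpositive: excluded.
    if freq_var is not None:
        f = obs[freq_var]
        if is_missing(f) or f <= 0:
            return False
    # WEIGHT value missing: excluded; nonpositive only with EXCLNPWGT.
    if weight_var is not None:
        w = obs[weight_var]
        if is_missing(w) or (exclnpwgt and w <= 0):
            return False
    return True

def n_and_nmiss(rows, analysis_vars, **rules):
    """Count nonmissing (N) and missing (NMISS) values per analysis variable."""
    kept = [r for r in rows if keep_observation(r, **rules)]
    return {
        v: {"N": sum(not is_missing(r[v]) for r in kept),
            "NMISS": sum(is_missing(r[v]) for r in kept)}
        for v in analysis_vars
    }

# Hypothetical data: the third row has a missing classification value.
rows = [
    {"grp": "x", "height": 60.0, "score": None},
    {"grp": "x", "height": None, "score": 110.0},
    {"grp": None, "height": 65.0, "score": 120.0},
]
print(n_and_nmiss(rows, ["height", "score"], class_vars=("grp",)))
```

Each analysis variable is counted separately, so the missing height in the second row does not remove that row's score from the count.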

The N Obs Statistic

By default, when you use a CLASS statement, PROC HPSUMMARY displays an additional statistic called N Obs. This statistic reports the total number of observations, or the sum of the values of the FREQ variable, that PROC HPSUMMARY processes for each class level. PROC HPSUMMARY might omit observations from this total because of missing values in one or more classification variables. Because of this action and the exclusion of observations when the weight variable (specified in the WEIGHT statement or in the WEIGHT= option in the VAR statement) contains missing values, there is not always a direct relationship between N Obs, N, and NMISS.

In the output data set, the value of N Obs is stored in the _FREQ_ variable.

Output Data Set

PROC HPSUMMARY creates one output data set. The procedure does not print the output data set. Use the PRINT procedure, the REPORT procedure, or another SAS reporting tool to display the output data set.

Note: By default, the statistics in the output data set automatically inherit the analysis variable’s format and label. However, statistics computed for N, NMISS, SUMWGT, USS, CSS, VAR, CV, T, PROBT, PRT, SKEWNESS, and KURTOSIS do not inherit the analysis variable’s format because this format can be invalid for these statistics.

The output data set can contain these variables:

  • the variables specified in the CLASS statement.

  • the variable _TYPE_ that contains information about the classification variables. By default _TYPE_ is a numeric variable. If you specify CHARTYPE in the PROC statement, then _TYPE_ is a character variable. When you use more than 32 classification variables, _TYPE_ is automatically a character variable.

  • the variable _FREQ_ that contains the number of observations that a given output level represents.

  • the variables requested in the OUTPUT statement that contain the output statistics and extreme values.

  • the variable _STAT_ that contains the names of the default statistics if you omit statistic keywords.

The value of _TYPE_ indicates which combination of the classification variables PROC HPSUMMARY uses to compute the statistics. The character value of _TYPE_ is a series of zeros and ones, where each value of one indicates an active classification variable in the type. For example, with three classification variables, PROC HPSUMMARY represents type 1 as 001, type 5 as 101, and so on.
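The mapping between a numeric _TYPE_ value and its character form is an ordinary zero-padded binary conversion. The short Python sketch below illustrates it; the function names are hypothetical, and the left-to-right variable order (C, B, A) is taken from the column order of Table 9.8.

```python
def chartype(type_value, k):
    """Zero-padded binary string of _TYPE_ for k classification variables."""
    return format(type_value, "0{}b".format(k))

def active_variables(type_value, columns):
    """Names whose bit is 1; `columns` lists the variables left to right."""
    bits = chartype(type_value, len(columns))
    return [name for name, bit in zip(columns, bits) if bit == "1"]

print(chartype(1, 3))                        # 001
print(chartype(5, 3))                        # 101
print(active_variables(5, ("C", "B", "A")))  # ['C', 'A']
```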

Usually, the output data set contains one observation per level per type. However, if you omit statistic-keywords in the OUTPUT statement, then the output data set contains five observations per level (six if you specify a WEIGHT variable). Therefore, the total number of observations in the output data set is equal to the sum of the levels for all the types that you request multiplied by 1, 5, or 6, whichever is applicable.
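As a worked illustration of this count, the following Python sketch multiplies the summed levels by the applicable factor. The function names and the example level counts are hypothetical.

```python
def observations_per_level(has_statistic_keywords, has_weight=False):
    """1 with statistic keywords; otherwise 5, or 6 with a WEIGHT variable."""
    if has_statistic_keywords:
        return 1
    return 6 if has_weight else 5

def total_output_observations(levels_per_type, has_statistic_keywords,
                              has_weight=False):
    """Sum the levels over all requested types, then apply the multiplier."""
    m = observations_per_level(has_statistic_keywords, has_weight)
    return m * sum(levels_per_type.values())

# Hypothetical request: four types with 1, 2, 3, and 6 levels.
levels = {0: 1, 1: 2, 2: 3, 3: 6}
print(total_output_observations(levels, True))         # 12
print(total_output_observations(levels, False))        # 60
print(total_output_observations(levels, False, True))  # 72
```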

If you omit the CLASS statement (_TYPE_=0), then there is always exactly one level of output per output data set. If you use a CLASS statement, then the number of levels for each type that you request has an upper bound equal to the number of observations in the input data set. By default, PROC HPSUMMARY generates all possible types. In this case, the total number of levels for each output data set has an upper bound equal to

\[  m \cdot (2^k - 1) \cdot n + 1  \]

where $k$ is the number of classification variables, $n$ is the number of observations in the input data set, and $m$ is 1, 5, or 6.
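To get a feel for how quickly this bound grows, the expression can be evaluated directly. The Python function below is a direct transcription of the formula; its name is hypothetical.

```python
def max_output_levels(k, n, m):
    """Upper bound m * (2**k - 1) * n + 1 from the formula above."""
    if m not in (1, 5, 6):
        raise ValueError("m must be 1, 5, or 6")
    return m * (2 ** k - 1) * n + 1

print(max_output_levels(k=3, n=100, m=1))  # 701
print(max_output_levels(k=3, n=100, m=5))  # 3501
```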

PROC HPSUMMARY determines the actual number of levels for a given type from the number of unique combinations of each active classification variable. A single level consists of all input observations whose formatted class values match.
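The idea of counting levels by distinct formatted values can be sketched as follows. This Python fragment is an illustration only: the function name, the sample rows, and the age-group format are assumptions, and a plain Python function stands in for a SAS format.

```python
def count_levels(rows, active, formats=None):
    """Number of distinct combinations of formatted values of `active`."""
    formats = formats or {}
    seen = set()
    for row in rows:
        # Variables without a format fall back to their string form.
        key = tuple(formats.get(v, str)(row[v]) for v in active)
        seen.add(key)
    return len(seen)

def age_group(age):
    """Hypothetical format that collapses ages into two groups."""
    return "child" if age < 18 else "adult"

rows = [
    {"sex": "F", "age": 12},
    {"sex": "F", "age": 13},
    {"sex": "M", "age": 35},
]
print(count_levels(rows, ["age"]))                             # 3
print(count_levels(rows, ["age"], {"age": age_group}))         # 2
print(count_levels(rows, ["sex", "age"], {"age": age_group}))  # 2
```

With no active classification variables, every row maps to the same empty key, so the count is 1, which matches the single level for _TYPE_=0.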

Table 9.8 shows the values of _TYPE_ and the number of observations in the data set when you specify one, two, and three classification variables.

Table 9.8: The Effect of Classification Variables on the OUTPUT Data Set

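CLASS Variables
C  B  A  _WAY_  _TYPE_  Subgroup    Number of observations   Total number of
                        defined by  of this _TYPE_ and       observations in
                                    _WAY_ in the data set    the data set
-------  -----  ------  ----------  -----------------------  -------------------------
0  0  0  0      0       Total       1
0  0  1  1      1       A           a                        1+a
0  1  0  1      2       B           b
0  1  1  2      3       A*B         a*b                      1+a+b+a*b
1  0  0  1      4       C           c
1  0  1  2      5       A*C         a*c
1  1  0  2      6       B*C         b*c
1  1  1  3      7       A*B*C       a*b*c                    1+a+b+a*b+c+a*c+b*c+a*b*c

The C, B, and A columns together form the character binary equivalent of _TYPE_ (CHARTYPE option in the PROC HPSUMMARY statement).

Each entry in the last column is the cumulative total through that row: 1+a with one classification variable (A), 1+a+b+a*b with two (A and B), and the final entry with three (A, B, and C).

a, b, c = number of levels of A, B, C, respectively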
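The counts in Table 9.8 can be reproduced for concrete values of a, b, and c. The Python sketch below is an illustration under the table's own assumption that every combination of levels occurs in the data; the function name and the example level counts are hypothetical.

```python
from math import prod

def type_table(levels):
    """Rows of (_TYPE_, _WAY_, subgroup, observations) for all types.

    `levels` maps each classification variable to its number of levels,
    listed in the column order of Table 9.8 (C, B, A).
    """
    names = list(levels)
    k = len(names)
    rows = []
    for type_value in range(2 ** k):
        bits = format(type_value, "0{}b".format(k))
        active = [n for n, bit in zip(names, bits) if bit == "1"]
        # The product over no variables is 1: the overall total row.
        count = prod(levels[n] for n in active)
        label = "*".join(reversed(active)) or "Total"
        rows.append((type_value, len(active), label, count))
    return rows

table = type_table({"C": 4, "B": 3, "A": 2})
for row in table:
    print(row)
print(sum(count for _, _, _, count in table))  # 60
```

With a=2, b=3, and c=4, the total is 60, which equals (1+a)(1+b)(1+c), the factored form of the sum in the last row of the table.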