Introduction to Categorical Data Analysis Procedures


Introduction

A categorical variable is a variable that assumes only a limited number of discrete values. The measurement scale of a categorical variable is not restricted to any one type. It can be nominal, which means that the observed levels are not ordered. It can be ordinal, which means that the observed levels are ordered in some way. Or it can be interval, which means that the observed levels are ordered and numeric and that an interval of one unit represents the same amount regardless of its location on the scale. One example of a categorical variable is litter size; another is the number of times a subject has been married. A variable that lies on a nominal scale is sometimes called a qualitative or classification variable.

Categorical data result from observations on multiple subjects where one or more categorical variables are observed for each subject. If there is only one categorical variable, then the data are generally represented by a frequency table, which lists each observed value of the variable and its frequency of occurrence.
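
As an illustration of the single-variable case, the following minimal Python sketch builds a frequency table with the standard library. The litter-size values are hypothetical and are not taken from any data set in this documentation.

```python
from collections import Counter

# Hypothetical data: litter size observed for eight subjects.
litter_sizes = [3, 4, 4, 2, 3, 4, 5, 3]

# Count how often each observed value occurs.
frequency_table = Counter(litter_sizes)

# List each observed value with its frequency, in ascending order of value.
for value in sorted(frequency_table):
    print(f"{value}\t{frequency_table[value]}")
```

Only values that actually occur appear in the table, which matches the description of listing each observed value.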

If there are two or more categorical variables, then a subject’s profile is defined as the subject’s observed values for each of the variables. Such categorical data can be represented by a frequency table that lists each observed profile and its frequency of occurrence.
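
A profile can be modeled as a tuple of a subject's observed values, so a frequency table of profiles is a count of identical tuples. The sketch below assumes three hypothetical variables (sex, treatment, response) and invented observations.

```python
from collections import Counter

# Hypothetical subjects; each tuple is one profile: (sex, treatment, response).
subjects = [
    ("F", "drug", "improved"),
    ("F", "drug", "improved"),
    ("M", "drug", "none"),
    ("F", "placebo", "none"),
    ("M", "drug", "none"),
    ("M", "placebo", "improved"),
]

# Each distinct observed profile becomes one entry of the frequency table.
profile_counts = Counter(subjects)

for profile, count in sorted(profile_counts.items()):
    print(profile, count)
```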

If there are exactly two categorical variables, then the data are often represented by a two-dimensional contingency table, which has one row for each level of the first variable and one column for each level of the second variable. The intersections of rows and columns, called cells, correspond to variable profiles, and each cell contains the frequency of occurrence of the corresponding profile.
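
The two-variable layout can be sketched in Python as follows. The helper name `contingency_table` and the observations are hypothetical; the function arranges cell counts with one row per level of the first variable and one column per level of the second.

```python
from collections import Counter

# Hypothetical (row variable, column variable) observations.
observations = [
    ("drug", "improved"),
    ("drug", "improved"),
    ("drug", "none"),
    ("placebo", "none"),
    ("placebo", "none"),
    ("placebo", "improved"),
]

def contingency_table(pairs):
    """Return row levels, column levels, and a matrix of cell frequencies."""
    cell_counts = Counter(pairs)
    row_levels = sorted({row for row, _ in pairs})
    col_levels = sorted({col for _, col in pairs})
    # A Counter returns 0 for profiles that were never observed,
    # so empty cells are filled in automatically.
    table = [[cell_counts[(row, col)] for col in col_levels] for row in row_levels]
    return row_levels, col_levels, table

row_levels, col_levels, table = contingency_table(observations)
```

Unlike a plain frequency table of profiles, the contingency table also shows cells for level combinations that were not observed, with a frequency of zero.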

If there are more than two categorical variables, then the data can be represented by a multidimensional contingency table. There are two commonly used methods for displaying such tables, and both require that the variables be divided into two sets.

  • In the first method, one set contains a row variable and a column variable for a two-dimensional contingency table, and the second set contains all of the other variables. The variables in the second set are used to form a set of profiles. Thus, the data are represented as a series of two-dimensional contingency tables, one for each profile. This is the data representation used by PROC FREQ. For example, if you request tables for RACE*SEX*AGE*INCOME, the FREQ procedure represents the data as a series of contingency tables: the row variable is AGE, the column variable is INCOME, and the combinations of levels of RACE and SEX form a set of profiles.

  • In the second method, one set contains the independent variables, and the other set contains the dependent variables. Profiles based on the independent variables are called population profiles, whereas those based on the dependent variables are called response profiles. A two-dimensional contingency table is then formed, with one row for each population profile and one column for each response profile. Since any subject can have only one population profile and one response profile, the contingency table is uniquely defined. This is the data representation used by the modeling procedures.
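
The two display methods above can be sketched in Python as follows. This is an illustration only, not how PROC FREQ or the modeling procedures are implemented. The records, the function names, and the choice of income as the only dependent variable in the second method are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical records: (race, sex, age, income).
records = [
    ("A", "F", "young", "low"),
    ("A", "F", "young", "high"),
    ("A", "F", "old", "high"),
    ("A", "M", "young", "low"),
    ("B", "F", "old", "low"),
    ("B", "F", "old", "low"),
]

def stratified_tables(rows):
    """First method: one two-dimensional table per profile.

    The last two fields are the row and column variables; the remaining
    fields form the profile that identifies each table.
    """
    tables = defaultdict(Counter)
    for *stratum, row_level, col_level in rows:
        tables[tuple(stratum)][(row_level, col_level)] += 1
    return tables

def population_by_response(rows, n_independent):
    """Second method: a single table of population profiles by response profiles.

    The first n_independent fields are treated as independent variables
    and the remaining fields as dependent variables (an assumption here).
    """
    table = defaultdict(Counter)
    for record in rows:
        population = record[:n_independent]
        response = record[n_independent:]
        table[population][response] += 1
    return table

# One AGE-by-INCOME table for each (RACE, SEX) combination.
by_stratum = stratified_tables(records)

# Rows are (RACE, SEX, AGE) population profiles; columns are INCOME responses.
by_population = population_by_response(records, n_independent=3)
```

In the first structure, each key is a (race, sex) profile that selects one age-by-income table. In the second, each record contributes to exactly one row and one column, because a subject has only one population profile and one response profile.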

Note: Modeling procedures for categorical data analysis require only that the response variable be categorical; the explanatory variables can be continuous or categorical. However, PROC CATMOD was designed to handle contingency table data, and it does not efficiently handle continuous covariates.