Sample 25010: Create a polychoric correlation or distance matrix


Contents: Purpose / History / Requirements / Usage / Details / Limitations / Missing Values / References

NOTE: Beginning in SAS 9.4, this macro is no longer needed. Use the OUTPLC= option in Base SAS PROC CORR to save a matrix of polychoric (or tetrachoric) correlations.
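As a minimal sketch of the PROC CORR alternative mentioned in the note above, the following step saves a polychoric correlation matrix directly. The data set name SURVEY, the variables Q1-Q5, and the output name PLC_MATRIX are assumptions for illustration only, and the variables are assumed to be numeric ordinal codes because PROC CORR analyzes numeric variables.

```sas
/* Hypothetical input: WORK.SURVEY with numeric ordinal items Q1-Q5 */
proc corr data=survey polychoric outplc=plc_matrix;
   var q1-q5;   /* variables to correlate (assumed names) */
run;
```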

PURPOSE:
The %POLYCHOR macro creates a SAS data set containing a correlation matrix of polychoric correlations or a distance matrix based on polychoric correlations.
HISTORY:
Version  Update Notes
1.7  Fixed errors referring to variable type or length of the NAME variable when analyzing character variables. Improved distance example. Added ID=, ORDER=, and PRINTLEVELS= options. Made _NUMERIC_, _CHARACTER_, _ALL_ available with VAR=. Fixed looping if variable named I or J used. Fixed problem if NAME is variable in input data set. Use of ODS now makes SAS 8 the minimum release.
1.6  Fixed bug that didn't allow a variable named X in the input data. Default for VAR= is now all variables, not just numeric variables. Fixed problem where correlation from previous variable pair was used when the current pair does not form at least a 2x2 table (a WARNING appears in the log when this happens). Use of %sysfunc requires SAS 6.12 or later. Added automatic check for new version.
1.5  Fixed display of converge message. Capture and reset NOTES option at end.
1.4  Removed NOSTIMER option. Allow for long variable names in SAS 7 or later. Added version indicator to macro notes.
1.3  Print message if no convergence when computing polychoric correlations. Added CONVERGE= and MAXITER= options.
1.2  Fixed problem with macro notes not printing.
REQUIREMENTS:
The %POLYCHOR macro requires Version 8 or later of Base SAS Software.
USAGE:
Follow the instructions in the Downloads tab of this sample to save the %POLYCHOR macro definition. Replace the text within quotes in the following statement with the location of the %POLYCHOR macro definition file on your system. In your SAS program or in the SAS editor window, specify this statement to define the %POLYCHOR macro and make it available for use:
   %inc "<location of your file containing the POLYCHOR macro>";

Following this statement, you may call the %POLYCHOR macro. See the Results tab for an example.
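For orientation, a call using the options described below might look like the following sketch. The data set SURVEY, the variables Q1-Q5, and the output name PLC_MATRIX are hypothetical; only the option names come from this sample.

```sas
/* Define the macro (replace the quoted text with your own file location) */
%inc "<location of your file containing the POLYCHOR macro>";

/* Hypothetical raw data set SURVEY with ordinal items Q1-Q5 */
%polychor(data=survey, var=q1 q2 q3 q4 q5, out=plc_matrix, type=corr)

/* Inspect the resulting TYPE=CORR data set */
proc print data=plc_matrix;
run;
```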

The options and allowable values are:

DATA=
SAS data set to be analyzed. If the DATA= option is not supplied, the most recently created SAS data set is used. Input data must be raw data, not data summarized to cell counts. When computing distances among items, this data set should be structured with items as columns and variables as rows as described below.
VAR=
Polychoric or tetrachoric correlations will be computed for every pair of variables listed in the VAR= option. Separate individual variable names with blanks. Keywords _NUMERIC_, _CHARACTER_ and _ALL_ may also be used to represent all numeric variables, all character variables, or all variables in the data set, respectively. Do not repeat variable names. By default, all variables found in the data set will be used. See LIMITATIONS below for time considerations.
OUT=
Specifies the name of the output data set that will contain the correlation or distance matrix. By default, the output data set is named _PLCORR.
TYPE=CORR | DISTANCE
Specifies the type of matrix to be created. If TYPE=CORR (the default), then a correlation matrix is computed and the output data set is assigned a data set type of CORR. If TYPE=DISTANCE, then a distance matrix is computed and the output data set is assigned a data set type of DISTANCE. See DETAILS below for more information.
ID=
When a distance matrix is requested by specifying TYPE=DISTANCE, ID= names the variable added to the OUT= data set which contains the variable names from the DATA= data set. This variable can be used in subsequent analyses to identify the items. The ID= option is ignored if TYPE=CORR. If ID= is not specified when TYPE=DISTANCE, the variable is named _ID_.
ORDER=DATA | FORMATTED | FREQ | INTERNAL
If specified, this option is passed to every PROC FREQ step that the macro runs to compute the correlations. The default is INTERNAL. See the description of the ORDER= option in the PROC FREQ documentation.
PRINTLEVELS=YES | NO
When PRINTLEVELS=YES (the default), the macro displays the "Character Variable Levels" table, which shows the levels of every character variable and the order in which those levels are used when computing the polychoric correlations. See DETAILS below for more information.
CONVERGE=
Specifies the convergence criterion for computing the polychoric correlation. The value of the CONVERGE= option must be a positive number; by default, CONVERGE=0.0001. Iterative computation of the polychoric correlation stops when the convergence measure falls below the value of the CONVERGE= option or when the number of iterations exceeds the value specified in the MAXITER= option, whichever happens first. See DETAILS below for more information.
MAXITER=
Specifies the maximum number of iterations for computing the polychoric correlation. The value of the MAXITER= option must be a positive integer; by default, MAXITER=20. Iterative computation of the polychoric correlation stops when the number of iterations exceeds the value of the MAXITER= option, or when the convergence measure falls below the value of the CONVERGE= option, whichever happens first. See DETAILS below for more information.
DETAILS:
The PLCORR option in the FREQ procedure is used iteratively to compute the polychoric correlation for each pair of variables. If both variables in a pair are binary (that is, they take on only two distinct values), then the correlation computed by the PLCORR option is usually referred to as the tetrachoric correlation.

Convergence problems
The PLCORR option uses an iterative, maximum likelihood method to estimate the polychoric correlation. Occasionally, this method will not converge on an estimate, and in that case, the value of the correlation is set to missing. By adjusting the values of the CONVERGE= and/or MAXITER= options in the TABLE statement of PROC FREQ, you may be able to obtain an estimate. For example, the following statements attempt to estimate the polychoric correlation between variables X1 and X2 setting the convergence criterion to a more lenient 0.001 and allowing more iterations than the default. See the FREQ chapter of the SAS/STAT User's Guide for details on these options.

        proc freq;
           table x1 * x2 / plcorr converge=.001 maxiter=30;
           run;

You can adjust the CONVERGE= and MAXITER= options for the estimation of all polychoric correlations by using the CONVERGE= and MAXITER= options in the %POLYCHOR macro.
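As a sketch, the same relaxed settings can be applied to every pair of variables through the macro. The data set name SURVEY and the variable X3 are assumptions added for illustration.

```sas
/* Hypothetical data set SURVEY; looser criterion and more iterations for all pairs */
%polychor(data=survey, var=x1 x2 x3, converge=0.001, maxiter=30)
```

Any correlation that still fails to converge remains missing in the OUT= data set.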

Ordering of levels
The variables are assumed to be ordinal variables. PROC FREQ forms a crosstabulation for each pair of variables. Note that the ordering of the rows and columns in a table affects the computation of the polychoric correlation. For instance, two variables with levels LOW, MEDIUM, and HIGH, in this order, will produce a different correlation estimate when ordered MEDIUM, LOW, HIGH. You should verify that all of your variables will be in the desired order for your chosen setting of the ORDER= option. The PRINTLEVELS=YES option displays all character variable levels in the order used when computing the correlations. The ORDER= option affects the ordering of all variables, character and numeric, in the analysis.
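One way to control the ordering, shown here only as a sketch, is to recode a character variable into numeric codes before calling the macro. With ORDER=INTERNAL, character values sort alphabetically (HIGH, LOW, MEDIUM), which is not the intended ordinal order. The data set SURVEY and the variables RATING, RATING_N, and Q2 are hypothetical.

```sas
/* Recode a hypothetical character variable RATING so that its
   internal (sorted) order matches the intended ordinal order */
data survey_num;
   set survey;
   select (upcase(rating));
      when ('LOW')    rating_n = 1;
      when ('MEDIUM') rating_n = 2;
      when ('HIGH')   rating_n = 3;
      otherwise       rating_n = .;   /* unrecognized values become missing */
   end;
   drop rating;
run;

/* RATING_N now sorts as LOW < MEDIUM < HIGH under ORDER=INTERNAL */
%polychor(data=survey_num, var=rating_n q2, order=internal)
```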

Obtaining a correlation matrix
If TYPE=CORR is specified, the individual correlation coefficients are then assembled into a TYPE=CORR data set containing a matrix of polychoric correlations. The resulting data set can be used, but for descriptive analyses only, in either the FACTOR or the CALIS procedure (specify METHOD=ULS in either procedure). If the maximum likelihood method (METHOD=ML) is used, note that none of the hypothesis tests will be valid, and the polychoric correlation matrix may be indefinite with small samples.
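A minimal sketch of a descriptive factor analysis on the macro output follows. It assumes the default output name _PLCORR and that all correlations are nonmissing; NFACTORS=2 is an arbitrary choice for illustration.

```sas
/* _PLCORR is the macro's default TYPE=CORR output data set */
proc factor data=_plcorr method=uls nfactors=2;
run;
```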

Obtaining a distance matrix
If TYPE=DISTANCE is specified, a TYPE=DISTANCE data set is created containing a matrix of dissimilarity values. The dissimilarity value used is computed as 1 - plcorr^2, where plcorr is the polychoric correlation. It is assumed that the columns of the input data set are the items to be clustered and that each row of the data set is a variable (such as HEIGHT) on which the items are measured. Collectively, the variables (rows) locate the items (columns) to be clustered. Note that this data structure is the transpose of the usual data set input to such procedures as PROC CLUSTER, in which the items to be clustered are the rows (observations) in the data set and the variables that locate the items are the columns. You can use PROC TRANSPOSE to convert rows to columns.

The distance data set created by the %POLYCHOR macro includes a variable containing the item (column) names, which can be used in subsequent analyses to identify the items. You can name this variable using the ID= option (the default name is _ID_). The output data set can be used in the CLUSTER procedure (but the CCC value is not valid) or the MODECLUS procedure. As discussed in the documentation of these procedures and the DISTANCE procedure, variables with higher variability have greater effect on the distance measure. As a result, you may want to standardize the variables before computing distances. This can be done by using PROC STDIZE.
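The distance workflow can be sketched as follows. The sketch assumes a hypothetical data set ITEMS_LONG in the usual layout (one row per item, the item name in ITEM, ordinal measurements in R1-R20); the names ITEMS_WIDE, ITEM_DIST, ITEM_ID, and TREE are also illustrative, and METHOD=AVERAGE is an arbitrary choice of clustering method.

```sas
/* Transpose so that items become columns and measured variables become rows */
proc transpose data=items_long out=items_wide(drop=_name_);
   id item;        /* item names become the new column names */
   var r1-r20;     /* hypothetical ordinal measurements */
run;

/* Build the distance matrix among the items (columns) */
%polychor(data=items_wide, type=distance, out=item_dist, id=item_id)

/* Cluster the items using the distance matrix */
proc cluster data=item_dist method=average outtree=tree;
   id item_id;
run;
```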

See the appendix "Special SAS Data Sets" in the SAS/STAT User's Guide for a description of TYPE=CORR and TYPE=DISTANCE data sets.

LIMITATIONS:
The time required to compute the correlation or distance matrix increases quadratically as the number of variables increases. Up to 999 variables are allowed, but the time required for more than 100 variables may be exorbitant.

If a message which begins like this appears in the SAS log:

WARNING: No OUTPUT data set is produced because no statistics can be computed for this
         table, which has ...

it indicates that the current pair of variables does not form at least a 2x2 table. The polychoric correlation cannot be computed and is set to missing in the output data set.

If some polychoric correlations could not be estimated and are missing in the OUT= data set, then an attempt to use the data set as input to an analytical procedure such as PROC PRINCOMP results in this message in the SAS log:

ERROR: CORR matrix incomplete in data set WORK._PLCORR.

All correlations must be nonmissing in order to do an analysis.

MISSING VALUES:
Observations with missing values are omitted from the computation of correlations. However, when computing the polychoric correlation between two variables, if an observation's values for these two variables are not missing, then the observation is used regardless of any missing values the observation may have in other variables. Since a matrix of correlations based on differing numbers of observations can be singular, you may want to remove all missing values from the analysis variables prior to use of this macro.
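One way to remove incomplete observations beforehand is sketched below. It assumes a hypothetical data set SURVEY whose analysis variables Q1-Q5 are all numeric; the NMISS function counts missing values among numeric arguments only.

```sas
/* Keep only observations with no missing values in Q1-Q5 (assumed numeric) */
data survey_complete;
   set survey;
   if nmiss(of q1-q5) = 0;   /* subsetting IF drops incomplete rows */
run;

/* Every correlation is now based on the same set of observations */
%polychor(data=survey_complete, var=q1 q2 q3 q4 q5)
```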

See the DETAILS and LIMITATIONS sections above for information on polychoric correlation estimates that are set to missing.

REFERENCES:
See the Drasgow (1986) and Olsson (1979) references cited in the FREQ chapter of the SAS/STAT User's Guide.


