Special SAS Data Sets


TYPE=CORR Data Sets

A TYPE=CORR data set usually contains a correlation matrix and possibly other statistics including means, standard deviations, and the number of observations in the original SAS data set from which the correlation matrix was computed. Using PROC CORR with an output data set option (OUTP=, OUTS=, OUTK=, OUTH=, or OUT=) produces a TYPE=CORR data set. (For a complete description of the CORR procedure, see the Base SAS Procedures Guide: Statistical Procedures.) The CALIS, CANCORR, CANDISC, DISCRIM, PRINCOMP, and VARCLUS procedures can also create a TYPE=CORR data set with additional statistics (the CORR option is needed in PROC CALIS). A TYPE=CORR data set containing a correlation matrix can be used as input for the ACECLUS, CALIS, CANCORR, CANDISC, DISCRIM, FACTOR, PRINCOMP, REG, SCORE, STEPDISC, and VARCLUS procedures. The variables in a TYPE=CORR data set are as follows:

  • the BY variable or variables, if a BY statement is used with the procedure

  • _TYPE_, a character variable of length eight with values identifying the type of statistic in each observation, such as 'MEAN', 'STD', 'N', and 'CORR'

  • _NAME_, a character variable with values identifying the variable with which a given row of the correlation matrix is associated

  • other variables that were analyzed by the CORR procedure or other procedures

The usual values of the _TYPE_ variable are as follows:

_TYPE_    Contents

MEAN      mean of each variable analyzed

STD       standard deviation of each variable

N         number of observations used in the analysis. PROC CORR records the number of nonmissing values for each variable unless the NOMISS option is used. If the NOMISS option is specified, or if the CALIS, CANCORR, CANDISC, PRINCOMP, or VARCLUS procedure is used to create the data set, observations with one or more missing values are omitted from the analysis, so this value is the same for each variable and provides the number of observations with no missing values. If a FREQ statement is used with the procedure that creates the data set, the number of observations is the sum of the relevant values of the variable in the FREQ statement. Procedures that read a TYPE=CORR data set use the smallest value in the observation with _TYPE_='N' as the number of observations in the analysis.

SUMWGT    sum of the observation weights if a WEIGHT statement is used with the procedure that creates the data set. The values are determined analogously to those of the _TYPE_='N' observation.

CORR      correlations with the variable named by the _NAME_ variable

There might be additional observations in a TYPE=CORR data set depending on the particular procedure and options used.

If you create a TYPE=CORR data set yourself, the data set need not contain the observations with _TYPE_='MEAN', 'STD', 'N', or 'SUMWGT', unless you intend to use one of the discriminant procedures. If this information is not in the TYPE=CORR data set, procedures assume that all of the means are 0.0 and that all of the standard deviations are 1.0. If _TYPE_='N' does not appear, most procedures assume that the number of observations is 10,000; in that case, significance tests and other statistics that depend on the number of observations are meaningless. In the CALIS and CANCORR procedures, you can use the EDF= option instead of including a _TYPE_='N' observation.
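As a minimal sketch of a hand-built data set that does carry these optional observations, the following DATA step reads 'MEAN', 'STD', and 'N' rows ahead of the correlations. The data set name mycorr, the variables X, Y, and Z, and all of the numbers are hypothetical and serve only as illustration.

```sas
/* Hypothetical example: names and values are made up for illustration. */
data mycorr(type=corr);
   /* MISSOVER leaves unread values on short lines as missing */
   infile datalines missover;
   length _type_ $ 8 _name_ $ 8;
   /* A single period in the _name_ field is read as a blank (missing) */
   input _type_ $ _name_ $ X Y Z;
   datalines;
MEAN  .  10.0  20.0  30.0
STD   .   2.0   4.0   5.0
N     .    50    50    50
CORR  X   1.0
CORR  Y   0.3   1.0
CORR  Z   0.5   0.2   1.0
;

proc print data=mycorr;
run;
```

Only the rows that a later analysis needs have to be supplied; for example, the 'MEAN' and 'STD' rows could be dropped if means of 0.0 and standard deviations of 1.0 are acceptable.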

A correlation matrix is symmetric; that is, the correlation between X and Y is the same as the correlation between Y and X. The CALIS, CANCORR, CANDISC, CORR, DISCRIM, PRINCOMP, and VARCLUS procedures output the entire correlation matrix. If you create the data set yourself, you need to include only one of the two occurrences of the correlation between two variables; the other can be given a missing value.

If you create a TYPE=CORR data set yourself, the _TYPE_ and _NAME_ variables are not necessary except for use with the discriminant procedures and PROC SCORE. If there is no _TYPE_ variable, then all observations are assumed to contain correlations. If there is no _NAME_ variable, the first observation is assumed to correspond to the first variable in the analysis, the second observation to the second variable, and so on. However, if you omit the _NAME_ variable, you will not be able to analyze arbitrary subsets of the variables or list the variables in a VAR or MODEL statement in a different order.
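The bare form described above can be sketched as follows, assuming a hypothetical three-variable correlation matrix. The data set name barecorr, the variables X, Y, and Z, and the values are illustrative only. With no _TYPE_ variable every row is taken as correlations, and with no _NAME_ variable the rows are matched to the variables by position.

```sas
/* Hypothetical example: no _TYPE_ or _NAME_ variable is supplied,
   so row 1 is taken to be X, row 2 to be Y, and row 3 to be Z. */
data barecorr(type=corr);
   infile datalines missover;   /* upper triangle is left missing */
   input X Y Z;
   datalines;
1.0
0.3  1.0
0.5  0.2  1.0
;

/* All variables are analyzed in data set order; a VAR statement that
   reorders or subsets them would require a _NAME_ variable. */
proc princomp data=barecorr;
run;
```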

Example A.1: A TYPE=CORR Data Set Produced by PROC CORR

The following SAS statements produce the TYPE=CORR data set shown in Figure A.1. Figure A.2 displays partial output from PROC CONTENTS, which indicates that the "Data Set Type" is 'CORR'.

title 'Five Socioeconomic Variables';
title2 'Harman (1976), Modern Factor Analysis, Third Edition';

data SocEcon;
   input Pop School Employ Services House;
   datalines;
5700     12.8      2500      270       25000
1000     10.9      600       10        10000
3400     8.8       1000      10        9000
3800     13.6      1700      140       25000
4000     12.8      1600      140       25000
8200     8.3       2600      60        12000
1200     11.4      400       10        16000
9100     11.5      3300      60        14000
9900     12.5      3400      180       18000
9600     13.7      3600      390       25000
9600     9.6       3300      80        12000
9400     11.4      4000      100       13000
;

proc corr data=SocEcon noprint out=corrcorr;
run;
proc print data=corrcorr;
run;
proc contents data=corrcorr;
run;

Figure A.1: A TYPE=CORR Data Set Produced by PROC CORR

Five Socioeconomic Variables
Harman (1976), Modern Factor Analysis, Third Edition

Obs  _TYPE_  _NAME_        Pop   School   Employ  Services     House
  1  MEAN              6241.67  11.4417  2333.33   120.833  17000.00
  2  STD               3439.99   1.7865  1241.21   114.928   6367.53
  3  N                   12.00  12.0000    12.00    12.000     12.00
  4  CORR    Pop          1.00   0.0098     0.97     0.439      0.02
  5  CORR    School       0.01   1.0000     0.15     0.691      0.86
  6  CORR    Employ       0.97   0.1543     1.00     0.515      0.12
  7  CORR    Services     0.44   0.6914     0.51     1.000      0.78
  8  CORR    House        0.02   0.8631     0.12     0.778      1.00



Figure A.2: Contents of a TYPE=CORR Data Set

Five Socioeconomic Variables
Harman (1976), Modern Factor Analysis, Third Edition

The CONTENTS Procedure

Data Set Name        WORK.CORRCORR                 Observations          8
Member Type          DATA                          Variables             7
Engine               SASE7                         Indexes               0
Created              DDMMMYY:00:00:00              Observation Length    56
Last Modified        DDMMMYY:00:00:00              Deleted Observations  0
Protection                                         Compressed            NO
Data Set Type        CORR                          Sorted                NO
Label                Pearson Correlation Matrix
Data Representation  Native
Encoding             Session



Example A.2: Creating a TYPE=CORR Data Set in a DATA Step

This example creates a TYPE=CORR data set by reading a correlation matrix in a DATA step. Figure A.3 shows the resulting data set.

title 'Five Socioeconomic Variables';

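/* TYPE=CORR marks the data set as a correlation matrix */
data datacorr(type=corr);
   /* MISSOVER sets the unread upper-triangle values to missing */
   infile datalines missover;
   _type_='corr';
   input _Name_ $ Pop School Employ Services House;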
   datalines;
Pop        1.00000
School     0.00975   1.00000
Employ     0.97245   0.15428   1.00000
Services   0.43887   0.69141   0.51472   1.00000
House      0.02241   0.86307   0.12193   0.77765   1.00000
;
proc print data=datacorr;
run;

Figure A.3: A TYPE=CORR Data Set Created by a DATA Step

Five Socioeconomic Variables

OBS  _type_  _Name_        Pop   School   Employ  Services  House
  1  corr    Pop       1.00000        .        .         .      .
  2  corr    School    0.00975  1.00000        .         .      .
  3  corr    Employ    0.97245  0.15428  1.00000         .      .
  4  corr    Services  0.43887  0.69141  0.51472   1.00000      .
  5  corr    House     0.02241  0.86307  0.12193   0.77765      1
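
A data set such as datacorr can then be passed to any procedure that reads TYPE=CORR input. The following is a sketch only: the choice of PROC FACTOR and of this particular VAR list is an assumption made for illustration and is not part of the original example. Because the data set includes the _Name_ variable, the VAR statement can name a subset of the variables in a different order from the data set.

```sas
/* Illustrative use of the data set from Example A.2 (assumed usage).
   The VAR list is a reordered subset, which relies on _Name_. */
proc factor data=datacorr method=principal;
   var House School Services;
run;
```

Because datacorr has no observation with _TYPE_='N', any statistics in the output that depend on the number of observations should be disregarded.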