Special SAS Data Sets
TYPE=CORR Data Sets
A TYPE=CORR data set usually contains a correlation matrix and possibly other statistics including means, standard deviations, and the number of observations in the original SAS data set from which the correlation matrix was computed. Using PROC CORR with an output data set option (OUTP=, OUTS=, OUTK=, OUTH=, or OUT=) produces a TYPE=CORR data set. (For a complete description of the CORR procedure, see the Base SAS Procedures Guide: Statistical Procedures.) The CALIS, CANCORR, CANDISC, DISCRIM, PRINCOMP, and VARCLUS procedures can also create a TYPE=CORR data set with additional statistics. A TYPE=CORR data set containing a correlation matrix can be used as input for the ACECLUS, CALIS, CANCORR, CANDISC, DISCRIM, FACTOR, PRINCOMP, REG, SCORE, STEPDISC, and VARCLUS procedures. The variables in a TYPE=CORR data set are as follows:
the BY variable or variables, if a BY statement is used with the procedure
_TYPE_, a character variable of length eight with values identifying the type of statistic in each observation, such as 'MEAN', 'STD', 'N', and 'CORR'
_NAME_, a character variable with values identifying the variable with which a given row of the correlation matrix is associated
other variables that were analyzed by the CORR procedure or other procedures
The usual values of the _TYPE_ variable are as follows:
_TYPE_ | Contents |
---|---|
MEAN | mean of each variable analyzed |
STD | standard deviation of each variable |
N | number of observations used in the analysis. PROC CORR records the number of nonmissing values for each variable unless the NOMISS option is used. If the NOMISS option is specified, or if the CALIS, CANCORR, CANDISC, PRINCOMP, or VARCLUS procedure is used to create the data set, observations with one or more missing values are omitted from the analysis, so this value is the same for each variable and provides the number of observations with no missing values. If a FREQ statement is used with the procedure that creates the data set, the number of observations is the sum of the relevant values of the variable in the FREQ statement. Procedures that read a TYPE=CORR data set use the smallest value in the observation with _TYPE_='N' as the number of observations in the analysis. |
SUMWGT | sum of the observation weights if a WEIGHT statement is used with the procedure that creates the data set. The values are determined analogously to those of the _TYPE_='N' observation. |
CORR | correlations with the variable named by the _NAME_ variable |
There might be additional observations in a TYPE=CORR data set depending on the particular procedure and options used.
If you create a TYPE=CORR data set yourself, the data set need not contain the observations with _TYPE_='MEAN', 'STD', 'N', or 'SUMWGT', unless you intend to use one of the discriminant procedures. If this information is not in the TYPE=CORR data set, procedures assume that all of the means are 0.0 and that all of the standard deviations are 1.0. If no _TYPE_='N' observation appears, most procedures assume that the number of observations is 10,000; in that case, significance tests and other statistics that depend on the number of observations are meaningless. In the CALIS and CANCORR procedures, you can use the EDF= option instead of including a _TYPE_='N' observation.
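As a minimal sketch of the point above, the following statements build a small TYPE=CORR data set by hand and include a _TYPE_='N' observation, so that a procedure reading it does not fall back on the default of 10,000 observations. The data set name CorrN, the variables X, Y, and Z, the correlations, and the sample size of 50 are all hypothetical and chosen only for illustration.

```sas
/* Hypothetical example: names, correlations, and N=50 are illustrative only. */
data CorrN(type=corr);
   infile datalines missover;
   length _type_ $ 8 _name_ $ 8;
   /* With list input, a single period in a character field is read as blank. */
   input _type_ $ _name_ $ X Y Z;
   datalines;
N    .  50      50      50
CORR X  1.0000
CORR Y  0.3000  1.0000
CORR Z  0.5000  0.4000  1.0000
;

/* The _TYPE_='N' observation supplies the number of observations. */
proc princomp data=CorrN;
run;
```

No 'MEAN' or 'STD' observations are included here, so the means and standard deviations take the assumed values of 0.0 and 1.0. With PROC CALIS or PROC CANCORR, an EDF= value could stand in for the _TYPE_='N' observation; check the procedure documentation for how EDF= maps to the number of observations.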
A correlation matrix is symmetric; that is, the correlation between X and Y is the same as the correlation between Y and X. The CALIS, CANCORR, CANDISC, CORR, DISCRIM, PRINCOMP, and VARCLUS procedures output the entire correlation matrix. If you create the data set yourself, you need to include only one of the two occurrences of the correlation between two variables; the other can be given a missing value.
If you create a TYPE=CORR data set yourself, the _TYPE_ and _NAME_ variables are not necessary except for use with the discriminant procedures and PROC SCORE. If there is no _TYPE_ variable, then all observations are assumed to contain correlations. If there is no _NAME_ variable, the first observation is assumed to correspond to the first variable in the analysis, the second observation to the second variable, and so on. However, if you omit the _NAME_ variable, you will not be able to analyze arbitrary subsets of the variables or list the variables in a VAR or MODEL statement in a different order.
See Figure A.1 for an example of a TYPE=CORR data set produced by the following SAS statements. Figure A.2 displays partial output from PROC CONTENTS, which indicates that the "Data Set Type" is 'CORR'.
title 'Five Socioeconomic Variables';
title2 'Harman (1976), Modern Factor Analysis, Third Edition';
data SocEcon;
   input Pop School Employ Services House;
   datalines;
5700     12.8      2500      270       25000
1000     10.9      600       10        10000
3400     8.8       1000      10        9000
3800     13.6      1700      140       25000
4000     12.8      1600      140       25000
8200     8.3       2600      60        12000
1200     11.4      400       10        16000
9100     11.5      3300      60        14000
9900     12.5      3400      180       18000
9600     13.7      3600      390       25000
9600     9.6       3300      80        12000
9400     11.4      4000      100       13000
;

proc corr noprint out=corrcorr;
run;

proc print data=corrcorr;
run;

proc contents data=corrcorr;
run;
Figure A.1: A TYPE=CORR data set produced by PROC CORR

Five Socioeconomic Variables
Harman (1976), Modern Factor Analysis, Third Edition

Obs | _TYPE_ | _NAME_ | Pop | School | Employ | Services | House |
---|---|---|---|---|---|---|---|
1 | MEAN | | 6241.67 | 11.4417 | 2333.33 | 120.833 | 17000.00 |
2 | STD | | 3439.99 | 1.7865 | 1241.21 | 114.928 | 6367.53 |
3 | N | | 12.00 | 12.0000 | 12.00 | 12.000 | 12.00 |
4 | CORR | Pop | 1.00 | 0.0098 | 0.97 | 0.439 | 0.02 |
5 | CORR | School | 0.01 | 1.0000 | 0.15 | 0.691 | 0.86 |
6 | CORR | Employ | 0.97 | 0.1543 | 1.00 | 0.515 | 0.12 |
7 | CORR | Services | 0.44 | 0.6914 | 0.51 | 1.000 | 0.78 |
8 | CORR | House | 0.02 | 0.8631 | 0.12 | 0.778 | 1.00 |
Figure A.2: Partial output from PROC CONTENTS

Five Socioeconomic Variables
Harman (1976), Modern Factor Analysis, Third Edition

Data Set Name | WORK.CORRCORR | Observations | 8 |
---|---|---|---|
Member Type | DATA | Variables | 7 |
Engine | SASE7 | Indexes | 0 |
Created | DDMMMYY:00:00:00 | Observation Length | 56 |
Last Modified | DDMMMYY:00:00:00 | Deleted Observations | 0 |
Protection | | Compressed | NO |
Data Set Type | CORR | Sorted | NO |
Label | Pearson Correlation Matrix | | |
Data Representation | Native | | |
Encoding | Session | | |
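If you want to check the data set type in a program rather than by reading the PROC CONTENTS listing, one option is to query the DICTIONARY.TABLES view in PROC SQL. The following is a sketch that assumes the corrcorr data set created above is still in the WORK library; the TYPEMEM column holds the data set type.

```sas
/* Sketch: look up the data set type of WORK.CORRCORR. */
/* Library and member names are stored in uppercase.   */
proc sql;
   select memname, typemem
      from dictionary.tables
      where libname='WORK' and memname='CORRCORR';
quit;
```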
This example creates a TYPE=CORR data set by reading a correlation matrix in a DATA step. Figure A.3 shows the resulting data set.
title 'Five Socioeconomic Variables';
data datacorr(type=corr);
   infile cards missover;
   _type_='corr';
   input _Name_ $ Pop School Employ Services House;
   datalines;
Pop        1.00000
School     0.00975   1.00000
Employ     0.97245   0.15428   1.00000
Services   0.43887   0.69141   0.51472   1.00000
House      0.02241   0.86307   0.12193   0.77765   1.00000
;

proc print data=datacorr;
run;
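Because datacorr includes the _NAME_ variable, a procedure can analyze a subset of the variables, listed in any order. The following sketch illustrates this with PROC FACTOR; the choice of procedure and of the three variables is arbitrary. Since datacorr has no _TYPE_='N' observation, the NOBS= option of PROC FACTOR is used here to supply the number of observations, with 12 taken from the SocEcon data in the earlier example.

```sas
/* Sketch: analyze a reordered subset of the variables in datacorr.    */
/* NOBS=12 replaces the default of 10,000 observations that would      */
/* otherwise be assumed, because datacorr has no _TYPE_='N' row.       */
proc factor data=datacorr nobs=12;
   var House School Services;
run;
```

Without the _NAME_ variable, the rows of the matrix would be matched to variables by position only, so a VAR statement like this one could not be used.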
Copyright © SAS Institute, Inc. All Rights Reserved.