25020 - One-way ANOVA on summary data

Contents:

Purpose / Requirements / Usage / Details / Limitations / References

NOTE: For comparing two group means using summary data, use SAS/STAT PROC TTEST. See the example in the TTEST documentation.

PURPOSE:

Perform a one-way analysis of variance on an existing SAS data set that contains only summary data.

REQUIREMENTS:

Base SAS software and SAS/STAT software, Version 6 or later, are required.

USAGE:

Follow the instructions in the Downloads tab of this sample to save the %SUM_GLM macro definition. Replace the text within quotes in the following statement with the location of the %SUM_GLM macro definition file on your system. In your SAS program or in the SAS editor window, specify this statement to define the %SUM_GLM macro and make it available for use:

   %inc "<location of your file containing the SUM_GLM macro>";

After this statement, you can call the %SUM_GLM macro. See the EXAMPLE section for a complete example.

The input data set, specified in the data= option, should be structured so that each observation contains the summary statistics for a single level of the group= variable. The data set must have variables containing the group levels, sample sizes, means, and standard deviations. Variables for BY-group processing may also appear. If by= is specified, the data set must be sorted by the BY variables before calling the %SUM_GLM macro.

The following parameters are required when using the macro:

group=

Name of the classification (grouping) variable.

n=

Name of the variable containing sample sizes.

mean=

Name of the variable containing the means.

stddev=

Name of the variable containing the standard deviations.

The following parameters are optional:

data=

Name of the SAS data set containing the summary data. If not specified, the last-created data set is used.

lsopts=

Any valid option for the LSMEANS statement in the GLM Procedure.

by=

Names of any BY variable(s).
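As a minimal sketch of a call that supplies only the required parameters (plus data= for clarity), the following uses the BY group A summary statistics from the example in this sample. The data set name summaryA is hypothetical, and the sketch assumes the %SUM_GLM macro has already been defined with the %inc statement shown above.

```sas
/* BY group A summary statistics from this sample's example, one row per treat level */
data summaryA;
  input count means std treat;
  datalines;
3  7.8333  0.4041 1
5  8.3800  0.4147 2
4  6.6250  0.1708 3
3  6.9000  0.4583 4
;

/* Required parameters only; lsopts= and by= are omitted */
%sum_glm(data=summaryA,
         group=treat,
         n=count,
         mean=means,
         stddev=std)
```

Because the data set holds a single set of groups, by= is not needed here.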

DETAILS:

The %SUM_GLM macro is based on the methods presented in Larson (1992). That paper gives a method for generating surrogate data that reproduce the summary statistics, and then analyzes the surrogate data in place of the original observations.
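The surrogate-data idea can be sketched in a DATA step. This is an illustration under stated assumptions, not the macro's actual code: it assumes one particular two-value construction that reproduces each group's sample size, mean, and standard deviation when every group has at least two observations. The data set names summaryA and surrogate are hypothetical, and the values are the BY group A summary statistics from this sample's example.

```sas
/* Summary statistics for BY group A of this sample's example */
data summaryA;
  input count means std treat;
  datalines;
3  7.8333  0.4041 1
5  8.3800  0.4147 2
4  6.6250  0.1708 3
3  6.9000  0.4583 4
;

/* Assumed construction: two weighted surrogate rows per group (count >= 2) */
data surrogate;
  set summaryA;
  /* count-1 copies of a value above the group mean */
  y    = means + std / sqrt(count);
  freq = count - 1;
  output;
  /* one value below the mean, chosen so the group mean and SD are restored */
  y    = means - (count - 1) * std / sqrt(count);
  freq = 1;
  output;
  keep treat y freq;
run;

/* Check: the frequency-weighted N, mean, and SD should match count, means, and std */
proc means data=surrogate n mean std;
  class treat;
  var y;
  freq freq;
run;
```

With this construction the weighted mean is exactly the group mean, and the weighted sum of squared deviations is (count-1) times the squared standard deviation, which is all a one-way ANOVA needs from each group.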

The macro creates the data set _WORKING, which can be analyzed directly with PROC GLM by including the following FREQ statement:

   freq freq;

The response variable in this data set is named Y. If the GLM analysis done by the macro is not exactly what you want, you can reanalyze the _WORKING data set with PROC GLM by including the above FREQ statement in your GLM step.
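For instance, a reanalysis that requests Tukey-adjusted pairwise comparisons might look like the sketch below. It assumes that %SUM_GLM has already run in the current session so that _WORKING exists, and that _WORKING carries the grouping and BY variables under their original names (treat and bygroup in this sample's example). Neither assumption is confirmed by this note, so inspect _WORKING with PROC CONTENTS first.

```sas
/* Assumes WORK._WORKING was created by an earlier %SUM_GLM call and   */
/* still contains the example's treat and bygroup variables (unverified) */
proc glm data=_working;
  by bygroup;
  class treat;
  model y = treat;                        /* Y is the response created by the macro */
  freq freq;                              /* frequency variable required for _WORKING */
  lsmeans treat / pdiff adjust=tukey cl;  /* Tukey-adjusted differences with confidence limits */
run;
quit;
```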

LIMITATIONS:

Only one-way models are addressed in Larson's paper and in this macro. No error checking is performed in the macro.
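Because the macro does no error checking, it may be worth screening the summary data before calling it. The following DATA step is a hypothetical pre-check, not part of %SUM_GLM. It assumes the summary data set and variable names from the EXAMPLE section (summary, treat, count, means, std) and writes a warning to the log for each suspicious observation.

```sas
/* Hypothetical pre-check; names follow the summary data set in this sample's example */
data _null_;
  set summary;
  if missing(treat) then
    putlog "WARNING: missing group level at observation " _n_;
  /* a missing count also satisfies count < 1 */
  if missing(count) or count < 1 or count ne int(count) then
    putlog "WARNING: invalid sample size at observation " _n_ count=;
  if missing(means) then
    putlog "WARNING: missing mean at observation " _n_;
  /* a standard deviation is only defined for groups with more than one observation */
  if count > 1 and (missing(std) or std < 0) then
    putlog "WARNING: invalid standard deviation at observation " _n_ std=;
run;
```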

REFERENCES:

Larson, David A. (1992), "Analysis of Variance With Just Summary Statistics as Input," The American Statistician, 46, 151-152.

These sample files and code examples are provided by SAS Institute Inc. "as is" without warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability and fitness for a particular purpose. Recipients acknowledge and agree that SAS Institute shall not be liable for any damages whatsoever arising out of their use of this material. In addition, SAS Institute will provide no support for the materials contained herein.

EXAMPLE:

In this example, a one-way analysis is done for each of two BY groups (A and B). To show that analyzing the summary data is equivalent to analyzing the original data, the example begins with the analysis of an unsummarized data set. The data are then summarized, and the summarized data are analyzed using the %SUM_GLM macro.

       data fulldata;
         input bygroup $ treat response @@;
       cards;
       A 1  7.6  A 1  8.3  A 1  7.6
       A 2  8.5  A 2  8.7  A 2  7.7  A 2  8.3  A 2  8.7
       A 3  6.8  A 3  6.7  A 3  6.6  A 3  6.4
       A 4  7.4  A 4  6.5  A 4  6.8
       B 1 15.5  B 1 13.8  B 1 14.2  B 1 17.3
       B 2 10.6  B 2 12.6  B 2 15.7  B 2 12.6  B 2 13.5  B 2 11.8
       B 3 20.5  B 3 17.7  B 3 19.1  B 3 21.1  B 3 16.9  B 3 18.7
       B 4 16.4  B 4 13.8  B 4 17.4  B 4 18.8  B 4 19.1
       B 5 16.1  B 5 14.4  B 5 13.0
       ;

The following statements perform the one-way analysis of the original unsummarized data. For brevity, the ODS SELECT statement restricts the tables that are displayed.

       ods select overallanova fitstatistics lsmeans diff;
       proc glm data=fulldata;
         title "One-way analysis of unsummarized data";
         by bygroup;
         class treat;
         model response = treat;
         lsmeans treat / stderr tdiff e;
         run;

Following are the analysis results for each BY group.

One-way analysis of unsummarized data

The GLM Procedure

Dependent Variable: response

bygroup=A

Source	DF	Sum of Squares	Mean Square	F Value	Pr > F
Model	3	8.34716667	2.78238889	20.11	<.0001
Error	11	1.52216667	0.13837879
Corrected Total	14	9.86933333

R-Square	Coeff Var	Root MSE	response Mean
0.845768	4.955502	0.371993	7.506667

One-way analysis of unsummarized data

The GLM Procedure

Least Squares Means

bygroup=A

treat	response LSMEAN	Standard Error	Pr > |t|	LSMEAN Number
1	7.83333333	0.21477026	<.0001	1
2	8.38000000	0.16636032	<.0001	2
3	6.62500000	0.18599650	<.0001	3
4	6.90000000	0.21477026	<.0001	4

Least Squares Means for Effect treat
t for H0: LSMean(i)=LSMean(j) / Pr > |t|
Dependent Variable: response
i/j	1	2	3	4
1		-2.01228 0.0693	4.252983 0.0014	3.072894 0.0106
2	2.01228 0.0693		7.032927 <.0001	5.447881 0.0002
3	-4.25298 0.0014	-7.03293 <.0001		-0.96792 0.3539
4	-3.07289 0.0106	-5.44788 0.0002	0.96792 0.3539

One-way analysis of unsummarized data

The GLM Procedure

Dependent Variable: response

bygroup=B

Source	DF	Sum of Squares	Mean Square	F Value	Pr > F
Model	4	130.3183333	32.5795833	10.61	0.0001
Error	19	58.3200000	3.0694737
Corrected Total	23	188.6383333

R-Square	Coeff Var	Root MSE	response Mean
0.690837	11.04776	1.751991	15.85833

One-way analysis of unsummarized data

The GLM Procedure

Least Squares Means

bygroup=B

treat	response LSMEAN	Standard Error	Pr > |t|	LSMEAN Number
1	15.2000000	0.8759957	<.0001	1
2	12.8000000	0.7152475	<.0001	2
3	19.0000000	0.7152475	<.0001	3
4	17.1000000	0.7835144	<.0001	4
5	14.5000000	1.0115127	<.0001	5

Least Squares Means for Effect treat
t for H0: LSMean(i)=LSMean(j) / Pr > |t|
Dependent Variable: response
i/j	1	2	3	4	5
1		2.122193 0.0472	-3.36014 0.0033	-1.61665 0.1224	0.523128 0.6069
2	-2.12219 0.0472		-6.12943 <.0001	-4.05323 0.0007	-1.37225 0.1860
3	3.360139 0.0033	6.129434 <.0001		1.79096 0.0892	3.632416 0.0018
4	1.616648 0.1224	4.053226 0.0007	-1.79096 0.0892		2.032086 0.0564
5	-0.52313 0.6069	1.372246 0.1860	-3.63242 0.0018	-2.03209 0.0564

These statements display the summary statistics for each BY group of the original data.

       proc sort data=fulldata;
         by bygroup;
         run;
       proc means data=fulldata mean std;
         by bygroup;
         class treat;
         var response;
         title "Summary statistics from original data";
         run;

Summary statistics from original data

The MEANS Procedure

bygroup=A

Analysis Variable : response
treat	N Obs	Mean	Std Dev
1	3	7.8333333	0.4041452
2	5	8.3800000	0.4147288
3	4	6.6250000	0.1707825
4	3	6.9000000	0.4582576

bygroup=B

Analysis Variable : response
treat	N Obs	Mean	Std Dev
1	4	15.2000000	1.5769168
2	6	12.8000000	1.7216271
3	6	19.0000000	1.6037456
4	5	17.1000000	2.1424285
5	3	14.5000000	1.5524175

The following illustrates creating the input data set of summary statistics using a DATA step. This is the method you would use if you were presented with a listing of the summary statistics such as the above.

       data summary;
         input count means std bygroup $ treat;
         cards;
       3  7.8333  0.4041 A 1
       5  8.3800  0.4147 A 2
       4  6.6250  0.1708 A 3
       3  6.9000  0.4583 A 4
       4  15.200  1.5769 B 1
       6  12.800  1.7216 B 2
       6  19.000  1.6038 B 3
       5  17.100  2.1424 B 4
       3  14.500  1.5524 B 5
       ;

Since we have the unsummarized data in this example, note that the summary data set could be created using PROC SUMMARY as follows:

       proc summary data=fulldata nway;
         class bygroup treat;
         var response;
         output out=summary2 mean=means std=std n=count;
         run;
       proc print data=summary2;
         run;

Obs	bygroup	treat	_TYPE_	_FREQ_	means	std	count
1	A	1	3	3	7.8333	0.40415	3
2	A	2	3	5	8.3800	0.41473	5
3	A	3	3	4	6.6250	0.17078	4
4	A	4	3	3	6.9000	0.45826	3
5	B	1	3	4	15.2000	1.57692	4
6	B	2	3	6	12.8000	1.72163	6
7	B	3	3	6	19.0000	1.60375	6
8	B	4	3	5	17.1000	2.14243	5
9	B	5	3	3	14.5000	1.55242	3

While not necessary in this case since the data are in sorted order, the input data set must generally be sorted by the BY variables before analysis by the macro.

       proc sort data=summary;
         by bygroup;
         run;

The following statements define and run the SUM_GLM macro on the summary statistics to reproduce the analyses of the original data. Slight numerical differences from the analysis of unsummarized data are due to using limited precision when inputting the summary statistics above.

       /* Define the SUM_GLM macro */
       %inc "<location of your file containing the SUM_GLM macro>";
       
       ods select overallanova fitstatistics lsmeans diff;
       %sum_glm(Data=summary,
                N=count,
                Mean=means,
                StdDev=std,
                LSopts=stderr tdiff e,
                By=bygroup,
                Group=Treat)

SUM_GLM Macro: Analysis of Variance on Summary Statistics

The GLM Procedure

Dependent Variable: y

bygroup=A

Source	DF	Sum of Squares	Mean Square	F Value	Pr > F
Model	3	8.34710134	2.78236711	20.11	<.0001
Error	11	1.52209368	0.13837215
Corrected Total	14	9.86919502

R-Square	Coeff Var	Root MSE	y Mean
0.845773	4.955387	0.371984	7.506660

SUM_GLM Macro: Analysis of Variance on Summary Statistics

The GLM Procedure

Least Squares Means

bygroup=A

treat	y LSMEAN	Standard Error	Pr > |t|	LSMEAN Number
1	7.83330000	0.21476511	<.0001	1
2	8.38000000	0.16635634	<.0001	2
3	6.62500000	0.18599204	<.0001	3
4	6.90000000	0.21476511	<.0001	4

Least Squares Means for Effect treat
t for H0: LSMean(i)=LSMean(j) / Pr > |t|
Dependent Variable: y
i/j	1	2	3	4
1		-2.01245 0.0693	4.252967 0.0014	3.072858 0.0106
2	2.012451 0.0693		7.033096 <.0001	5.448011 0.0002
3	-4.25297 0.0014	-7.0331 <.0001		-0.96794 0.3539
4	-3.07286 0.0106	-5.44801 0.0002	0.967943 0.3539

SUM_GLM Macro: Analysis of Variance on Summary Statistics

The GLM Procedure

Dependent Variable: y

bygroup=B

Source	DF	Sum of Squares	Mean Square	F Value	Pr > F
Model	4	130.3183333	32.5795833	10.61	0.0001
Error	19	58.3196484	3.0694552
Corrected Total	23	188.6379817

R-Square	Coeff Var	Root MSE	y Mean
0.690838	11.04773	1.751986	15.85833

SUM_GLM Macro: Analysis of Variance on Summary Statistics

The GLM Procedure

Least Squares Means

bygroup=B

treat	y LSMEAN	Standard Error	Pr > |t|	LSMEAN Number
1	15.2000000	0.8759930	<.0001	1
2	12.8000000	0.7152453	<.0001	2
3	19.0000000	0.7152453	<.0001	3
4	17.1000000	0.7835120	<.0001	4
5	14.5000000	1.0115096	<.0001	5

Least Squares Means for Effect treat
t for H0: LSMean(i)=LSMean(j) / Pr > |t|
Dependent Variable: y
i/j	1	2	3	4	5
1		2.1222 0.0472	-3.36015 0.0033	-1.61665 0.1224	0.523129 0.6069
2	-2.1222 0.0472		-6.12945 <.0001	-4.05324 0.0007	-1.37225 0.1860
3	3.360149 0.0033	6.129452 <.0001		1.790966 0.0892	3.632427 0.0018
4	1.616653 0.1224	4.053238 0.0007	-1.79097 0.0892		2.032092 0.0564
5	-0.52313 0.6069	1.37225 0.1860	-3.63243 0.0018	-2.03209 0.0564
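The sums of squares in the summary-data analysis can be checked by hand with the standard one-way ANOVA identities, which depend only on each group's sample size n_i, mean, and standard deviation s_i. The notation below is introduced here for illustration (k is the number of groups). The worked numbers use the rounded BY group A inputs from the summary data set above.

```latex
\begin{align*}
N &= \sum_{i=1}^{k} n_i, \qquad
\bar{y} = \frac{1}{N}\sum_{i=1}^{k} n_i\,\bar{y}_i \\
SS_{\mathrm{Model}} &= \sum_{i=1}^{k} n_i\,(\bar{y}_i - \bar{y})^2
  = \sum_{i=1}^{k} n_i\,\bar{y}_i^{\,2} - N\bar{y}^{\,2} \\
SS_{\mathrm{Error}} &= \sum_{i=1}^{k} (n_i - 1)\,s_i^2 \\
F &= \frac{SS_{\mathrm{Model}}/(k-1)}{SS_{\mathrm{Error}}/(N-k)} \\[1ex]
\text{BY group A:}\quad
\bar{y} &= \frac{3(7.8333) + 5(8.38) + 4(6.625) + 3(6.9)}{15}
  = \frac{112.5999}{15} = 7.50666 \\
SS_{\mathrm{Error}} &= 2(0.4041)^2 + 4(0.4147)^2 + 3(0.1708)^2 + 2(0.4583)^2
  = 1.52209368 \\
SS_{\mathrm{Model}} &= 853.59626667 - 845.24916533 = 8.34710134 \\
F &= \frac{8.34710134/3}{1.52209368/11}
  = \frac{2.78236711}{0.13837215} \approx 20.11
\end{align*}
```

These values agree with the Model and Error sums of squares, the y Mean, and the F value reported for bygroup=A in the %SUM_GLM output.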

Type:	Sample
Topic:	Analytics ==> Regression
	Analytics ==> Analysis of Variance
	Analytics ==> Longitudinal Analysis
	SAS Reference ==> Procedures ==> GLM

Date Modified:	2007-08-14 03:03:09
Date Created:	2005-01-13 15:03:51

Product Family	Product	Host	SAS Release (Starting)	SAS Release (Ending)
SAS System	SAS/STAT	All	n/a	n/a
