The SURVEYIMPUTE Procedure

Getting Started: SURVEYIMPUTE Procedure

This example shows how you can use PROC SURVEYIMPUTE to impute missing values and compute imputation-adjusted statistics for sample survey data. The example uses simulated data from a customer satisfaction survey for a student information system (SIS), which is a software product that provides modules for student registration, class scheduling, attendance, grade reporting, and other functions.

The software company conducted a survey of school personnel who use the SIS. A probability sample of SIS users was selected from the study population, which included SIS users at middle schools and high schools in three states: Georgia, South Carolina, and North Carolina. The sample design for this survey was a two-stage stratified design. A first-stage sample of schools was selected from the list of schools in the three states that use the SIS. The list of schools, which are the primary sampling units (PSUs), was stratified by state and by customer status (whether the school was a new user or a renewal user of the system). Within the strata, schools were selected with probability proportional to size and with replacement, where the size measure was school enrollment. From each sample school, five staff members were randomly selected with replacement as the second-stage units to complete the SIS satisfaction questionnaire. These staff members include both teachers and administrators.

The SAS data set SIS_Survey_Sub contains the survey results and the sample design information that is needed to analyze the data. The data set contains the following items:

State: state where the school is located
NewUser: 1 if the school is a new user of SIS or 0 if not
School: school identification (PSU)
SamplingWeight: sampling weight
Department: 0 for teachers and 1 for administrators
Response: coded from 1 to 5, where 1 represents "Very Unsatisfied" and 5 represents "Very Satisfied"

The following statements request the imputation of missing values for Department and Response by using the fully efficient fractional imputation (FEFI) method:

proc surveyimpute data=SIS_Survey_Sub method=fefi varmethod=jackknife;
   class Department Response;
   var Department Response;
   strata State NewUser;
   cluster School;
   weight SamplingWeight;
   output out=SIS_Survey_Imputed outjkcoefs=SIS_JKCoefs;
run;

The PROC SURVEYIMPUTE statement invokes the procedure. The DATA= option in the PROC SURVEYIMPUTE statement specifies the input data set that contains the missing values, the METHOD=FEFI option requests the fully efficient fractional imputation method, and the VARMETHOD=JACKKNIFE option requests imputation-adjusted jackknife replicate weights. The CLASS statement specifies the classification variables. The STRATA, CLUSTER, and WEIGHT statements specify the strata, clusters (PSUs), and weight variables. The VAR statement specifies the variables to be imputed (Department and Response). By default, Department and Response are imputed jointly. Therefore, missing values for Department are imputed conditionally on the observed levels of Response, and missing values for Response are imputed conditionally on the observed levels of Department. Observations that have missing values for both Department and Response are imputed by using the joint observed levels of Department and Response. The OUT= option in the OUTPUT statement names a SAS data set to save the imputed data. The OUTJKCOEFS= option in the OUTPUT statement names a SAS data set to save the jackknife coefficients.

Summary information about the data, CLASS levels, and survey design is shown in Figure 110.1. The "Imputation Information" table lists the input data set, the weight, stratum, and cluster variables, and the imputation method. The "Number of Observations" table displays the number of observations that PROC SURVEYIMPUTE reads and uses. This table also displays the sum of weights that are read and used. The sum of weights read (6,468) can be used as an estimator of the population size. That is, the 235 observation units in the SIS_Survey_Sub data set represent an estimated 6,468 teachers and administrative staff in the population. The "Class Level Information" table shows that Department has two levels and Response has five levels. The "Design Summary" table shows that 47 schools are selected in the sample from six strata.

Figure 110.1: Summary Information

The SURVEYIMPUTE Procedure

Imputation Information
Data Set	WORK.SIS_SURVEY_SUB
Weight Variable	SamplingWeight
Stratum Variables	State
	NewUser
Cluster Variable	School
Imputation Method	FEFI

Number of Observations Read	235
Number of Observations Used	235
Sum of Weights Read	6468
Sum of Weights Used	6468

Class Level Information
Class	Levels	Values
Department	2	0 1
Response	5	1 2 3 4 5

Design Summary
Number of Strata	6
Number of Clusters	47
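The counts and the sum of weights reported in Figure 110.1 can be cross-checked directly against the input data set. The following statements are a minimal sketch of such a check, not part of the original example. The PROC SQL step assumes that school identifiers are unique across strata.

```sas
/* Number of observation units and sum of sampling weights;
   these should agree with the "Number of Observations" table. */
proc means data=SIS_Survey_Sub n sum maxdec=0;
   var SamplingWeight;
run;

/* Distinct strata (State by NewUser) and distinct schools (PSUs);
   assumes School values are not reused in different strata. */
proc sql;
   select count(distinct catx('|', State, NewUser)) as nStrata,
          count(distinct School)                    as nSchools
   from SIS_Survey_Sub;
quit;
```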

The "Missing Data Patterns" table in Figure 110.2 lists the distinct missing data patterns along with their frequencies, sums of weights, and unweighted and weighted percentages. An "X" means that the variable is observed in the corresponding group, and a "." means that the variable is missing. The table also displays group-specific variable means. In this hypothetical example, five sample units have unit nonresponse (both variables in the VAR statement contain missing values), 73 have item nonresponse (only one variable in the VAR statement contains a missing value), and 157 have complete response (no variables in the VAR statement contain missing values). Among the 73 item nonrespondents, Department is observed but Response is missing for 52, and Response is observed but Department is missing for 21. The unweighted percentages in the sample for unit nonresponse, item nonresponse, and complete response are 2.1%, 31.1%, and 66.8%, respectively; the corresponding weighted percentages are 2.01%, 31.94%, and 66.05%.

Figure 110.2: Missing Data Patterns

Missing Data Patterns
Group	Department	Response	Freq	Sum of Weights	Unweighted Percent	Weighted Percent	Group Mean: Department 0	Group Mean: Department 1	Group Mean: Response 1	Group Mean: Response 2	Group Mean: Response 3	Group Mean: Response 4	Group Mean: Response 5
1	X	X	157	4272	66.81	66.05	0.440309	0.559691	0.184457	0.206695	0.265684	0.209738	0.133427
2	X	.	52	1480	22.13	22.88	0.641892	0.358108	.	.	.	.	.
3	.	X	21	586	8.94	9.06	.	.	0.261092	0.235495	0.230375	0.085324	0.187713
4	.	.	5	130	2.13	2.01	.	.	.	.	.	.	.
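The frequencies and percentages in Figure 110.2 can be approximated by hand, which is a useful check before imputing. The following sketch builds a two-character pattern flag and tabulates it with and without the sampling weight. The data set name _patterns and the variable Pattern are hypothetical names introduced only for this illustration.

```sas
data _patterns;
   set SIS_Survey_Sub;
   length Pattern $2;
   /* "X" = observed, "." = missing; order is Department, Response */
   Pattern = cats(ifc(missing(Department), '.', 'X'),
                  ifc(missing(Response),   '.', 'X'));
run;

/* Unweighted counts and percents (Freq, Unweighted Percent) */
proc freq data=_patterns;
   tables Pattern;
run;

/* Weighted frequencies and percents (Sum of Weights, Weighted Percent) */
proc freq data=_patterns;
   tables Pattern;
   weight SamplingWeight;
run;
```

The weighted PROC FREQ step is used here only for its point estimates; it does not account for the sample design.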

The "Imputation Summary" table in Figure 110.3 lists the number of nonmissing observations, missing observations, and imputed observations. There are 78 observations that have missing values for at least one variable, and all 78 missing observations are imputed.

Figure 110.3: Imputation Summary

Imputation Summary
Observation Status	Number of Observations	Sum of Weights
Nonmissing	157	4272
Missing	78	2196
Missing, Imputed	78	2196
Missing, Not Imputed	0	0
Missing, Partially Imputed	0	0

The output data set SIS_Survey_Imputed contains the observed data and the imputed values for Department and Response. In addition, this data set contains the imputation-adjusted full-sample weight (ImpWt), observation unit identification (UnitId), recipient index (Recipient), and imputation-adjusted jackknife replicate weights (ImpRepWt_1, …, ImpRepWt_47).
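One way to get a feel for the structure of the imputed data set is to summarize it by observation unit. The following sketch counts the rows that belong to each unit and sums the imputation-adjusted weights within the unit. It assumes that UnitId takes one value per original observation unit, as the description above suggests. The data set name _unit_check and the variables nRows and SumImpWt are hypothetical.

```sas
proc sql;
   create table _unit_check as
   select UnitId,
          count(*)   as nRows,      /* rows contributed by this unit  */
          sum(ImpWt) as SumImpWt    /* adjusted weight for this unit  */
   from SIS_Survey_Imputed
   group by UnitId;
quit;

/* N should equal the number of observation units, and the sum of
   SumImpWt should equal the sum of weights in the original data.  */
proc means data=_unit_check n sum min max;
   var nRows SumImpWt;
run;
```

Under fractional imputation, one would expect fully observed units to contribute a single row and units with missing items to contribute several rows whose adjusted weights add up to the unit's sampling weight.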

Suppose you want to compute frequency tables by using the imputed data set. The following statements request one-way tables for Department and Response and a two-way table for Department by Response. The analyses include the imputed values and account for both the design variance and the imputation variance.

proc surveyfreq data=SIS_Survey_Imputed varmethod=jackknife;
   tables Department Response Department*Response;
   weight ImpWt;
   repweights ImpRepWt: / jkcoefs=SIS_JKCoefs;
run;

The DATA= option in the PROC SURVEYFREQ statement specifies the input data set for analysis, SIS_Survey_Imputed, which contains the observed values and the imputed values for Department and Response. The FEFI technique uses multiple donor cells for a missing item. Therefore, the number of rows in the SIS_Survey_Imputed data set is greater than the number of rows in the observed data set, SIS_Survey_Sub. Each row in the SIS_Survey_Sub data set represents an observation unit, but this is not true for the SIS_Survey_Imputed data set. For this reason, you should use only the weighted statistics from SIS_Survey_Imputed. The WEIGHT statement specifies the weight variable ImpWt, which is adjusted for the FEFI method. The imputation-adjusted jackknife replicate weights are saved in the variables ImpRepWt_1, …, ImpRepWt_47 in the SIS_Survey_Imputed data set. The REPWEIGHTS statement names the replicate weight variables and the jackknife coefficients data set, SIS_JKCoefs. You should not use the unadjusted full-sample weights (SamplingWeight) or unadjusted replicate weights along with the imputed data.

Figure 110.4 displays the data summary and the variance estimation information. The sum of weights in Figure 110.4 (6,468) matches the sum of weights read in Figure 110.1, but the number of observations in Figure 110.4 (509) does not match the number of observations in Figure 110.1 (235). The sum of weights from both PROC SURVEYIMPUTE and PROC SURVEYFREQ estimates the population size. The number of observations in Figure 110.1 is the number of observation units, whereas the number of observations in Figure 110.4 is the number of rows in the imputed data set, which includes both the observed rows and the imputed rows. The number of replicates is 47, which is the same as the number of schools (PSUs).

Figure 110.4: Data Summary and Variance Estimation

The SURVEYFREQ Procedure

Data Summary
Number of Observations	509
Sum of Weights	6468

Variance Estimation
Method	Jackknife
Replicate Weights	SIS_SURVEY_IMPUTED
Number of Replicates	47
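The 509 rows can be accounted for from the missing data patterns in Figure 110.2. The following arithmetic sketch assumes that every incomplete unit receives one row for each possible level, or level combination, of its missing items: five rows when only Response is missing, two when only Department is missing, and ten when both are missing. That assumption is not stated in the output, but it reproduces the reported row count.

```sas
data _null_;
   nDept = 2;   /* levels of Department */
   nResp = 5;   /* levels of Response   */
   rows = 157                  /* complete units: one row each    */
        + 52 * nResp           /* Response missing                */
        + 21 * nDept           /* Department missing              */
        +  5 * nDept * nResp;  /* both missing                    */
   put rows=;                  /* writes rows=509 to the SAS log  */
run;
```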

Figure 110.5 displays one-way tables for Department and Response. The Frequency column does not represent frequencies for observation units from the SIS_Survey_Sub data set. Instead, it counts rows in the SIS_Survey_Imputed data set, which includes the imputed rows. The Weighted Frequency, Std Err of Wgt Freq, Percent, and Std Err of Percent columns use the imputation-adjusted full-sample weight and replicate weights. You should use the weighted statistics from these columns. For example, an estimated 49.47% of SIS users are teachers, with a standard error of 6.64%. An estimated 14.19% of SIS users are "Very Satisfied," with a standard error of 3.77%.

Figure 110.5: One-Way Table

Table of Department
Department	Frequency	Weighted Frequency	Std Err of Wgt Freq	Percent	Std Err of Percent
0	278	3200	429.52229	49.4729	6.6407
1	231	3268	429.52229	50.5271	6.6407
Total	509	6468	5.3369E-11	100.000

Table of Response
Response	Frequency	Weighted Frequency	Std Err of Wgt Freq	Percent	Std Err of Percent
1	100	1256	291.92305	19.4153	4.5133
2	103	1371	361.02585	21.1976	5.5817
3	112	1710	305.26968	26.4371	4.7197
4	100	1213	283.69298	18.7598	4.3861
5	94	917.82544	243.87967	14.1903	3.7706
Total	509	6468	3.2868E-11	100.000

Figure 110.6 displays the two-way table for Department by Response. The Weighted Frequency, Std Err of Wgt Freq, Percent, and Std Err of Percent columns use the imputation-adjusted full-sample weight and replicate weights. You should use the weighted statistics from these columns. The Percent column is a percentage of the overall total, not of the row total. An estimated 8.10% of all SIS users are teachers who are "Very Satisfied," with a standard error of 3.11%. An estimated 6.09% of all SIS users are administrators who are "Very Satisfied," with a standard error of 2.43%.

Figure 110.6: Crosstabulation

Table of Department by Response
Department	Response	Frequency	Weighted Frequency	Std Err of Wgt Freq	Percent	Std Err of Percent
0	1	57	637.83724	246.18741	9.8614	3.8062
	2	55	743.50947	334.28335	11.4952	5.1683
	3	64	951.95811	258.23015	14.7180	3.9924
	4	49	342.84458	150.44168	5.3006	2.3259
	5	53	523.75680	200.99126	8.0977	3.1075
	Total	278	3200	429.52229	49.4729	6.6407
1	1	43	617.94159	209.04386	9.5538	3.2320
	2	48	627.55128	185.53346	9.7024	2.8685
	3	48	757.99401	237.82609	11.7191	3.6770
	4	51	870.53830	262.37514	13.4592	4.0565
	5	41	394.06863	156.95381	6.0926	2.4266
	Total	231	3268	429.52229	50.5271	6.6407
Total	1	100	1256	291.92305	19.4153	4.5133
	2	103	1371	361.02585	21.1976	5.5817
	3	112	1710	305.26968	26.4371	4.7197
	4	100	1213	283.69298	18.7598	4.3861
	5	94	917.82544	243.87967	14.1903	3.7706
	Total	509	6468	4.7965E-12	100.000
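The Percent column in Figure 110.6 expresses each cell as a share of the whole estimated population. If percentages within each department are of interest (for example, the share of teachers who are "Very Satisfied"), one option is to add the ROW option to the TABLES statement, as in the following sketch. This request is not part of the original example, and its output is not shown here.

```sas
proc surveyfreq data=SIS_Survey_Imputed varmethod=jackknife;
   /* ROW adds row percentages and their standard errors */
   tables Department*Response / row;
   weight ImpWt;
   repweights ImpRepWt: / jkcoefs=SIS_JKCoefs;
run;
```

As a rough check on the point estimate, dividing the cell percentage by the row total in Figure 110.6 (8.0977 / 49.4729) gives about 16.4% of teachers who are "Very Satisfied." The standard error of that row percentage cannot be derived from the displayed table and must come from the procedure.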