The SURVEYIMPUTE Procedure

Getting Started: SURVEYIMPUTE Procedure

This example shows how you can use PROC SURVEYIMPUTE to impute missing values and compute imputation-adjusted statistics for sample survey data. The example uses simulated data from a customer satisfaction survey for a student information system (SIS), which is a software product that provides modules for student registration, class scheduling, attendance, grade reporting, and other functions.

The software company conducted a survey of school personnel who use the SIS. A probability sample of SIS users was selected from the study population, which included SIS users at middle schools and high schools in three states: Georgia, South Carolina, and North Carolina. The sample design for this survey was a two-stage stratified design. A first-stage sample of schools was selected from the list of schools in the three states that use the SIS. The list of schools, which are the primary sampling units (PSUs), was stratified by state and by customer status (whether the school was a new user or a renewal user of the system). Within the strata, schools were selected with probability proportional to size and with replacement, where the size measure was school enrollment. From each sample school, five staff members were randomly selected with replacement as the second-stage units to complete the SIS satisfaction questionnaire. These staff members include both teachers and administrators.

The SAS data set SIS_Survey_Sub contains the survey results and the sample design information that is needed to analyze the data. The data set contains the following items:

  • State: state where the school is located

  • NewUser: 1 if the school is a new user of SIS or 0 if not

  • School: school identification (PSU)

  • SamplingWeight: sampling weight

  • Department: 0 for teachers and 1 for administrators

  • Response: coded from 1 to 5, where 1 represents "Very Unsatisfied" and 5 represents "Very Satisfied"

The following statements request the imputation of missing values for Department and Response by using the fully efficient fractional imputation (FEFI) method:

proc surveyimpute data=SIS_Survey_Sub method=fefi varmethod=jackknife;
   class Department Response;
   var Department Response;
   strata State NewUser;
   cluster School;
   weight SamplingWeight;
   output out=SIS_Survey_Imputed outjkcoefs=SIS_JKCoefs;
run;

The PROC SURVEYIMPUTE statement invokes the procedure. The DATA= option in the PROC SURVEYIMPUTE statement specifies the input data set that contains the missing values, the METHOD=FEFI option requests the fully efficient fractional imputation method, and the VARMETHOD=JACKKNIFE option requests the imputation-adjusted jackknife replicate weights. The CLASS statement specifies the classification variables. The STRATA, CLUSTER, and WEIGHT statements specify the strata, cluster (PSU), and weight variables, respectively. The VAR statement specifies the variables to be imputed (Department and Response). By default, Department and Response are imputed jointly. Therefore, the missing values for Department are imputed conditionally on the observed levels of Response, and the missing values for Response are imputed conditionally on the observed levels of Department. Observations that contain missing values for both Department and Response are imputed by using the joint observed levels of Department and Response. The OUT= option in the OUTPUT statement names a SAS data set to save the imputed data. The OUTJKCOEFS= option in the OUTPUT statement names a SAS data set to save the jackknife coefficients.

Summary information about the data, CLASS levels, and survey design is shown in Figure 110.1. The "Imputation Information" table summarizes the imputation information. The "Number of Observations" table displays the number of observations that PROC SURVEYIMPUTE reads and uses. This table also displays the sum of weights that are read and used. The sum of weights read (6,468) can be used as an estimator of the population size. For example, the 235 observation units in the SIS_Survey_Sub data set represent 6,468 teachers and administrative staff in the population. The "Class Level Information" table shows that Department has two levels and Response has five levels. The "Design Summary" table shows that 47 schools are selected in the sample from six strata.

Figure 110.1: Summary Information

The SURVEYIMPUTE Procedure

Imputation Information
Data Set             WORK.SIS_SURVEY_SUB
Weight Variable      SamplingWeight
Stratum Variables    State
                     NewUser
Cluster Variable     School
Imputation Method    FEFI

Number of Observations Read     235
Number of Observations Used     235
Sum of Weights Read            6468
Sum of Weights Used            6468

Class Level Information
Class         Levels    Values
Department         2    0 1
Response           5    1 2 3 4 5

Design Summary
Number of Strata       6
Number of Clusters    47
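
As a rough cross-check on the "Number of Observations" table, you can compute the number of observations and the sum of the sampling weights directly from the input data set. The following statements are a minimal sketch, not part of the original example; they assume that SIS_Survey_Sub is available in the WORK library, as in the PROC SURVEYIMPUTE step.

```sas
/* Count the observations and total the sampling weights in the input data */
proc means data=SIS_Survey_Sub n sum;
   var SamplingWeight;
run;
```

If the data match Figure 110.1, the N statistic should be 235 and the Sum statistic should be 6468, which is the estimated population size.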



The "Missing Data Patterns" table in Figure 110.2 lists the distinct missing data patterns along with their corresponding frequencies, unweighted percentages, and weighted percentages. An "X" means that the variable is observed in the corresponding group, and a "." means that the variable is missing. The table also displays group-specific variable means. In this hypothetical example, five respondents have unit nonresponse (both variables in the VAR statement contain missing values), 73 respondents have item nonresponse (only one variable in the VAR statement contains a missing value), and 157 respondents have complete response (no variables in the VAR statement contain missing values). Among the 73 item nonrespondents, Department is observed but Response is not observed for 52 respondents, and Response is observed but Department is not observed for 21 respondents. The unweighted percentages in the sample for unit nonresponse, item nonresponse, and complete response are 2.1%, 31.1%, and 66.8%, respectively.

Figure 110.2: Missing Data Patterns

Missing Data Patterns
Group  Department  Response  Freq  Sum of Weights  Unweighted Percent  Weighted Percent
1      X           X          157            4272               66.81             66.05
2      X           .           52            1480               22.13             22.88
3      .           X           21             586                8.94              9.06
4      .           .            5             130                2.13              2.01

Group Means
Group  Department 0  Department 1  Response 1  Response 2  Response 3  Response 4  Response 5
1          0.440309      0.559691    0.184457    0.206695    0.265684    0.209738    0.133427
2          0.641892      0.358108           .           .           .           .           .
3                 .             .    0.261092    0.235495    0.230375    0.085324    0.187713
4                 .             .           .           .           .           .           .
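
The pattern counts in the "Missing Data Patterns" table can also be tabulated by hand from the input data. The following statements are a sketch under stated assumptions, not part of the original example: the data set name Patterns and the indicator variables DeptObserved and RespObserved are hypothetical names introduced here for illustration.

```sas
/* Flag whether each variable is observed (1) or missing (0) */
data Patterns;
   set SIS_Survey_Sub;
   DeptObserved = not missing(Department);
   RespObserved = not missing(Response);
run;

/* Unweighted pattern frequencies and percentages */
proc freq data=Patterns;
   tables DeptObserved*RespObserved / list;
run;

/* Weighted pattern frequencies (sums of weights) and percentages */
proc freq data=Patterns;
   tables DeptObserved*RespObserved / list;
   weight SamplingWeight;
run;
```

The first PROC FREQ step should correspond to the Freq and Unweighted Percent columns, and the second step to the Sum of Weights and Weighted Percent columns. PROC FREQ does not account for the survey design, so use it here only to reproduce counts and percentages, not to estimate standard errors.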



The "Imputation Summary" table in Figure 110.3 lists the number of nonmissing observations, missing observations, and imputed observations. There are 78 observations that have missing values for at least one variable, and all 78 missing observations are imputed.

Figure 110.3: Imputation Summary

Imputation Summary
Observation Status            Number of Observations    Sum of Weights
Nonmissing                                       157              4272
Missing                                           78              2196
Missing, Imputed                                  78              2196
Missing, Not Imputed                               0                 0
Missing, Partially Imputed                         0                 0



The output data set SIS_Survey_Imputed contains the observed data and the imputed values for Department and Response. In addition, this data set contains the imputation-adjusted full-sample weight (ImpWt), observation unit identification (UnitId), recipient index (Recipient), and imputation-adjusted jackknife replicate weights (ImpRepWt_1, …, ImpRepWt_47).
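
To see how the imputed rows are laid out, you can print the first few rows of the output data set. The following statements are a minimal sketch, not part of the original example; they use only the variable names that are listed in the preceding paragraph, and the choice of 15 rows is arbitrary.

```sas
/* Inspect the first rows of the imputed data set */
proc print data=SIS_Survey_Imputed(obs=15);
   var UnitId Recipient Department Response ImpWt;
run;
```

An observation unit that has missing items can appear in more than one row, each with its own value of ImpWt, so rows in this listing should not be read as individual respondents.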

Suppose you want to compute frequency tables by using the imputed data set. The following statements request one-way tables for Department and Response and a two-way table for Department by Response. The analyses include the imputed values and account for both the design variance and the imputation variance.

proc surveyfreq data=SIS_Survey_Imputed varmethod=jackknife;
   tables Department Response Department*Response;
   weight ImpWt;
   repweights ImpRepWt: / jkcoefs=SIS_JKCoefs;
run;

The DATA= option in the PROC SURVEYFREQ statement specifies the input data set for analysis, SIS_Survey_Imputed, which contains the observed values and the imputed values for Department and Response. The FEFI technique uses multiple donor cells for a missing item. Therefore, the number of rows in the SIS_Survey_Imputed data set is greater than the number of rows in the observed data set, SIS_Survey_Sub. Each row in the SIS_Survey_Sub data set represents an observation unit, but this is not true for the SIS_Survey_Imputed data set. Therefore, it is very important to use only the weighted statistics from SIS_Survey_Imputed. The WEIGHT statement specifies the weight variable ImpWt, which is adjusted for the FEFI method. The imputation-adjusted jackknife replicate weights are saved in the variables ImpRepWt_1, …, ImpRepWt_47 in the SIS_Survey_Imputed data set. The REPWEIGHTS statement names the replicate weight variables and the jackknife coefficients data set, SIS_JKCoefs. You should not use the unadjusted full-sample weights (SamplingWeight) or unadjusted replicate weights along with the imputed data.

Figure 110.4 displays some summary information. Note that the sum of weights in Figure 110.4 matches the sum of weights read from Figure 110.1, but the number of observations in Figure 110.4 (509) does not match the number of observations from Figure 110.1 (235). The sum of weights from both PROC SURVEYIMPUTE and PROC SURVEYFREQ represents the population size. The number of observations in Figure 110.1 represents the number of observation units, but the number of observations in Figure 110.4 represents the number of rows in the data set that include the observed units and the imputed rows. The number of replicates is 47, which is the same as the number of schools (PSUs).

Figure 110.4: Data Summary

The SURVEYFREQ Procedure

Data Summary
Number of Observations 509
Sum of Weights 6468

Variance Estimation
Method Jackknife
Replicate Weights SIS_SURVEY_IMPUTED
Number of Replicates 47
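
The difference between rows and observation units in the imputed data set can be checked with a short query. The following statements are a sketch, not part of the original example; they assume that UnitId uniquely identifies each observation unit, as the description of the output data set suggests, and the column names Rows, Units, and SumImpWt are hypothetical labels chosen for illustration.

```sas
/* Compare the number of rows, the number of observation units,
   and the sum of the imputation-adjusted weights */
proc sql;
   select count(*)               as Rows,
          count(distinct UnitId) as Units,
          sum(ImpWt)             as SumImpWt
   from SIS_Survey_Imputed;
quit;
```

If the data match the figures, Rows should be 509, Units should be 235, and SumImpWt should be 6468.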



Figure 110.5 displays one-way tables for Department and Response. The Frequency column does not represent frequencies for observation units from the SIS_Survey_Sub data set. Instead, it counts the rows in the SIS_Survey_Imputed data set, which include the imputed rows. The Weighted Frequency, Std Err of Wgt Freq, Percent, and Std Err of Percent columns use the imputation-adjusted full-sample weight and replicate weights. You should use the weighted statistics from these columns. For example, an estimated 49.47% of SIS users are teachers, with a standard error of 6.64%. An estimated 14.19% of SIS users are "Very Satisfied," with a standard error of 3.77%.

Figure 110.5: One-Way Table

Table of Department
Department  Frequency  Weighted Frequency  Std Err of Wgt Freq   Percent  Std Err of Percent
0                 278                3200            429.52229   49.4729              6.6407
1                 231                3268            429.52229   50.5271              6.6407
Total             509                6468           5.3369E-11   100.000

Table of Response
Response    Frequency  Weighted Frequency  Std Err of Wgt Freq   Percent  Std Err of Percent
1                 100                1256            291.92305   19.4153              4.5133
2                 103                1371            361.02585   21.1976              5.5817
3                 112                1710            305.26968   26.4371              4.7197
4                 100                1213            283.69298   18.7598              4.3861
5                  94           917.82544            243.87967   14.1903              3.7706
Total             509                6468           3.2868E-11   100.000



Figure 110.6 displays the two-way table for Department by Response. The Weighted Frequency, Std Err of Wgt Freq, Percent, and Std Err of Percent columns use the imputation-adjusted full-sample weight and replicate weights. You should use the weighted statistics from these columns. The Percent column gives each cell as a percentage of all SIS users, not as a percentage within a department. An estimated 8.10% of SIS users are teachers who are "Very Satisfied," with a standard error of 3.11%. An estimated 6.09% of SIS users are administrators who are "Very Satisfied," with a standard error of 2.43%.

Figure 110.6: Crosstabulation

Table of Department by Response
Department  Response  Frequency  Weighted Frequency  Std Err of Wgt Freq   Percent  Std Err of Percent
0           1                57           637.83724            246.18741    9.8614              3.8062
            2                55           743.50947            334.28335   11.4952              5.1683
            3                64           951.95811            258.23015   14.7180              3.9924
            4                49           342.84458            150.44168    5.3006              2.3259
            5                53           523.75680            200.99126    8.0977              3.1075
            Total           278                3200            429.52229   49.4729              6.6407
1           1                43           617.94159            209.04386    9.5538              3.2320
            2                48           627.55128            185.53346    9.7024              2.8685
            3                48           757.99401            237.82609   11.7191              3.6770
            4                51           870.53830            262.37514   13.4592              4.0565
            5                41           394.06863            156.95381    6.0926              2.4266
            Total           231                3268            429.52229   50.5271              6.6407
Total       1               100                1256            291.92305   19.4153              4.5133
            2               103                1371            361.02585   21.1976              5.5817
            3               112                1710            305.26968   26.4371              4.7197
            4               100                1213            283.69298   18.7598              4.3861
            5                94           917.82544            243.87967   14.1903              3.7706
            Total           509                6468           4.7965E-12   100.000
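
The Percent column in the crosstabulation expresses each cell as a percentage of the overall weighted total. If you want the distribution of Response within each level of Department instead, one option is to request row percentages. The following statements are a sketch, not part of the original example; they reuse the same weight and replicate weight specifications and add the ROW option in the TABLES statement.

```sas
/* Request row percentages (Response within Department) and their standard errors */
proc surveyfreq data=SIS_Survey_Imputed varmethod=jackknife;
   tables Department*Response / row;
   weight ImpWt;
   repweights ImpRepWt: / jkcoefs=SIS_JKCoefs;
run;
```

As a rough check against the weighted frequencies in Figure 110.6, the row percentage of "Very Satisfied" among teachers should be approximately 523.76 / 3200, or about 16.4%.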