This example shows how you can use PROC SURVEYIMPUTE to impute missing values and compute imputation-adjusted statistics for sample survey data. The example uses simulated data from a customer satisfaction survey for a student information system (SIS), which is a software product that provides modules for student registration, class scheduling, attendance, grade reporting, and other functions.
The software company conducted a survey of school personnel who use the SIS. A probability sample of SIS users was selected from the study population, which included SIS users at middle schools and high schools in three states: Georgia, South Carolina, and North Carolina. The sample design for this survey was a two-stage stratified design. A first-stage sample of schools was selected from the list of schools in the three states that use the SIS. The list of schools, which are the primary sampling units (PSUs), was stratified by state and by customer status (whether the school was a new user or a renewal user of the system). Within the strata, schools were selected with probability proportional to size and with replacement, where the size measure was school enrollment. From each sample school, five staff members were randomly selected with replacement as the second-stage units to complete the SIS satisfaction questionnaire. These staff members include both teachers and administrators.
The SAS data set SIS_Survey_Sub contains the survey results and the sample design information that is needed to analyze the data. The data set contains the following items:
State: state where the school is located
NewUser: 1 if the school is a new user of SIS or 0 if not
School: school identification (PSU)
SamplingWeight: sampling weight
Department: 0 for teachers and 1 for administrators
Response: coded from 1 to 5, where 1 represents "Very Unsatisfied" and 5 represents "Very Satisfied"
The following statements request the imputation of missing values for Department and Response by using the fully efficient fractional imputation (FEFI) method:
proc surveyimpute data=SIS_Survey_Sub method=fefi varmethod=jackknife;
   class Department Response;
   var Department Response;
   strata State NewUser;
   cluster School;
   weight SamplingWeight;
   output out=SIS_Survey_Imputed outjkcoefs=SIS_JKCoefs;
run;
The PROC SURVEYIMPUTE statement invokes the procedure. The DATA= option specifies the input data set that contains the missing values, the METHOD=FEFI option requests the fully efficient fractional imputation method, and the VARMETHOD=JACKKNIFE option requests imputation-adjusted jackknife replicate weights. The CLASS statement specifies the classification variables. The STRATA, CLUSTER, and WEIGHT statements specify the strata, clusters (PSUs), and weight variables. The VAR statement specifies the variables to be imputed (Department and Response). By default, Department and Response are imputed jointly. Therefore, the missing values for Department are imputed conditionally on the observed levels of Response, and the missing values for Response are imputed conditionally on the observed levels of Department. Observations that contain missing values for both Department and Response are imputed by using the joint observed levels of Department and Response. The OUT= option in the OUTPUT statement names a SAS data set to save the imputed data. The OUTJKCOEFS= option in the OUTPUT statement names a SAS data set to save the jackknife coefficients.
Summary information about the data, CLASS levels, and survey design is shown in Figure 110.1. The "Imputation Information" table summarizes the imputation information. The "Number of Observations" table displays the number of observations that PROC SURVEYIMPUTE reads and uses. This table also displays the sum of weights that are read and used. The sum of weights read (6,468) can be used as an estimator of the population size. For example, the 235 observation units in the SIS_Survey_Sub data set represent 6,468 teachers and administrative staff in the population. The "Class Level Information" table shows that Department has two levels and Response has five levels. The "Design Summary" table shows that 47 schools are selected in the sample from six strata.
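As a quick cross-check of the counts quoted above, the following sketch sums the sampling weights directly from the input data set. It assumes that SamplingWeight has no missing values in SIS_Survey_Sub; under that assumption the N and Sum statistics should agree with the number of observations and the sum of weights that PROC SURVEYIMPUTE reports.

```sas
/* Sketch: count observation units and total the sampling weights.   */
/* Assumes SamplingWeight is nonmissing for every row in the input.  */
proc means data=SIS_Survey_Sub n sum;
   var SamplingWeight;
run;
```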
Figure 110.1: Summary Information
The "Missing Data Patterns" table in Figure 110.2 lists distinct missing data patterns along with their corresponding frequencies and weighted percentages. An "X" means that the variable is observed in the corresponding group, and a "." means that the variable is missing. The table also displays group-specific variable means. In this hypothetical example, five respondents have unit nonresponse (both variables in the VAR statement contain missing values), 73 respondents have item nonresponse (only one variable in the VAR statement contains a missing value), and 157 respondents have complete response (no variables in the VAR statement contain missing values). Among the 73 item nonrespondents, Department is observed but Response is not observed for 52 respondents, and Response is observed but Department is not observed for 21 respondents. The unweighted percentages in the sample for unit nonresponse, item nonresponse, and complete response are 2.1%, 31.1%, and 66.8%, respectively.
Figure 110.2: Missing Data Patterns
Missing Data Patterns

Group | Department | Response | Freq | Sum of Weights | Unweighted Percent | Weighted Percent | Group Mean: Department 0 | Group Mean: Department 1 | Group Mean: Response 1 | Group Mean: Response 2 | Group Mean: Response 3 | Group Mean: Response 4 | Group Mean: Response 5 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | X | X | 157 | 4272 | 66.81 | 66.05 | 0.440309 | 0.559691 | 0.184457 | 0.206695 | 0.265684 | 0.209738 | 0.133427 |
2 | X | . | 52 | 1480 | 22.13 | 22.88 | 0.641892 | 0.358108 | . | . | . | . | . |
3 | . | X | 21 | 586 | 8.94 | 9.06 | . | . | 0.261092 | 0.235495 | 0.230375 | 0.085324 | 0.187713 |
4 | . | . | 5 | 130 | 2.13 | 2.01 | . | . | . | . | . | . | . |
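The percentages for the three response categories can be reproduced from the Freq and Sum of Weights columns of the table. The following is a worked check rather than additional procedure output; it uses the 235 observation units and the sum of weights of 6,468 reported earlier, and it combines groups 2 and 3 for item nonresponse.

$$
\begin{aligned}
\text{Unit nonresponse:} \quad & \frac{5}{235} \approx 2.13\% \ \text{(unweighted)}, & \frac{130}{6468} &\approx 2.01\% \ \text{(weighted)} \\
\text{Item nonresponse:} \quad & \frac{52 + 21}{235} \approx 31.06\% \ \text{(unweighted)}, & \frac{1480 + 586}{6468} &\approx 31.94\% \ \text{(weighted)} \\
\text{Complete response:} \quad & \frac{157}{235} \approx 66.81\% \ \text{(unweighted)}, & \frac{4272}{6468} &\approx 66.05\% \ \text{(weighted)}
\end{aligned}
$$

The unweighted and weighted figures differ slightly because the sampling weights are not equal across respondents.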
The "Imputation Summary" table in Figure 110.3 lists the number of nonmissing observations, missing observations, and imputed observations. There are 78 observations that have missing values for at least one variable, and all 78 missing observations are imputed.
Figure 110.3: Imputation Summary
The output data set SIS_Survey_Imputed contains the observed data and the imputed values for Department and Response. In addition, this data set contains the imputation-adjusted full-sample weight (ImpWt), observation unit identification (UnitId), recipient index (Recipient), and imputation-adjusted jackknife replicate weights (ImpRepWt_1, …, ImpRepWt_47).
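One way to see how an incomplete observation unit is spread over several imputed rows is to total ImpWt within each UnitId. The sketch below is illustrative only. It assumes that the original SamplingWeight variable is carried into SIS_Survey_Imputed along with the other input variables, and that the fractional weights for a unit sum to one, in which case SumImpWt should equal the unit's original sampling weight.

```sas
/* Sketch: list units that were expanded into more than one row and  */
/* compare the total adjusted weight with the original weight.       */
/* Assumption: SamplingWeight is present in the imputed data set.    */
proc sql;
   select UnitId,
          count(*)            as NRows,
          sum(ImpWt)          as SumImpWt,
          max(SamplingWeight) as OrigWeight
   from SIS_Survey_Imputed
   group by UnitId
   having count(*) > 1;
quit;
```

Units that have complete responses are excluded by the HAVING clause because they occupy a single row.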
Suppose you want to compute frequency tables by using the imputed data set. The following statements request one-way tables for Department and Response and a two-way table for Department by Response. The analyses include the imputed values and account for both the design variance and the imputation variance.
proc surveyfreq data=SIS_Survey_Imputed varmethod=jackknife;
   table department response department*response;
   weight ImpWt;
   repweights ImpRepWt: / jkcoefs=SIS_JKCoefs;
run;
The DATA= option in the PROC SURVEYFREQ statement specifies the input data set for analysis, SIS_Survey_Imputed, which contains the observed values and the imputed values for Department and Response. The FEFI technique uses multiple donor cells for a missing item. Therefore, the number of rows in the SIS_Survey_Imputed data set is greater than the number of rows in the observed data set, SIS_Survey_Sub. Each row in the SIS_Survey_Sub data set represents an observation unit, but this is not true for the SIS_Survey_Imputed data set. Therefore, it is very important to use only the weighted statistics from SIS_Survey_Imputed. The WEIGHT statement specifies the weight variable ImpWt, which is adjusted for the FEFI method. The imputation-adjusted jackknife replicate weights are saved in the variables ImpRepWt_1, …, ImpRepWt_47 in the SIS_Survey_Imputed data set. The REPWEIGHTS statement names the replicate weight variables and the jackknife coefficients data set, SIS_JKCoefs. You should not use the unadjusted full-sample weights (SamplingWeight) or unadjusted replicate weights along with the imputed data.
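The same combination of ImpWt, the ImpRepWt_ replicate weights, and the jackknife coefficients data set can be passed to other survey analysis procedures. As a minimal sketch, the following statements estimate the mean satisfaction score with PROC SURVEYMEANS. Treating the ordinal Response code (1 to 5) as a numeric score is an assumption made here purely for illustration and is not part of the original example.

```sas
/* Sketch: imputation-adjusted mean of the satisfaction score.       */
/* Assumption: Response is analyzed as a numeric 1-5 score.          */
proc surveymeans data=SIS_Survey_Imputed varmethod=jackknife mean;
   var Response;
   weight ImpWt;
   repweights ImpRepWt: / jkcoefs=SIS_JKCoefs;
run;
```

Because the replicate weights are supplied directly, no STRATA or CLUSTER statement appears in this sketch, which mirrors the PROC SURVEYFREQ step above.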
Figure 110.4 displays some summary information. Note that the sum of weights in Figure 110.4 matches the sum of weights read from Figure 110.1, but the number of observations in Figure 110.4 (509) does not match the number of observations from Figure 110.1 (235). The sum of weights from both PROC SURVEYIMPUTE and PROC SURVEYFREQ represents the population size. The number of observations in Figure 110.1 represents the number of observation units, but the number of observations in Figure 110.4 represents the number of rows in the data set that include the observed units and the imputed rows. The number of replicates is 47, which is the same as the number of schools (PSUs).
Figure 110.4: One-Way Table
Figure 110.5 displays one-way tables for Department and Response. The Frequency column does not represent frequencies for observation units from the SIS_Survey_Sub data set. These frequencies represent the number of rows in the SIS_Survey_Imputed data set, which includes the imputed rows. The Weighted Frequency, Std Err of Wgt Freq, Percent, and Std Err of Percent columns use the imputation-adjusted full-sample weight and replicate weights. You should use the weighted statistics from these columns. For example, an estimated 49.47% of SIS users are teachers, with a standard error of 6.64%. An estimated 14.19% of SIS users are "Very Satisfied," with a standard error of 3.77%.
Figure 110.5: One-Way Table
Table of Response

Response | Frequency | Weighted Frequency | Std Err of Wgt Freq | Percent | Std Err of Percent |
---|---|---|---|---|---|
1 | 100 | 1256 | 291.92305 | 19.4153 | 4.5133 |
2 | 103 | 1371 | 361.02585 | 21.1976 | 5.5817 |
3 | 112 | 1710 | 305.26968 | 26.4371 | 4.7197 |
4 | 100 | 1213 | 283.69298 | 18.7598 | 4.3861 |
5 | 94 | 917.82544 | 243.87967 | 14.1903 | 3.7706 |
Total | 509 | 6468 | 3.2868E-11 | 100.000 | |
Figure 110.6 displays the two-way table for Department by Response. The Weighted Frequency, Std Err of Wgt Freq, Percent, and Std Err of Percent columns use the imputation-adjusted full-sample weight and replicate weights. You should use the weighted statistics from these columns. The Percent column gives each cell as a percentage of the whole population, not of the department. An estimated 8.10% of SIS users are teachers who are "Very Satisfied," with a standard error of 3.11%. An estimated 6.09% of SIS users are administrators who are "Very Satisfied," with a standard error of 2.43%.
Figure 110.6: Crosstabulation
Table of Department by Response

Department | Response | Frequency | Weighted Frequency | Std Err of Wgt Freq | Percent | Std Err of Percent |
---|---|---|---|---|---|---|
0 | 1 | 57 | 637.83724 | 246.18741 | 9.8614 | 3.8062 |
0 | 2 | 55 | 743.50947 | 334.28335 | 11.4952 | 5.1683 |
0 | 3 | 64 | 951.95811 | 258.23015 | 14.7180 | 3.9924 |
0 | 4 | 49 | 342.84458 | 150.44168 | 5.3006 | 2.3259 |
0 | 5 | 53 | 523.75680 | 200.99126 | 8.0977 | 3.1075 |
0 | Total | 278 | 3200 | 429.52229 | 49.4729 | 6.6407 |
1 | 1 | 43 | 617.94159 | 209.04386 | 9.5538 | 3.2320 |
1 | 2 | 48 | 627.55128 | 185.53346 | 9.7024 | 2.8685 |
1 | 3 | 48 | 757.99401 | 237.82609 | 11.7191 | 3.6770 |
1 | 4 | 51 | 870.53830 | 262.37514 | 13.4592 | 4.0565 |
1 | 5 | 41 | 394.06863 | 156.95381 | 6.0926 | 2.4266 |
1 | Total | 231 | 3268 | 429.52229 | 50.5271 | 6.6407 |
Total | 1 | 100 | 1256 | 291.92305 | 19.4153 | 4.5133 |
Total | 2 | 103 | 1371 | 361.02585 | 21.1976 | 5.5817 |
Total | 3 | 112 | 1710 | 305.26968 | 26.4371 | 4.7197 |
Total | 4 | 100 | 1213 | 283.69298 | 18.7598 | 4.3861 |
Total | 5 | 94 | 917.82544 | 243.87967 | 14.1903 | 3.7706 |
Total | Total | 509 | 6468 | 4.7965E-12 | 100.000 | |
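The Percent column in the crosstabulation is expressed relative to the whole population, so the ten cell percentages sum to 100. Dividing a cell percentage by its department total gives a rough within-department share: about 8.0977 / 49.4729 ≈ 16.4% of teachers and 6.0926 / 50.5271 ≈ 12.1% of administrators fall in the "Very Satisfied" category. To obtain these conditional percentages together with their standard errors, one option is the ROW option in the TABLES statement, sketched below under the same weights and replicate weights as the earlier step.

```sas
/* Sketch: row (within-department) percentages and standard errors.  */
/* Uses the imputation-adjusted weight and replicate weights.        */
proc surveyfreq data=SIS_Survey_Imputed varmethod=jackknife;
   tables Department*Response / row;
   weight ImpWt;
   repweights ImpRepWt: / jkcoefs=SIS_JKCoefs;
run;
```

The hand-computed ratios above are approximations taken from the rounded table values; the standard errors of the row percentages cannot be derived from the table and would come from the procedure output.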