Regression with Measurement Errors

In this section, you start with a linear regression model and learn how the regression equation can be specified in PROC CALIS. The regression model is then extended to include measurement errors in the predictors and in the outcome variables. Problems with model identification are introduced.

Simple Linear Regression

Consider fitting a linear equation to two observed variables, $Y$ and $X$. Simple linear regression uses the following model form:

     $Y = \alpha + \beta X + E_Y$

The model makes the following assumption:

     $\mathrm{Cov}(X, E_Y) = 0$

The parameters $\alpha$ and $\beta$ are the intercept and regression coefficient, respectively, and $E_Y$ is an error term. If the values of $X$ are fixed, the values of $E_Y$ are assumed to be independent and identically distributed realizations of a normally distributed random variable with mean zero and variance Var($E_Y$). If $X$ is a random variable, $X$ and $E_Y$ are assumed to have a bivariate normal distribution with zero correlation and variances Var($X$) and Var($E_Y$), respectively. Under either set of assumptions, the usual formulas hold for the estimates of the intercept and regression coefficient and their standard errors. (See Chapter 4, Introduction to Regression Procedures.)
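
For reference, the "usual formulas" can be written out as follows. This is the standard textbook statement and is not specific to PROC REG or PROC CALIS. The sample means $\bar{X}$ and $\bar{Y}$, the sample variance $s_X^2$, the sample covariance $s_{XY}$, and the sample size $n$ are notation introduced here for illustration only.

```latex
\hat{\beta} = \frac{s_{XY}}{s_X^2}, \qquad
\hat{\alpha} = \bar{Y} - \hat{\beta}\,\bar{X}, \qquad
\widehat{\mathrm{SE}}(\hat{\beta}) = \sqrt{\frac{\hat{\sigma}^2_{E_Y}}{(n-1)\,s_X^2}}, \qquad
\hat{\sigma}^2_{E_Y} = \frac{1}{n-2}\sum_{i=1}^{n}\bigl(Y_i - \hat{\alpha} - \hat{\beta} X_i\bigr)^2
```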

In the REG procedure, you can fit a simple linear regression model with a MODEL statement that lists only the names of the manifest variables, as shown in the following statements:

proc reg;
   model Y = X;
run;

You can also fit this model with PROC CALIS, but the syntax is different. You can specify the simple linear regression model in PROC CALIS by using the LINEQS modeling language, as shown in the following statements:

proc calis;
   lineqs
      Y = beta * X + Ey;
run;

LINEQS stands for "LINear EQuationS." You invoke the LINEQS modeling language by using the LINEQS statement in PROC CALIS. In the LINEQS statement, you specify the linear equations of your model. The LINEQS statement syntax is similar to the mathematical equation that you would write for the model. An obvious difference between the LINEQS and the PROC REG model specification is that in LINEQS you can name the parameter involved (for example, beta) and you also specify the error term explicitly. The additional syntax required by the LINEQS statement seems to make the model specification more time-consuming and cumbersome. However, this inconvenience is minor and is offset by the modeling flexibility of the LINEQS modeling language (and of PROC CALIS, generally). As you proceed to more examples in this chapter, you will find the benefits of specifying parameter names for more complicated models with constraints. You will also find that specifying parameter names for unconstrained parameters is optional. Using parameter names in the current example is for the ease of reference in the current discussion.

You might wonder whether an intercept term is missing in the LINEQS statement and where you should put the intercept term if you want to specify it. The intercept term, which is considered as a mean structure parameter in the context of structural equation modeling, is usually omitted when statistical inferences can be drawn from analyzing the covariance structures alone. However, this does not mean that the regression equation has a default fixed-zero intercept in the LINEQS specification. Rather, it means only that the mean structures are saturated and are not estimated in the covariance structure model. Therefore, in the preceding LINEQS specification, the intercept term is implicitly assumed in the model. It is not of primary interest and is not estimated.

However, if you want to estimate the intercept, you can specify it in the LINEQS equations, as shown in the following specification:

proc calis;
   lineqs 
      Y = alpha * Intercept + beta * X + Ey;
run;

In this LINEQS statement, alpha represents the intercept parameter $\alpha$ and Intercept represents an internal "variable" that has a fixed value of 1 for each observation. With this specification, an estimate of $\alpha$ is displayed in the PROC CALIS output results. However, estimation results for other parameters are the same as those from the specification without the intercept term. For this reason the intercept term is not specified in the examples of this section.

Errors-in-Variables Regression

For ordinary unconstrained regression models, there is no reason to use PROC CALIS instead of PROC REG. But suppose that the predictor variable $X$ is a random variable that is contaminated by errors (especially measurement errors), and you want to estimate the linear relationship between the true, error-free scores. The following model takes this kind of measurement error into account:

     $Y = \alpha + \beta F_X + E_Y$
     $X = F_X + E_X$

The model assumes the following:

     $\mathrm{Cov}(F_X, E_Y) = \mathrm{Cov}(F_X, E_X) = \mathrm{Cov}(E_X, E_Y) = 0$

There are two equations in the model. The first one is the so-called structural model, which describes the relationship between $Y$ and the true score predictor $F_X$. This equation is your main interest. However, $F_X$ is a latent variable that has not been observed. Instead, what you have observed for this predictor is $X$, which is the contaminated version of $F_X$ with measurement error or other errors, denoted by $E_X$, added. This measurement process is described in the second equation, or the so-called measurement model. By analyzing the structural and measurement models (or the two linear equations) simultaneously, you want to estimate the true score effect $\beta$.

The assumption that the error terms $E_X$ and $E_Y$ and the latent variable $F_X$ are jointly uncorrelated is of critical importance in the model. This assumption must be justified on substantive grounds such as the physical properties of the measurement process. If this assumption is violated, the estimators might be severely biased and inconsistent.
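
To see what the zero-covariance assumptions buy you, it can help to write out the variances and covariance that the two equations imply for the observed variables. The following is a short derivation sketch added for illustration; it uses only the model equations and the assumptions stated above.

```latex
\begin{aligned}
\mathrm{Var}(X)    &= \mathrm{Var}(F_X) + \mathrm{Var}(E_X) \\
\mathrm{Cov}(X, Y) &= \beta\,\mathrm{Var}(F_X) \\
\mathrm{Var}(Y)    &= \beta^2\,\mathrm{Var}(F_X) + \mathrm{Var}(E_Y)
\end{aligned}
```

The three observed moments depend on four unknowns: $\beta$, Var($F_X$), Var($E_X$), and Var($E_Y$). Additional information about one of them, such as a known value of Var($E_X$), is therefore needed before $\beta$ can be estimated.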

You can express the current errors-in-variables model by the LINEQS modeling language as shown in the following statements:

proc calis;
   lineqs
      Y = beta * Fx + Ey,
      X = 1.   * Fx + Ex;
run;

In this specification, you need to specify only the equations involved without specifying the assumptions about the correlations among Fx, Ey, and Ex. In the LINEQS modeling language, you should always name latent factors with the 'F' or 'f' prefix (for example, Fx) and error terms with the 'E' or 'e' prefix (for example, Ey and Ex). Given this LINEQS notation, latent factors and error terms, by default, are uncorrelated in the model.

Consider an example of an errors-in-variables regression model. Fuller (1987, pp. 18–19) analyzes a data set from Voss (1969) that involves corn yields ($Y$) and available soil nitrogen ($X$) for which there is a prior estimate of the measurement error variance for soil nitrogen, Var($E_X$), of 57. The scientific question is: how does nitrogen affect corn yields? The linear prediction of corn yields by nitrogen should be based on a measure of nitrogen that is not contaminated with measurement error. Hence, the errors-in-variables model is applied. In the model, $F_X$ represents the "true" nitrogen measure and $X$ represents the observed measure of nitrogen, which has a true score component $F_X$ and an error component $E_X$. Given that the measurement error variance for soil nitrogen, Var($E_X$), is 57, you can specify the errors-in-variables regression model with the following statements in PROC CALIS:

data corn(type=cov);
   input _type_ $ _name_ $ y x;
   datalines;
cov    y    87.6727    .
cov    x    104.8818   304.8545
mean   .    97.4545    70.6364
n      .    11         11
;
proc calis data=corn;
   lineqs
      Y  = beta * Fx + Ey,
      X  = 1.   * Fx + Ex;
   variance
      Ex = 57.;
run;

In the VARIANCE statement, the variance of Ex (measurement error for X) is given as the constant value 57. PROC CALIS produces the estimates shown in Figure 17.4.

Figure 17.4 Errors-in-Variables Model for Corn Data
Linear Equations
y =   0.4232 * Fx + 1.0000   Ey
Std Err     0.1658   beta        
t Value     2.5520            
x =   1.0000   Fx + 1.0000   Ex

Estimates for Variances of Exogenous Variables
Variable Type   Variable   Parameter     Estimate   Standard Error   t Value
Error           Ex                       57.00000
Latent          Fx         _Add1        247.85450        136.33508   1.81798
Error           Ey         _Add2         43.29105         23.92488   1.80946

In Figure 17.4, the estimate of beta is 0.4232 with a standard error estimate of 0.1658. The $t$ value is 2.552. It is significant at the 0.05 $\alpha$-level when compared to the critical value of the standard normal variate (that is, the $z$ table). Also shown in Figure 17.4 are the estimated variances of Fx and Ey and their estimated standard errors. The names of these parameters have the prefix '_Add'. They are added by PROC CALIS as default parameters. By employing some conventional rules for setting default parameters, PROC CALIS makes your model specification much easier and more concise. For example, you do not need to specify each error variance parameter manually if it is not constrained in the model. However, you can specify these parameters explicitly if you desire. Note that in Figure 17.4, the variance of Ex is shown as 57 without a standard error estimate because it is a fixed constant in the model.
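
Because this model has three free parameters and three covariance elements, the point estimates can be cross-checked by hand. The following Python sketch is not part of PROC CALIS and does not reproduce its estimation method; it simply solves the three moment equations implied by the model, using the covariance values from the DATA step above. The function name `eiv_moment_estimates` is made up for this illustration.

```python
# Sample covariance elements for the corn data (copied from the DATA step).
var_y = 87.6727
cov_xy = 104.8818
var_x = 304.8545
var_ex = 57.0  # measurement error variance of X, fixed from the prior study


def eiv_moment_estimates(var_y, cov_xy, var_x, var_ex):
    """Solve the moment equations of the errors-in-variables model.

    Illustrative helper (hypothetical name), assuming the model
    Y = beta*Fx + Ey, X = Fx + Ex with mutually uncorrelated Fx, Ey, Ex.
    """
    var_fx = var_x - var_ex              # Var(X) = Var(Fx) + Var(Ex)
    beta = cov_xy / var_fx               # Cov(X, Y) = beta * Var(Fx)
    var_ey = var_y - beta ** 2 * var_fx  # Var(Y) = beta^2 * Var(Fx) + Var(Ey)
    return beta, var_fx, var_ey


beta, var_fx, var_ey = eiv_moment_estimates(var_y, cov_xy, var_x, var_ex)
print(f"beta = {beta:.4f}, Var(Fx) = {var_fx:.5f}, Var(Ey) = {var_ey:.5f}")
```

The printed values should agree with the estimates of beta, Fx, and Ey in Figure 17.4 to the displayed precision. The standard errors are not covered by this check.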

What if you did not model the measurement error in the predictor $X$? That is, what is the estimate of beta if you use ordinary regression of $Y$ on $X$, as described by the equation in the section Simple Linear Regression? You can specify such a linear regression model easily by the LINEQS modeling language. Here, you specify this linear regression model as a special case of the errors-in-variables model. That is, you constrain the variance of measurement error $E_X$ to 0 in the preceding LINEQS model specification to form the linear regression model, as shown in the following statements:

proc calis data=corn;
   lineqs
      Y  = beta * Fx + Ey,
      X  = 1. *   Fx + Ex;
   variance
      Ex = 0.;
run;

Fixing the variance of Ex to zero forces the equality of $X$ and $F_X$ in the measurement model so that this "new" errors-in-variables model is in fact an ordinary regression model. PROC CALIS produces the estimation results in Figure 17.5.

Figure 17.5 Ordinary Regression Model for Corn Data: Zero Measurement Error in X
Linear Equations
y =   0.3440 * Fx + 1.0000   Ey
Std Err     0.1301   beta        
t Value     2.6447            
x =   1.0000   Fx + 1.0000   Ex

Estimates for Variances of Exogenous Variables
Variable Type   Variable   Parameter     Estimate   Standard Error   t Value
Error           Ex                              0
Latent          Fx         _Add1        304.85450        136.33508   2.23607
Error           Ey         _Add2         51.58928         23.07143   2.23607

The estimate of beta is now 0.3440. Given the presence of nonzero measurement error in $X$, this is an underestimate of the effect of nitrogen on corn yields: the errors-in-variables model, which accounts for that error, gives an estimate of 0.4232.
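
The size of the underestimate can be related to the share of Var($X$) that is due to the true score. The following Python sketch, added for illustration and not part of PROC CALIS, computes both slopes from the corn covariance elements and checks that the ordinary regression slope equals the errors-in-variables slope multiplied by that share. The function name `attenuation` and the term "reliability" for the ratio are labels introduced here.

```python
# Covariance elements for the corn data (copied from the DATA step).
var_x = 304.8545
cov_xy = 104.8818
var_ex = 57.0  # measurement error variance of X from the prior study


def attenuation(cov_xy, var_x, var_ex):
    """Return the ordinary slope, the errors-in-variables slope, and the
    ratio Var(Fx) / Var(X) (called 'reliability' in this sketch)."""
    beta_ols = cov_xy / var_x             # slope that ignores the error in X
    beta_eiv = cov_xy / (var_x - var_ex)  # slope that removes Var(Ex) first
    reliability = (var_x - var_ex) / var_x
    return beta_ols, beta_eiv, reliability


beta_ols, beta_eiv, reliability = attenuation(cov_xy, var_x, var_ex)
print(f"ordinary slope = {beta_ols:.4f}")
print(f"errors-in-variables slope = {beta_eiv:.4f}")
print(f"reliability ratio = {reliability:.4f}")
```

Under the assumptions of the errors-in-variables model, the ratio is below 1 whenever Var($E_X$) is positive, so the ordinary regression slope is pulled toward zero.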

Regression with Measurement Errors in X and Y

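What if there are also measurement errors in the outcome variable $Y$? How can you write such an extended model? The following model would take measurement errors in both $X$ and $Y$ into account:

     $F_Y = \alpha + \beta F_X + D_{F_Y}$
     $Y = F_Y + E_Y$
     $X = F_X + E_X$

with the following assumptions:

     $\mathrm{Cov}(F_X, D_{F_Y}) = \mathrm{Cov}(F_X, E_Y) = \mathrm{Cov}(F_X, E_X) = 0$
     $\mathrm{Cov}(D_{F_Y}, E_Y) = \mathrm{Cov}(D_{F_Y}, E_X) = 0$
     $\mathrm{Cov}(E_Y, E_X) = 0$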

Again, the first equation, expressing the relationship between two latent true-score variables, defines the structural or causal model. The next two equations express the observed variables in terms of a true score plus error; these two equations define the measurement model. This is essentially the same form as the so-called LISREL model (Keesling 1972; Wiley 1973; Jöreskog 1973), which has been popularized by the LISREL program (Jöreskog and Sörbom 1988). Typically, there are several $X$ and $Y$ variables in a LISREL model. For the moment, however, the focus is on the current regression form in which there is only a single predictor and a single outcome variable. The LISREL model is considered in the section Fitting LISREL Models by the LISMOD Modeling Language.

With the intercept term left out for modeling, you can use the following statements for fitting the regression model with measurement errors in both $X$ and $Y$:

proc calis data=corn;
   lineqs
      Fy  = beta * Fx + DFy,
      Y   = 1.   * Fy + Ey,
      X   = 1.   * Fx + Ex;
run;

Again, you do not need to specify the zero-correlation assumptions in the LINEQS model because they are set by default given the latent factors and errors in the LINEQS modeling language. When you run this model, PROC CALIS issues the following warning:

   WARNING: Estimation problem not identified: More parameters to 
            estimate ( 5 ) than the total number of mean and 
            covariance elements ( 3 ).

The five parameters in the model include beta and the variances for the exogenous variables: Fx, DFy, Ey, and Ex. These variance parameters are treated as free parameters by default in PROC CALIS. You have five parameters to estimate, but the information for estimating these five parameters comes from the three unique elements in the sample covariance matrix for $X$ and $Y$. Hence, your model is in the so-called underidentification situation. Model identification is discussed in more detail in the section Model Identification.
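
The counting behind the warning can be sketched in a few lines of Python. This is only the simple counting check (free parameters versus distinct covariance elements), added for illustration; the function name `count_covariance_elements` is made up here, and passing this check is a necessary condition for identification, not a sufficient one.

```python
def count_covariance_elements(n_observed):
    """Number of distinct elements in a covariance matrix of n_observed
    variables: the diagonal plus one triangle, p * (p + 1) / 2."""
    return n_observed * (n_observed + 1) // 2


# Free parameters of the model with measurement errors in both X and Y.
free_parameters = ["beta", "Var(Fx)", "Var(DFy)", "Var(Ey)", "Var(Ex)"]

n_moments = count_covariance_elements(2)  # observed variables: X and Y
degrees_of_freedom = n_moments - len(free_parameters)

print(f"covariance elements: {n_moments}")
print(f"free parameters:     {len(free_parameters)}")
print(f"difference:          {degrees_of_freedom}")
```

A negative difference means that there are more free parameters than covariance elements, which is the situation that the warning reports.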

To make the current model identified, you can put constraints on some parameters. This reduces the number of independent parameters to estimate in the model. In the errors-in-variables model for the corn data, the variance of Ex (measurement error for X) is given as the constant value 57, which was obtained from a previous study. This could still be applied in the current model with measurement errors in both $X$ and $Y$. In addition, if you are willing to accept the assumption that the structural equation model is (almost) deterministic, then the variance of Dfy could be set to 0. With these two parameter constraints, the current model is just-identified. That is, you can now estimate three free parameters from three distinct covariance elements in the data. The following statements show the LINEQS model specification for this just-identified model:

proc calis data=corn;
   lineqs
      Fy  = beta * Fx + Dfy,
      Y   = 1.   * Fy + Ey,
      X   = 1.   * Fx + Ex;
   variance
      Ex  = 57.,
      Dfy = 0.;
run;


Figure 17.6 shows the estimation results.

Figure 17.6 Regression Model With Measurement Errors in X and Y for Corn Data
Linear Equations
Fy =   0.4232 * Fx + 1.0000   Dfy
Std Err     0.1658   beta        
t Value     2.5520            
y =   1.0000   Fy + 1.0000   Ey
x =   1.0000   Fx + 1.0000   Ex

Estimates for Variances of Exogenous Variables
Variable Type   Variable   Parameter     Estimate   Standard Error   t Value
Error           Ex                       57.00000
Disturbance     Dfy                             0
Latent          Fx         _Add1        247.85450        136.33508   1.81798
Error           Ey         _Add2         43.29105         23.92488   1.80946

In Figure 17.6, the estimate of beta is 0.4232, which is the same as the estimate of beta in the errors-in-variables model shown in Figure 17.4. The estimated variances for Fx and Ey match for the two models too. In fact, it is not difficult to show mathematically that the current constrained model with measurement errors in both $X$ and $Y$ is equivalent to the errors-in-variables model for the corn data. The numerical results merely confirm this fact.
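
A sketch of the mathematical argument, added for illustration: with the intercept omitted and the disturbance variance fixed at zero, the disturbance drops out of the structural equation, and substituting the structural equation into the measurement equation for $Y$ gives back the errors-in-variables model.

```latex
\begin{aligned}
\mathrm{Var}(D_{F_Y}) = 0 \;&\Rightarrow\; F_Y = \beta F_X \\
Y &= F_Y + E_Y = \beta F_X + E_Y \\
X &= F_X + E_X
\end{aligned}
```

The last two lines are the equations of the errors-in-variables model, with the same free parameters: $\beta$, Var($F_X$), and Var($E_Y$).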

It is important to emphasize that the equivalence shown here is not a general statement about the current model with measurement errors in $X$ and $Y$ and the errors-in-variables model. Essentially, the equivalence of the two models as applied to the corn data is due to the constraints imposed on the variances of Dfy and Ex. The more important implication from these two analyses is that for the model with measurement errors in both $X$ and $Y$, you need to set more parameter constraints to make the model identified. Some constraints might be substantively meaningful, while others might need strong or risky assumptions.

For example, setting the variance of Ex to 57 is substantively meaningful because it is based on a prior study. However, setting the variance of Dfy to 0 implies the acceptance of the deterministic structural model, which could be a rather risky assumption in most practical situations. It turns out that using these two constraints together for the model identification of the regression with measurement errors in both $X$ and $Y$ does not give you more substantively important information than what the errors-in-variables model has already given you (compare Figure 17.6 with Figure 17.4). Therefore, the set of identification constraints you use might be important in at least two aspects. First, it might lead to an identified model if you set them properly. Second, given that the model is identified, the meaningfulness of your model depends on how reasonable your identification constraints are.

The two identification constraints set on the regression model with measurement errors in both $X$ and $Y$ make the model identified. But they do not lead to model estimates that are more informative than those of the errors-in-variables regression. Some other sets of identification constraints, if available, might have been more informative. For example, if there were a prior study about the measurement error variance of corn yields ($Y$), a fixed constant for the variance of Ey could have been set, instead of the unrealistic zero variance constraint of Dfy. This way the estimation results of the regression model with measurement errors in both $X$ and $Y$ would offer you something different from the errors-in-variables regression.
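
To make this alternative concrete, the following Python sketch solves the moment equations of the model with measurement errors in both variables when Var(Ex) and Var(Ey) are both fixed. It is an illustration only and is not part of PROC CALIS. The value 10.0 used for Var(Ey) is purely hypothetical and does not come from Fuller (1987) or any other study, and the function name `moments_with_fixed_error_variances` is made up for this example.

```python
# Covariance elements for the corn data (copied from the DATA step).
var_y = 87.6727
cov_xy = 104.8818
var_x = 304.8545
var_ex = 57.0   # from the prior study cited in the text
var_ey = 10.0   # HYPOTHETICAL value, chosen only to illustrate the idea


def moments_with_fixed_error_variances(var_y, cov_xy, var_x, var_ex, var_ey):
    """Solve for beta, Var(Fx), and Var(DFy), assuming the model
    Fy = beta*Fx + DFy, Y = Fy + Ey, X = Fx + Ex with uncorrelated terms."""
    var_fx = var_x - var_ex  # Var(X) = Var(Fx) + Var(Ex)
    beta = cov_xy / var_fx   # Cov(X, Y) = beta * Var(Fx)
    # Var(Y) = beta^2 * Var(Fx) + Var(DFy) + Var(Ey)
    var_dfy = var_y - beta ** 2 * var_fx - var_ey
    return beta, var_fx, var_dfy


beta, var_fx, var_dfy = moments_with_fixed_error_variances(
    var_y, cov_xy, var_x, var_ex, var_ey
)
print(f"beta = {beta:.4f}, Var(Fx) = {var_fx:.5f}, Var(DFy) = {var_dfy:.5f}")
```

In this sketch the slope does not change. What changes is that the unexplained variance of $Y$ is split into a disturbance part and a measurement error part instead of being assigned entirely to Ey. A fixed value of Var(Ey) that is too large would make the solved disturbance variance negative, which would signal an implausible constraint.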

Setting identification constraints could be based on convention or other arguments. See the section Illustration of Model Identification: Spleen Data for an example where model identification is attained by setting constant error variances for $X$ and $Y$ in the model. For the corn data, you have seen that fixing the error variance of the predictor variable led to model identification of the errors-in-variables model. In this case, prior knowledge about the measurement error variance is necessary. This necessity is partly due to the fact that each latent true score variable has only one observed variable as its indicator measure. When you have more measurement indicators for the same latent factor, fixing the measurement error variances to constants for model identification would not be necessary. This is the modeling scenario assumed by the LISREL model (see the section Fitting LISREL Models by the LISMOD Modeling Language), of which the confirmatory factor model is a special case. The confirmatory factor model is described and illustrated in the section The FACTOR and RAM Modeling Languages.