The QLIM Procedure

Endogeneity and Instrumental Variables

Subsections:

Test for Endogeneity
Overidentification Test

PROC QLIM models, such as qualitative response and limited dependent variable models, assume that the errors are independent of the explanatory variables. If this assumption fails, the distribution on which the likelihood is based is misspecified and the estimated coefficients are inconsistent.

To begin, consider a linear model

$\begin{equation*} y_{i} = y_{i}^{*} = \beta _{0} + \beta _{1}x_{1i} + \cdots + \beta _{k}x_{ki} + u_{i} \end{equation*}$

Assume that $E(u) = 0$ , $\mbox{Cov}(x_{j}, u) = 0$ for $j = 1, \ldots , k-1$ , and $\mbox{Cov}(x_{k}, u) = \rho \neq 0$ . Therefore, $x_{k}$ is endogenous. Endogeneity can arise from many sources, such as measurement error in $x_{k}$ or the omission of a variable that is correlated with $x_{k}$ . If you ignore the endogeneity, you can estimate this model in PROC QLIM as follows (assuming $k=4$ ):

   proc qlim data=a;
      model y = x1 x2 x3 x4;
   run;

However, this approach produces inconsistent maximum likelihood estimates. To obtain consistent maximum likelihood estimates, you should consider the joint density of the dependent variable and the endogenous variables. To do this in PROC QLIM, you need at least one instrument: an observable variable, $z_{1}$ , that is not in the structural equation and that satisfies two conditions. First, $z_{1}$ is exogenous (that is, $\mbox{Cov}(z_{1}, u) = 0$ ). Second, $z_{1}$ is correlated with the endogenous regressor $x_{k}$ . Then, you can model $x_{k}$ as

$x_{ki} = \pi _{0} + \pi _{1}x_{1i} + \cdots + \pi _{k-1}x_{(k-1)i} + \theta z_{1i} + \epsilon _{i}$

You can now write this reduced form equation along with the structural equation to obtain the consistent maximum likelihood estimates as follows:

   proc qlim data=a;
      model y = x1 x2 x3 x4;
      model x4 = x1 x2 x3 z1;
   run;
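
To experiment with the preceding statements, you could generate a small artificial data set in which $x_{4}$ is endogenous by construction. The following is only a sketch: the seed, sample size, and coefficient values are arbitrary assumptions and are not part of the original example. The errors u and e share a common component, so $\mbox{Cov}(x_{4}, u) \neq 0$ , whereas z1 enters only the equation for x4. Because this particular model is linear, the PROC SYSLIN step estimates the same structural equation by two-stage least squares, which gives a simple cross-check on the PROC QLIM estimates.

   data a;
      call streaminit(12345);                 /* arbitrary seed (assumption) */
      do i = 1 to 1000;
         x1 = rand('normal');
         x2 = rand('normal');
         x3 = rand('normal');
         z1 = rand('normal');                 /* instrument: excluded from the equation for y */
         e  = rand('normal');                 /* reduced form error */
         u  = 0.6*e + 0.8*rand('normal');     /* structural error, Cov(u, e) = 0.6 */
         x4 = 0.5*x1 - 0.3*x2 + 0.2*x3 + z1 + e;
         y  = 1 + x1 + x2 + x3 + 0.5*x4 + u;
         output;
      end;
      keep y x1-x4 z1;
   run;

   /* two-stage least squares for the linear case */
   proc syslin data=a 2sls;
      endogenous x4;
      instruments x1 x2 x3 z1;
      model y = x1 x2 x3 x4;
   run;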

Estimating the structural model together with the reduced form models for the endogenous explanatory variables gives you the full information maximum likelihood (FIML) estimates. Because of the linearity of the model, you can estimate it efficiently and more simply by using the two-stage least squares estimator. However, PROC QLIM handles nonlinear models such as qualitative response and limited dependent variable models, and in their estimation it maximizes the corresponding joint likelihood function. In the case of endogeneity, when the reduced form models for the endogenous explanatory variables are written along with the structural model, PROC QLIM maximizes the likelihood function that is obtained from the joint density of the response variable and the endogenous explanatory variables. For example, consider the following censored regression model in which one of the explanatory variables is a continuous endogenous variable:

$\begin{eqnarray*} y^{*}_{1i} & = & \alpha y_{2i} + \mathbf{z}_{1i}'\bbeta + u_{i} \\ y_{2i} & = & \mathbf{z}_{i}'\bpi + \epsilon _{i} \\ y_{1i} & = & \left\{ \begin{array}{ll} y^{*}_{1i} & \mr {~ if~ } y^{*}_{1i}>0 \\ 0 & \mr {~ if~ } y^{*}_{1i}\leq 0 \end{array} \right. \end{eqnarray*}$

The exogenous explanatory variables are $\mathbf{z}_{1i}$ , and the continuous endogenous explanatory variable is $y_{2i}$ . The vector $\mathbf{z}_{i}$ contains $\mathbf{z}_{1i}$ together with the instrumental variables.

The likelihood function to maximize is

$L = \prod _{i\in \{ y_{1i}>0\} } f(y_{1i}, y_{2i}) \cdot \prod _{i\in \{ y_{1i}=0\} } \int _{-\infty }^{0}f(y^{*}_{1i}, y_{2i})dy^{*}_{1i}$

where $f(y^{*}_{1i}, y_{2i})$ is the joint density of $y^{*}_{1i}$ and $y_{2i}$ . Note that $y_{1i}$ is substituted for $y^{*}_{1i}$ when $y_{1i}>0$ . If you assume $(u_{i}, \epsilon _{i}) \stackrel{iid}{\sim } N(\mathbf{0}, \bSigma )$ with $\bSigma = \left[ \begin{array}{ll} \sigma _{u}^{2} & \eta \\ \eta & \sigma _{\epsilon }^{2} \end{array} \right]$ , then, by using $f(y^{*}_{1i}, y_{2i}) = f(y^{*}_{1i}|y_{2i})\cdot f(y_{2i})$ , you can write the likelihood function for each $i$ as a product of two parts. The first part is the probability density function of the normal distribution with mean $\mathbf{z}_{i}'\bpi$ and variance $\sigma _{\epsilon }^{2}$ , and the second part follows a Tobit model that has latent mean $\alpha y_{2i} + \mathbf{z}_{1i}'\bbeta + (\eta /\sigma _{\epsilon }^{2})(y_{2i} - \mathbf{z}_{i}'\bpi )$ and variance $\sigma _{u}^{2} - (\eta ^{2}/\sigma _{\epsilon }^{2})$ . Then, you can obtain the log-likelihood function by taking the log of this product and summing over $i$ (for more information, see Wooldridge (2002, chapter 16)). This is the log-likelihood function that PROC QLIM maximizes. The parameters $(\hat{\alpha }, \hat{\bbeta }, \hat{\bpi }, \hat{\sigma }_{u}^{2}, \hat{\sigma }_{\epsilon }^{2}, \hat{\eta })$ that are obtained from this maximization are the FIML estimators. Assuming that the latent model includes two instrumental variables and two exogenous explanatory variables, you can estimate this model in PROC QLIM as follows:

   proc qlim data=a;
      model y1 = y2 z11 z12 / censored(lb=0);
      model y2 = z11 z12 z21 z22;
   run;
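
Under the bivariate normality assumption stated before the preceding statements, the two-part factorization of the likelihood can be written out explicitly. The shorthand $m_{i}$ and $s$ is introduced here only for compactness and is not notation from the original text; $\phi$ and $\Phi$ denote the standard normal density and distribution functions.

$\begin{eqnarray*} m_{i} & = & \alpha y_{2i} + \mathbf{z}_{1i}'\bbeta + (\eta /\sigma _{\epsilon }^{2})(y_{2i} - \mathbf{z}_{i}'\bpi ), \qquad s^{2} = \sigma _{u}^{2} - \eta ^{2}/\sigma _{\epsilon }^{2} \\ \log L & = & \sum _{i} \log \left[ \frac{1}{\sigma _{\epsilon }} \phi \left( \frac{y_{2i} - \mathbf{z}_{i}'\bpi }{\sigma _{\epsilon }} \right) \right] + \sum _{i\in \{ y_{1i}>0\} } \log \left[ \frac{1}{s} \phi \left( \frac{y_{1i} - m_{i}}{s} \right) \right] + \sum _{i\in \{ y_{1i}=0\} } \log \Phi \left( -\frac{m_{i}}{s} \right) \end{eqnarray*}$

The first sum is the contribution of the reduced form density $f(y_{2i})$ . The second and third sums are the uncensored and censored contributions of the conditional Tobit part $f(y^{*}_{1i}|y_{2i})$ .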

For simple examples like the preceding ones, you can derive the likelihood function easily. However, as the number of endogenous explanatory variables increases, if these variables have a discontinuous nature, if simultaneity among equations exists, or if a combination of these occurs, then the derivation of the likelihood function becomes cumbersome, or, in some cases, the likelihood function does not even have a closed analytical form.

PROC QLIM can handle endogeneity regardless of the nature of the endogenous explanatory variables for a single structural model. In the case of one endogenous explanatory variable, PROC QLIM reports the FIML estimates that are calculated by using the analytical likelihood function that is obtained from the joint distribution of the dependent variable and the endogenous variable. When there is more than one endogenous explanatory variable, the analytical form of the likelihood function is usually not available; in this case PROC QLIM reports the simulated maximum likelihood estimates. For the simulated maximum likelihood estimation method, PROC QLIM uses the Geweke-Hajivassiliou-Keane (GHK) simulator (see, among others, Hajivassiliou, McFadden, and Ruud (1996)) to simulate the joint distribution of the dependent variable and the endogenous variables. The simulation is facilitated by assuming that the error terms in the latent models for the dependent variable and the endogenous explanatory variables are distributed as multivariate normal.

When you estimate a model in PROC QLIM, you can take the endogeneity into account by writing the structural model along with the reduced form models for each endogenous variable. Examples are provided in the following sections.

Probit Model with a Continuous Endogenous Explanatory Variable

Consider a probit model that contains a single endogenous explanatory variable in addition to two instruments and two exogenous explanatory variables. The model is

$\begin{eqnarray*} y_{1i}^* & = & \alpha _{1} y_{2i} + \beta _{1}z_{1i} + \beta _{2}z_{2i} + u_{i} \\ y_{2i}^* & = & \pi _{1}z_{1i} + \pi _{2}z_{2i} + \pi _{3}z_{3i} + \pi _{4}z_{4i} + \epsilon _{i} \\ y_{1i} & = & \left\{ \begin{array}{ll} 1 & \mr {~ if~ } y^{*}_{1i}>0 \\ 0 & \mr {~ if~ } y^{*}_{1i}\leq 0 \\ \end{array} \right. \\ y_{2i} & = & y_{2i}^* \end{eqnarray*}$

where $\mbox{Cov}(u, \epsilon ) = \eta$ . You can estimate this model by using the following statements:

   proc qlim data=a;
      model y1 = y2 z1 z2 / discrete;
      model y2 = z1 z2 z3 z4;
   run;
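
If you want to try these statements without real data, one possible way to simulate a data set that matches the structure of this model is sketched below. The seed, sample size, and all coefficient values are illustrative assumptions; the variable names eps, u, and y1star are hypothetical helpers that are dropped from the final data set. The errors are constructed so that $\mbox{Var}(u) = 1$ and $\mbox{Cov}(u, \epsilon ) = 0.5$ .

   data a;
      call streaminit(2024);                          /* arbitrary seed (assumption) */
      do i = 1 to 2000;
         z1 = rand('normal');
         z2 = rand('normal');
         z3 = rand('normal');                         /* instrument */
         z4 = rand('normal');                         /* instrument */
         eps = rand('normal');                        /* reduced form error */
         u   = 0.5*eps + sqrt(0.75)*rand('normal');   /* Var(u) = 1, Cov(u, eps) = 0.5 */
         y2  = 0.5*z1 + 0.5*z2 + z3 + z4 + eps;       /* continuous endogenous regressor */
         y1star = 0.8*y2 + 0.5*z1 - 0.5*z2 + u;       /* latent response */
         y1  = (y1star > 0);                          /* observed binary response */
         output;
      end;
      keep y1 y2 z1-z4;
   run;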

Probit Model with a Binary Endogenous Explanatory Variable

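Consider a probit model that contains a single binary endogenous explanatory variable in addition to two instruments and two exogenous explanatory variables. The model is

$\begin{eqnarray*} y_{1i}^* & = & \alpha _{1} y_{2i} + \beta _{1}z_{1i} + \beta _{2}z_{2i} + u_{i} \\ y_{2i}^* & = & \pi _{1}z_{1i} + \pi _{2}z_{2i} + \pi _{3}z_{3i} + \pi _{4}z_{4i} + \epsilon _{i} \\ y_{1i} & = & \left\{ \begin{array}{ll} 1 & \mr {~ if~ } y^{*}_{1i}>0 \\ 0 & \mr {~ if~ } y^{*}_{1i}\leq 0 \\ \end{array} \right. \\ y_{2i} & = & \left\{ \begin{array}{ll} 1 & \mr {~ if~ } y^{*}_{2i}>0 \\ 0 & \mr {~ if~ } y^{*}_{2i}\leq 0 \\ \end{array} \right. \end{eqnarray*}$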

where $\mbox{Cov}(u, \epsilon ) = \eta$ . You can estimate this model by using the following statements:

   proc qlim data=a;
      model y1 = y2 z1 z2 / discrete;
      model y2 = z1 z2 z3 z4 / discrete;
   run;

Probit Model with a Censored Endogenous Explanatory Variable

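Consider a probit model that contains a single censored (below zero) endogenous explanatory variable in addition to two instruments and two exogenous explanatory variables. The model is

$\begin{eqnarray*} y_{1i}^* & = & \alpha _{1} y_{2i} + \beta _{1}z_{1i} + \beta _{2}z_{2i} + u_{i} \\ y_{2i}^* & = & \pi _{1}z_{1i} + \pi _{2}z_{2i} + \pi _{3}z_{3i} + \pi _{4}z_{4i} + \epsilon _{i} \\ y_{1i} & = & \left\{ \begin{array}{ll} 1 & \mr {~ if~ } y^{*}_{1i}>0 \\ 0 & \mr {~ if~ } y^{*}_{1i}\leq 0 \\ \end{array} \right. \\ y_{2i} & = & \left\{ \begin{array}{ll} y^{*}_{2i} & \mr {~ if~ } y^{*}_{2i}>0 \\ 0 & \mr {~ if~ } y^{*}_{2i}\leq 0 \\ \end{array} \right. \end{eqnarray*}$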

where $\mbox{Cov}(u, \epsilon ) = \eta$ . You can estimate this model by using the following statements:

   proc qlim data=a;
      model y1 = y2 z1 z2 / discrete;
      model y2 = z1 z2 z3 z4 / censored(lb=0);
   run;

Censored Regression Model with a Binary Endogenous Explanatory Variable

Consider a Type 1 Tobit model that contains a single binary endogenous explanatory variable in addition to two instruments and two exogenous explanatory variables. The model is

$\begin{eqnarray*} y_{1i}^* & = & \alpha _{1} y_{2i} + \beta _{1}z_{1i} + \beta _{2}z_{2i} + u_{i} \\ y_{2i}^* & = & \pi _{1}z_{1i} + \pi _{2}z_{2i} + \pi _{3}z_{3i} + \pi _{4}z_{4i} + \epsilon _{i} \\ y_{1i} & = & \left\{ \begin{array}{ll} y^{*}_{1i} & \mr {~ if~ } y^{*}_{1i}>0 \\ 0 & \mr {~ if~ } y^{*}_{1i}\leq 0 \\ \end{array} \right.\\ y_{2i} & = & \left\{ \begin{array}{ll} 1 & \mr {~ if~ } y^{*}_{2i}>0 \\ 0 & \mr {~ if~ } y^{*}_{2i}\leq 0 \\ \end{array} \right. \end{eqnarray*}$

where $\mbox{Cov}(u, \epsilon ) = \eta$ . You can estimate this model by using the following statements:

   proc qlim data=a;
      model y1 = y2 z1 z2 / censored(lb=0);
      model y2 = z1 z2 z3 z4 / discrete;
   run;

Censored Regression Model with Binary and Continuous Endogenous Explanatory Variables

Consider a Type 1 Tobit model that contains binary and continuous endogenous explanatory variables in addition to two instruments and two exogenous explanatory variables. The model is

$\begin{eqnarray*} y_{1i}^* & = & \alpha _{1} y_{21i} + \alpha _{2} y_{22i} + \beta _{1}z_{1i} + \beta _{2}z_{2i} + u_{i} \\ y_{21i}^* & = & \pi _{11}z_{1i} + \pi _{12}z_{2i} + \pi _{13}z_{3i} + \pi _{14}z_{4i} + \epsilon _{1i} \\ y_{22i}^* & = & \pi _{21}z_{1i} + \pi _{22}z_{2i} + \pi _{23}z_{3i} + \pi _{24}z_{4i} + \epsilon _{2i} \\ y_{1i} & = & \left\{ \begin{array}{ll} y^{*}_{1i} & \mr {~ if~ } y^{*}_{1i}>0 \\ 0 & \mr {~ if~ } y^{*}_{1i}\leq 0 \\ \end{array} \right. \\ y_{21i} & = & \left\{ \begin{array}{ll} 1 & \mr {~ if~ } y^{*}_{21i}>0 \\ 0 & \mr {~ if~ } y^{*}_{21i}\leq 0 \\ \end{array} \right.\\ y_{22i} & = & y^{*}_{22i} \end{eqnarray*}$

where $\mbox{Cov}(u, \epsilon _{1}, \epsilon _{2}) = \eta$ . You can estimate this model by using the following statements:

   proc qlim data=a;
      model y1 = y21 y22 z1 z2 / censored(lb=0);
      model y21 = z1 z2 z3 z4  / discrete;
      model y22 = z1 z2 z3 z4;
   run;

Probit Model with Binary, Censored, and Truncated Endogenous Explanatory Variables

Consider a probit model that contains binary, censored (below zero), and truncated (below zero) endogenous explanatory variables. The model is

$\begin{eqnarray*} y_{1i}^* & = & \alpha _{1} y_{21i} + \alpha _{2} y_{22i} + \alpha _{3} y_{23i} + u_{i} \\ y_{21i}^* & = & \pi _{11}z_{1i} + \pi _{12}z_{2i} + \pi _{13}z_{3i} + \pi _{14}z_{4i} + \epsilon _{1i} \\ y_{22i}^* & = & \pi _{21}z_{1i} + \pi _{22}z_{2i} + \pi _{23}z_{3i} + \pi _{24}z_{4i} + \epsilon _{2i} \\ y_{23i}^* & = & \pi _{31}z_{1i} + \pi _{32}z_{2i} + \pi _{33}z_{3i} + \pi _{34}z_{4i} + \epsilon _{3i} \\ y_{1i} & = & \left\{ \begin{array}{ll} 1 & \mr {~ if~ } y^{*}_{1i}>0 \\ 0 & \mr {~ if~ } y^{*}_{1i}\leq 0 \\ \end{array} \right. \\ y_{21i} & = & \left\{ \begin{array}{ll} 1 & \mr {~ if~ } y^{*}_{21i}>0 \\ 0 & \mr {~ if~ } y^{*}_{21i}\leq 0 \\ \end{array} \right.\\ y_{22i} & = & \left\{ \begin{array}{ll} y^{*}_{22i} & \mr {~ if~ } y^{*}_{22i}>0 \\ 0 & \mr {~ if~ } y^{*}_{22i}\leq 0 \\ \end{array} \right.\\ y_{23i} & = & y^{*}_{23i} \mr {~ ~ ~ if~ ~ ~ } y^{*}_{23i}>0 \end{eqnarray*}$

where $z_{1}, \ldots , z_{4}$ are the instrumental variables that are independent of the errors. You can estimate this model by using the following statements:

   proc qlim data=a;
      model y1  = y21 y22 y23 / discrete;
      model y21 = z1 z2 z3 z4 / discrete;
      model y22 = z1 z2 z3 z4 / censored(lb=0);
      model y23 = z1 z2 z3 z4 / truncated(lb=0);
   run;

Note that the dependent variable $y_{1}$ should not occur in the models for the endogenous explanatory variables, because this causes inconsistent coefficient estimates. In other words, you should write the models for the endogenous explanatory variables as reduced form models. PROC QLIM does not handle simultaneous equations models.

Test for Endogeneity

PROC QLIM provides two ways to test the null hypothesis that an endogenous explanatory variable (EEV) is in fact exogenous. In the case of a single EEV, the first method is a likelihood ratio test of $H_0: \mbox{\_rho} = 0$ . For example, consider the probit model with a binary endogenous explanatory variable that was considered earlier; $y_2$ is exogenous if the error term in the model for $y_1^*$ is uncorrelated with the error term in the model for $y_2^*$ . Therefore, testing whether this correlation is 0 provides an endogeneity test for $y_2$ . You can do this in PROC QLIM as follows:

   proc qlim data=a;
      model y1 = y2 z1 z2 / discrete;
      model y2 = z1 z2 z3 z4 / discrete;
      test  _rho = 0 / LR;
   run;

Failing to reject the null hypothesis favors the decision that $y_2$ is exogenous in the model for $y_1$ .

When there are two or more EEVs, the test becomes the joint likelihood ratio test of whether corresponding correlations are 0 or not.
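
As a reminder of what a likelihood ratio test compares in general terms, the statistic is built from the maximized log likelihoods with and without the restriction. The symbols $\hat{L}_{U}$ (unrestricted), $\hat{L}_{R}$ (restricted), and $q$ (number of restrictions) are introduced here for illustration and are not PROC QLIM output names.

$\begin{equation*} LR = 2\left( \log \hat{L}_{U} - \log \hat{L}_{R} \right) \stackrel{a}{\sim } \chi ^{2}_{q} \quad \mbox{under } H_{0} \end{equation*}$

For a single EEV, the only restriction is that the correlation is 0, so $q = 1$ . With several EEVs, $q$ is the number of correlations that are jointly set to 0.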

The second testing method is similar to the approach of Rivers and Vuong (1988). Considering the same model, you can write

$u_{i} = \theta \epsilon _{i} + e_{i}$

where $\theta = \eta /\sigma _{\epsilon }^{2}$ and $e$ is independent of the $z$ variables and $\epsilon$ . You can now write

$\begin{eqnarray*} y_{1i}^* & = & \alpha _{1} y_{2i} + \beta _{1}z_{1i} + \beta _{2}z_{2i} + \theta \epsilon _{i} + e_{i} \\ \end{eqnarray*}$

Testing $H_0: \theta = 0$ is the same as testing whether $u_{i}$ is correlated with $\epsilon _{i}$ or testing whether $y_{2i}$ is endogenous or not. Because $\epsilon _{i}$ are unobserved, you can replace them with the OLS residuals from the model for $y_{2i}^{*}$ and apply a robust t test. Note that even though $y_{2i}$ is binary (or censored), the test is still correct under $H_0$ .
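
The expression for $\theta$ used in this argument is the coefficient from the linear projection of $u$ on $\epsilon$ . As a short worked step, with $\sigma _{u}^{2}$ and $\sigma _{\epsilon }^{2}$ denoting the variances of $u$ and $\epsilon$ as in the censored regression example earlier in this section:

$\begin{eqnarray*} \theta & = & \frac{\mbox{Cov}(u, \epsilon )}{\mbox{Var}(\epsilon )} = \frac{\eta }{\sigma _{\epsilon }^{2}} \\ \mbox{Var}(e) & = & \mbox{Var}(u - \theta \epsilon ) = \sigma _{u}^{2} - 2\theta \eta + \theta ^{2}\sigma _{\epsilon }^{2} = \sigma _{u}^{2} - \frac{\eta ^{2}}{\sigma _{\epsilon }^{2}} \end{eqnarray*}$

Hence $\theta = 0$ exactly when $\eta = 0$ , which is why a test on the coefficient of the residual is a test of exogeneity.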

This approach can be summarized as a two-step procedure. In the first step, generated regressors—that is, the OLS residuals from the models for each of the EEVs—are obtained. In the second step, the structural model that includes the generated regressors as additional explanatory variables is estimated by the maximum likelihood method and the joint significance of these generated regressors is tested by the Wald test.

In PROC QLIM, you can apply the second method for the same test that was considered previously as follows:

   proc qlim data=a;
      model y1 = y2 z1 z2 / discrete endotest(y2);
      model y2 = z1 z2 z3 z4 / discrete;
   run;
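
For intuition, the two steps can also be carried out by hand. The following is a rough sketch of the logic only, not a description of how the ENDOTEST option is implemented, and its output might differ from what ENDOTEST reports (for example, the text mentions a robust test, whereas this sketch requests an ordinary Wald test). The data set name a2 and the residual variable name ehat are hypothetical.

   /* Step 1: OLS residuals from the reduced form model for y2 */
   proc reg data=a noprint;
      model y2 = z1 z2 z3 z4;
      output out=a2 r=ehat;
   run;
   quit;

   /* Step 2: probit with the residual as a generated regressor */
   proc qlim data=a2;
      model y1 = y2 z1 z2 ehat / discrete;
      test ehat = 0 / wald;
   run;

Failing to reject the hypothesis that the coefficient of ehat is 0 points in the same direction as failing to reject $H_0: \theta = 0$ .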

Overidentification Test

In PROC QLIM you can test the validity of instrumental variables (IVs) by specifying the OVERID option in the ENDOGENOUS or MODEL statement. The OVERID test is a maximum likelihood version of the overidentifying restrictions test in the IV framework. If you have more IVs than are necessary for identification (that is, you have overidentifying IVs), you can use them to test the validity of your IVs. When you use the OVERID option to specify the overidentifying IVs, PROC QLIM includes these IVs as additional explanatory variables in the structural model, estimates that model by maximum likelihood jointly with the reduced form models, and applies the likelihood ratio test of the joint significance of these IVs. In effect, you test whether the overidentifying IVs are correlated with the error term in the structural model. You specify the reduced form models through the overidentifying IVs. The structural model is the model that includes the OVERID option. For example, consider the probit model that contains a continuous endogenous explanatory variable. You can consider $z_3$ or $z_4$ in the model for $y_2$ as an overidentifying IV; therefore, you can specify the OVERID test as follows:

   proc qlim data=a;
      model y1 = y2 z1 z2 / discrete overid(y2.z4);
      model y2 = z1 z2 z3 z4;
   run;

In this case, PROC QLIM estimates the structural model for $y_1$ , including the overidentifying IV $z_4$ as an additional explanatory variable in this model, jointly with the reduced form model for $y_2$ . Then it uses the likelihood ratio test to test the hypothesis that the overidentifying IV is insignificant. Rejecting this hypothesis raises doubts about the validity of the instruments $z_3$ and $z_4$ .

Note that, as long as you have continuous endogenous explanatory variables, the test result is invariant to which overidentifying IVs you specify in the test.
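
As a small worked count for the preceding example, using the usual order-condition bookkeeping (the labels are written out in words here and are not PROC QLIM terms): the instruments excluded from the structural model are $z_3$ and $z_4$ , and there is one endogenous explanatory variable, $y_2$ .

$\begin{equation*} \mbox{number of overidentifying IVs} = \mbox{number of excluded instruments} - \mbox{number of EEVs} = 2 - 1 = 1 \end{equation*}$

This count is consistent with naming only one of the two instruments, such as $z_4$ , in the OVERID option.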