In sample selection models, one or several dependent variables are observed only when another variable takes certain values. For example, the standard Heckman selection model can be defined as

$$
\begin{aligned}
z_i^* &= \mathbf{w}_i'\gamma + u_i \\
z_i &= \begin{cases} 1 & \text{if } z_i^* > 0 \\ 0 & \text{if } z_i^* \le 0 \end{cases} \\
y_i &= \mathbf{x}_i'\beta + \epsilon_i \quad \text{if } z_i = 1
\end{aligned}
$$

where $u_i$ and $\epsilon_i$ are jointly normal with 0 mean, standard deviations of 1 and $\sigma$, respectively, and correlation of $\rho$. Selection is based on the variable $z$, and $y$ is observed when $z$ has a value of 1. Least squares regression that uses the observed data of $y$ produces inconsistent estimates of $\beta$. The maximum likelihood method is used to estimate selection models. It is also possible to estimate these models by using Heckman’s method, which is more computationally efficient. However, the resulting estimates, although consistent, are not asymptotically efficient under a normality assumption. Moreover, this method often violates the constraint on the correlation coefficient, $|\rho| \le 1$.
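The log-likelihood function of the Heckman selection model is written as

$$
\begin{aligned}
\ell ={}& \sum_{i \in \{z_i = 0\}} \ln\left[1 - \Phi(\mathbf{w}_i'\gamma)\right] \\
&+ \sum_{i \in \{z_i = 1\}} \left\{ \ln\phi\!\left(\frac{y_i - \mathbf{x}_i'\beta}{\sigma}\right) - \ln\sigma + \ln\Phi\!\left(\frac{\mathbf{w}_i'\gamma + \rho\,\dfrac{y_i - \mathbf{x}_i'\beta}{\sigma}}{\sqrt{1 - \rho^2}}\right) \right\}
\end{aligned}
$$

where $\phi$ and $\Phi$ denote the standard normal density and cumulative distribution functions.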
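Selection can be based on only one variable, but it can govern the observation of several variables. For example, selection is based on the variable $z$ in the following switching regression model:

$$
\begin{aligned}
z_i^* &= \mathbf{w}_i'\gamma + u_i \\
z_i &= \begin{cases} 1 & \text{if } z_i^* > 0 \\ 0 & \text{if } z_i^* \le 0 \end{cases} \\
y_{1i} &= \mathbf{x}_{1i}'\beta_1 + \epsilon_{1i} \quad \text{if } z_i = 0 \\
y_{2i} &= \mathbf{x}_{2i}'\beta_2 + \epsilon_{2i} \quad \text{if } z_i = 1
\end{aligned}
$$

If $z_i = 0$, then $y_{1i}$ is observed. If $z_i = 1$, then $y_{2i}$ is observed. Because $y_1$ and $y_2$ are never observed at the same time, the correlation between $y_1$ and $y_2$ cannot be estimated. Only the correlation between $z$ and $y_1$ and the correlation between $z$ and $y_2$ can be estimated. This estimation uses the maximum likelihood method.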
A brief example of the SAS statements for this model can be found in Sample Selection Model.
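The Heckman selection model can include censoring or truncation. For a brief example of the SAS statements for these models, see Sample Selection Model with Truncation and Censoring. The following example shows a variable $y_i$ that is censored from below at zero:

$$
\begin{aligned}
z_i^* &= \mathbf{w}_i'\gamma + u_i \\
z_i &= \begin{cases} 1 & \text{if } z_i^* > 0 \\ 0 & \text{if } z_i^* \le 0 \end{cases} \\
y_i^* &= \mathbf{x}_i'\beta + \epsilon_i \quad \text{if } z_i = 1 \\
y_i &= \begin{cases} y_i^* & \text{if } y_i^* > 0 \\ 0 & \text{if } y_i^* \le 0 \end{cases}
\end{aligned}
$$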
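In this case, the log-likelihood function of the Heckman selection model needs to be modified as follows to include the censored region:

$$
\begin{aligned}
\ell ={}& \sum_{i \in \{z_i = 0\}} \ln\left[1 - \Phi(\mathbf{w}_i'\gamma)\right] \\
&+ \sum_{i \in \{z_i = 1,\; y_i = y_i^*\}} \left\{ \ln\phi\!\left(\frac{y_i - \mathbf{x}_i'\beta}{\sigma}\right) - \ln\sigma + \ln\Phi\!\left(\frac{\mathbf{w}_i'\gamma + \rho\,\dfrac{y_i - \mathbf{x}_i'\beta}{\sigma}}{\sqrt{1 - \rho^2}}\right) \right\} \\
&+ \sum_{i \in \{z_i = 1,\; y_i = 0\}} \ln \int_{-\infty}^{-\mathbf{x}_i'\beta/\sigma} \int_{-\mathbf{w}_i'\gamma}^{\infty} \phi_2(u, v; \rho)\, du\, dv
\end{aligned}
$$

where $\phi_2(u, v; \rho)$ is the standard bivariate normal density with correlation $\rho$.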
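If $y_i$ is truncated from below at 0 instead of censored, the log-likelihood function can be written as

$$
\begin{aligned}
\ell ={}& \sum_{i \in \{z_i = 0\}} \ln\left[1 - \Phi(\mathbf{w}_i'\gamma)\right] \\
&+ \sum_{i \in \{z_i = 1\}} \left\{ \ln\phi\!\left(\frac{y_i - \mathbf{x}_i'\beta}{\sigma}\right) - \ln\sigma + \ln\Phi\!\left(\frac{\mathbf{w}_i'\gamma + \rho\,\dfrac{y_i - \mathbf{x}_i'\beta}{\sigma}}{\sqrt{1 - \rho^2}}\right) - \ln\Phi\!\left(\frac{\mathbf{x}_i'\beta}{\sigma}\right) \right\}
\end{aligned}
$$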
Sample selection bias arises from nonrandom selection of the sample from the population. A classic example is using a sample of market wages for working women to estimate the female labor supply function. This sample is nonrandom because it includes only the wages of women whose market wage exceeds their home wage at zero hours of work.
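A simple selection model can be written as the latent model

$$
\begin{aligned}
z_i^* &= \mathbf{w}_i'\gamma + u_i \\
z_i &= \begin{cases} 1 & \text{if } z_i^* > 0 \\ 0 & \text{if } z_i^* \le 0 \end{cases} \\
y_i &= \mathbf{x}_i'\beta + \epsilon_i \quad \text{if } z_i = 1
\end{aligned}
$$

where $u_i$ and $\epsilon_i$ are jointly normal with 0 mean, standard deviations of 1 and $\sigma$, respectively, and correlation of $\rho$. The dependent variable $y_i$ (wage) is observed if the latent variable $z_i^*$ (the difference between market wage and reservation wage) is positive or, equivalently, if the indicator variable $z_i$ (labor force participation) is 1.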
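The model of interest that applies to the observations in the selected sample can be written as

$$E(y_i \mid z_i^* > 0) = \mathbf{x}_i'\beta + \rho\sigma\,\lambda(\mathbf{w}_i'\gamma)$$

where $\lambda(\mathbf{w}_i'\gamma) = \phi(\mathbf{w}_i'\gamma)/\Phi(\mathbf{w}_i'\gamma)$ is the inverse Mills ratio. Hence, the following regression equation is valid for the observations for which $z_i = 1$:

$$y_i = \mathbf{x}_i'\beta + \rho\sigma\,\lambda(\mathbf{w}_i'\gamma) + v_i$$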
Therefore, estimates of $\beta$ that are obtained from the OLS regression of $y$ on $\mathbf{x}$ by using the selected sample (that is, the sample for which $z_i = 1$) suffer from omitted variable bias whenever selection bias is present. Although maximum likelihood estimation of $\beta$ is consistent and efficient, Heckman’s two-step method is more frequently used. Heckman’s two-step method can be requested by specifying the HECKIT option of the PROC QLIM statement.
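The source of the omitted variable can be checked numerically: in the selected sample the outcome error no longer has mean zero, and its mean equals $\rho\sigma\,\phi(a)/\Phi(a)$ when the selection index is $a$. The small Monte Carlo sketch below illustrates this with arbitrary, made-up parameter values (`index=0.5`, `sigma=2.0`, `rho=0.6`); the helper names are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def inverse_mills(a):
    """Inverse Mills ratio: phi(a) / Phi(a)."""
    std = NormalDist()
    return std.pdf(a) / std.cdf(a)


def simulate_selected_error_mean(index, sigma, rho, n=200_000, seed=12345):
    """Average outcome error among draws that satisfy index + u > 0."""
    rng = random.Random(seed)
    selected = []
    for _ in range(n):
        u = rng.gauss(0.0, 1.0)
        # eps has standard deviation sigma and correlation rho with u
        eps = sigma * (rho * u + sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0))
        if index + u > 0:
            selected.append(eps)
    return fmean(selected)


# Illustrative values only; not taken from any data set
simulated = simulate_selected_error_mean(index=0.5, sigma=2.0, rho=0.6)
theoretical = 0.6 * 2.0 * inverse_mills(0.5)
```

The two quantities should agree up to simulation noise, which is why a regression on $\mathbf{x}$ alone in the selected sample omits a relevant term.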
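Heckman’s two-step method is as follows:
1. Obtain $\hat{\gamma}$, the estimate of the parameters of the probability that $z_i = 1$, by probit analysis for the full sample, using regressors $\mathbf{w}_i$ and the binary dependent variable $z_i$. Compute $\hat{\lambda}_i = \phi(\mathbf{w}_i'\hat{\gamma})/\Phi(\mathbf{w}_i'\hat{\gamma})$.
2. Obtain $\hat{\beta}$ and $\hat{\beta}_\lambda$, the estimates of $\beta$ and $\beta_\lambda = \rho\sigma$, by least squares regression of $y_i$ on $\mathbf{x}_i$ and $\hat{\lambda}_i$, using the observations in the selected subsample.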
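The two steps can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the algorithm that the HECKIT option uses: the probit is fit by Fisher scoring from a zero start, the least squares step solves the normal equations by Gaussian elimination, and all function names and the simulated data (true values $\gamma = (0.5, 1)$, $\beta = (1, 2)$, $\sigma = 1$, $\rho = 0.5$) are invented for the example.

```python
import random
from math import sqrt
from statistics import NormalDist

STD = NormalDist()  # standard normal density and CDF


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def least_squares(rows, target):
    """OLS coefficients from the normal equations X'X b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, target)) for i in range(k)]
    return solve(xtx, xty)


def probit_fit(rows, z, iterations=15):
    """Step 1: probit coefficients by Fisher scoring."""
    k = len(rows[0])
    gamma = [0.0] * k
    for _ in range(iterations):
        score = [0.0] * k
        info = [[0.0] * k for _ in range(k)]
        for r, zi in zip(rows, z):
            idx = sum(g * v for g, v in zip(gamma, r))
            p = min(max(STD.cdf(idx), 1e-12), 1.0 - 1e-12)  # guard division
            d = STD.pdf(idx)
            wgt = d / (p * (1.0 - p))
            for i in range(k):
                score[i] += (zi - p) * wgt * r[i]
                for j in range(k):
                    info[i][j] += d * wgt * r[i] * r[j]
        gamma = [g + s for g, s in zip(gamma, solve(info, score))]
    return gamma


def heckman_two_step(w_rows, x_rows, z, y):
    """Return (gamma_hat, beta_hat, beta_lambda_hat)."""
    gamma = probit_fit(w_rows, z)
    sel_rows, sel_y = [], []
    for w, x, zi, yi in zip(w_rows, x_rows, z, y):
        if zi == 1:
            idx = sum(g * v for g, v in zip(gamma, w))
            lam = STD.pdf(idx) / STD.cdf(idx)  # estimated inverse Mills ratio
            sel_rows.append(x + [lam])         # step 2 regressors: x and lambda
            sel_y.append(yi)
    coef = least_squares(sel_rows, sel_y)
    return gamma, coef[:-1], coef[-1]


# Simulated data with made-up true parameters (see lead-in)
rng = random.Random(2024)
w_rows, x_rows, z, y = [], [], [], []
for _ in range(4000):
    w, x, u = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
    eps = 0.5 * u + sqrt(1 - 0.5 ** 2) * rng.gauss(0, 1)
    chosen = 0.5 + 1.0 * w + u > 0
    w_rows.append([1.0, w])
    x_rows.append([1.0, x])
    z.append(1 if chosen else 0)
    y.append(1.0 + 2.0 * x + eps if chosen else None)

gamma_hat, beta_hat, beta_lambda_hat = heckman_two_step(w_rows, x_rows, z, y)
```

The sketch returns point estimates only; the least squares standard errors from step 2 would still need the correction described in the text.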
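The standard least squares estimators of the population variance $\sigma^2$ and the variances of the estimated coefficients are incorrect. To test hypotheses, the correct ones need to be calculated. An estimator of $\sigma^2$ is

$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} e_i^2 + \hat{\beta}_\lambda^2\,\frac{1}{n}\sum_{i=1}^{n} \hat{\delta}_i$$

where $n$ is the selected subsample size, $e_i$ is the residual for the $i$th observation obtained from step 2, and $\hat{\delta}_i = \hat{\lambda}_i(\hat{\lambda}_i + \mathbf{w}_i'\hat{\gamma})$. Let $\mathbf{X}_*$ be an $n \times (K+1)$ matrix with $i$th row $[\mathbf{x}_i' \;\; \hat{\lambda}_i]$, where $K$ is the number of regressors in $\mathbf{x}_i$, and define $\mathbf{W}$ similarly with $i$th row $\mathbf{w}_i'$. Then the estimator of the asymptotic covariance of $[\hat{\beta}, \hat{\beta}_\lambda]$ is

$$\text{Est.Var}\left[\hat{\beta}, \hat{\beta}_\lambda\right] = \hat{\sigma}^2 \left[\mathbf{X}_*'\mathbf{X}_*\right]^{-1} \left[\mathbf{X}_*'(\mathbf{I} - \hat{\rho}^2\hat{\Delta})\mathbf{X}_* + \mathbf{Q}\right] \left[\mathbf{X}_*'\mathbf{X}_*\right]^{-1}$$

where $\hat{\rho}^2 = \hat{\beta}_\lambda^2/\hat{\sigma}^2$, $\hat{\Delta} = \text{diag}(\hat{\delta}_i)$, and

$$\mathbf{Q} = \hat{\rho}^2 \left(\mathbf{X}_*'\hat{\Delta}\mathbf{W}\right) \hat{\mathbf{V}} \left(\mathbf{W}'\hat{\Delta}\mathbf{X}_*\right)$$

where $\hat{\mathbf{V}}$ is the estimator of the asymptotic covariance of the probit coefficients that are obtained in step 1. With the HECKIT option, a numerical estimated asymptotic variance is used.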
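As a small worked illustration of the variance correction, the sketch below computes $\hat{\delta}_i = \hat{\lambda}_i(\hat{\lambda}_i + \mathbf{w}_i'\hat{\gamma})$, the corrected $\hat{\sigma}^2$ (mean squared residual plus $\hat{\beta}_\lambda^2$ times the mean of $\hat{\delta}_i$), and $\hat{\rho}^2 = \hat{\beta}_\lambda^2/\hat{\sigma}^2$ from step 2 output. The function name and the toy inputs are assumptions for the example; the full sandwich covariance matrix is not computed here.

```python
from statistics import fmean


def corrected_sigma2(residuals, lam, index, beta_lambda):
    """Corrected variance estimate after Heckman's two-step method.

    residuals   : step 2 residuals e_i for the selected subsample
    lam         : estimated inverse Mills ratios lambda_hat_i
    index       : estimated probit indices w_i'gamma_hat
    beta_lambda : step 2 coefficient on lambda_hat
    """
    delta = [l * (l + a) for l, a in zip(lam, index)]
    mean_sq_resid = fmean([e * e for e in residuals])
    sigma2 = mean_sq_resid + beta_lambda ** 2 * fmean(delta)
    rho2 = beta_lambda ** 2 / sigma2
    return sigma2, rho2, delta


# Toy inputs chosen so that the arithmetic is easy to follow by hand:
# delta = [0.75, 1.0], mean squared residual = 1.0,
# sigma2 = 1.0 + 0.36 * 0.875 = 1.315
sigma2, rho2, delta = corrected_sigma2(
    residuals=[1.0, -1.0], lam=[0.5, 1.0], index=[1.0, 0.0], beta_lambda=0.6
)
```

Because $\hat{\delta}_i$ is positive for these inputs, the corrected estimate exceeds the naive mean squared residual.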
In the selected regression model, when the coefficient of $\hat{\lambda}_i$ is zero, there is no need for Heckman’s two-step estimation method; a simple regression of $y_i$ on $\mathbf{x}_i$ produces consistent estimates for $\beta$, and the OLS standard errors are correct. Thus, a standard $t$ test on $\beta_\lambda$ (using the estimate $\hat{\beta}_\lambda$ from step 2 and the uncorrected standard errors) is a valid test of the null hypothesis of no selection bias.