Estimation Methods

Consider the general nonlinear model:

$\displaystyle  {\bepsilon }_{t}  $
$\displaystyle = $
$\displaystyle  \Strong{q} (\Strong{y}_{t}, \Strong{x}_{t}, {\btheta })  $
$\displaystyle \Strong{z}_{t}  $
$\displaystyle = $
$\displaystyle  Z(\Strong{x}_{t}) \nonumber  $

where q ${\in }{ R^{g}}$ is a real vector valued function of y$_{t}$${\in }{ R^{g}}$, x$_{t}$${\in } { R^{l}}$, ${\btheta }{\in }{ R^{p}}$, where g is the number of equations, l is the number of exogenous variables (lagged endogenous variables are considered exogenous here), p is the number of parameters, and t ranges from 1 to n. ${\mb {z}_{t}}$${\in }R^{k}$ is a vector of instruments. ${\bepsilon }_{t}$ is an unobservable disturbance vector with the following properties:

$\displaystyle  E({\bepsilon }_{t})  $
$\displaystyle = $
$\displaystyle  0  $
$\displaystyle E({\bepsilon }_{t} {\bepsilon }^{}_{t} )  $
$\displaystyle = $
$\displaystyle  {\bSigma } \nonumber  $

All of the methods implemented in PROC MODEL aim to minimize an objective function. The following table summarizes the objective functions that define the estimators and the corresponding estimator of the covariance of the parameter estimates for each method.

Table 19.1: Summary of PROC MODEL Estimation Methods



Objective Function

Covariance of $\theta $



${\mb {r} ’\mb {r} /n}$

${( \mb {X} ’(\mr {diag}(\mb {S} )^{-1} {\otimes } \mb {I} )\mb {X} )^{-1}}$



${\mb {r} ’(\mr {diag}(\mb {S} )^{-1} {\otimes } \mb {I} )\mb {r} /n}$

${( \mb {X} ’(\mr {diag}(\mb {S} )^{-1} {\otimes } \mb {I} )\mb {X} )^{-1}}$



${\mb {r} ’( \mb {S} ^{-1}_{\mr {OLS}} {\otimes } \mb {I} )\mb {r} /n}$

${( \mb {X} ’(\mb {S} ^{-1} {\otimes } \mb {I} )\mb {X} )^{-1}}$



${\mb {r} ’(\mb {S} ^{-1} {\otimes } \mb {I} )\mb {r} /n}$

${( \mb {X} ’(\mb {S} ^{-1} {\otimes } \mb {I} )\mb {X} )^{-1}}$



${\mb {r} ’(\mb {I} {\otimes } \mb {W} )\mb {r} /n}$

${( \mb {X} ’(\mr {diag}(\mb {S} )^{-1} {\otimes } \mb {W} )\mb {X} )^{-1}}$



${\mb {r} ’(\mr {diag}(\mb {S} )^{-1} {\otimes } \mb {W} )\mb {r} /n}$

${( \mb {X} ’(\mr {diag}(\mb {S} )^{-1} {\otimes } \mb {W} )\mb {X} )^{-1}}$



${\mb {r} ’( \mb {S} ^{-1}_{\mr {N2SLS}} {\otimes } \mb {W} )\mb {r} /n}$

${( \mb {X} ’(\mb {S} ^{-1} {\otimes } \mb {W} )\mb {X} )^{-1}}$



${\mb {r} ’(\mb {S} ^{-1} {\otimes } \mb {W} )\mb {r} /n}$

${( \mb {X} ’(\mb {S} ^{-1} {\otimes } \mb {W} )\mb {X} )^{-1}}$



${[n\mb {m}_{n}(\theta )]’ \hat{\mb {V}}^{-1}_{\mr {N2SLS}}[n\mb {m}_{n}(\theta )]}/n$

${ [(\mb {Y} \mb {X} )’\hat{\mb {V}}^{-1}(\mb {Y} \mb {X} )]^{-1}}$



${[n\mb {m}_{n}(\theta )]’ \hat{\mb {V}}^{-1}[n\mb {m}_{n}(\theta )]}/n$

${ [(\mb {Y} \mb {X} )’\hat{\mb {V}}^{-1}(\mb {Y} \mb {X} )]^{-1}}$



${constant+\frac{n}{2}{\ln }(\det (\mb {S} ))}$

${ [\hat{\mb {Z}} ’ ( \mb {S} ^{-1} {\otimes } \mb {I} ) \hat{\mb {Z}}]^{-1}}$


${-\sum _{1}^{n}{{\ln } |(\mb {J}_{t})|}}$


The column labeled Instruments identifies the estimation methods that require instruments. The variables used in this table and the remainder of this chapter are defined as follows:

${n =}$ is the number of nonmissing observations.

${g =}$ is the number of equations.

${k =}$ is the number of instrumental variables.

${\mb {r} = \left[\begin{matrix} r_{1}   \\ r_{2}   \\ {\vdots }   \\ r_{g}  \end{matrix}\right]}$ is the ${ng \times 1}$ vector of residuals for the g equations stacked together.

${\mb {r}_{i} = \left[\begin{matrix}  q_{i}(\mb {y}_{1}, \mb {x}_{1}, \btheta )   \\ q_{i}(\mb {y}_{2}, \mb {x}_{2}, \btheta )   \\ \vdots   \\ q_{i}(\mb {y}_{n}, \mb {x}_{n}, \btheta )   \end{matrix}\right]}$ is the ${n \times 1}$ column vector of residuals for the i th equation.


is a ${g \times g}$ matrix that estimates ${\bSigma }$, the covariances of the errors across equations (referred to as the S matrix).


is an ${ng \times p}$ matrix of partial derivatives of the residual with respect to the parameters.


is an ${n \times n}$ matrix, ${\mb {Z} (\mb {Z} ’\mb {Z} )^{-1}\mb {Z} ’}$.


is an ${n \times k}$ matrix of instruments.


is a ${gk \times ng}$ matrix of instruments. ${\mb {Y} = \mb {I}_{g} {\otimes } \mb {Z} ’}$.

$\hat{\mb {Z}}$

${\hat{\mb {Z}} = ( \hat{Z}_{1}, \hat{Z}_{2}, {\ldots },\hat{Z}_{p} )}$ is an ${ng {\times } p}$ matrix. ${\hat{Z}_{i}}$ is a ${ng {\times } 1}$ column vector obtained from stacking the columns of

$\displaystyle  \Strong{U} \frac{1}{n} \sum _{t=1}^{n}{ \left( \frac{\partial \Strong{q} (\Strong{y}_{t}, \Strong{x}_{t}, {\btheta })}{\partial y_{t}} \right)}^{-1} \frac{{\partial }^{2} \Strong{q} (\Strong{y}_{t}, \Strong{x}_{t},{\btheta })}{{\partial } y_{t}{\partial } \theta _{i}} - \Strong{Q}_{i} \nonumber  $

is an ${n {\times } g}$ matrix of residual errors. ${\mb {U} = {{\bepsilon }_{1}, {{\bepsilon }}_{2}, {\ldots }, {{\bepsilon }}_{n}}’}$.


is the ${n {\times } g}$ matrix ${{\mb {q} (\mb {y}_{1}, \mb {x}_{1}, \btheta ), \mb {q} (\mb {y}_{2}, \mb {x}_{2}, \btheta ),{\ldots }, \mb {q} (\mb {y}_{n}, \mb {x}_{n}, \btheta )}}$.

Q $_{i}$

is an ${n {\times } g}$ matrix ${\frac{{\partial }\mb {Q}}{{\partial } \theta _{i}} }$.


is an ${n \times n}$ identity matrix.

J $_{t}$

is ${\frac{{\partial } \mb {q} (\mb {y}_{t}, \mb {x}_{t}, \btheta )}{{\partial }\mb {y}_{t}^{}}}$, which is a ${g \times g}$ Jacobian matrix.

${\mb {m}_{n}}$

is first moment of the crossproduct ${\mb {q} (\mb {y}_{t}, \mb {x}_{t}, \btheta ) {\otimes } \mb {z}_{t}}$,

${m_{n}=\frac{1}{n} \sum _{t=1}^{n}{\mb {q} (\mb {y}_{t}, \mb {x}_{t}, \btheta ) {\otimes } \mb {z}_{t}}}$

z $_{t}$

is a ${k}$ column vector of instruments for observation ${t}$. $\mb {z} ’_{t}$ is also the ${t}$th row of Z.

$\hat{\mb {V}}$

is the ${gk \times gk}$ matrix that represents the variance of the moment functions.


is the number of instrumental variables used.


is the constant ${\frac{ng}{2} (1 + {\ln }( 2 {\pi } ))}$.

${\otimes }$

is the notation for a Kronecker product.

All vectors are column vectors unless otherwise noted. Other estimates of the covariance matrix for FIML are also available.

Dependent Regressors and Two-Stage Least Squares

Ordinary regression analysis is based on several assumptions. A key assumption is that the independent variables are in fact statistically independent of the unobserved error component of the model. If this assumption is not true (if the regressor varies systematically with the error), then ordinary regression produces inconsistent results. The parameter estimates are biased.

Regressors might fail to be independent variables because they are dependent variables in a larger simultaneous system. For this reason, the problem of dependent regressors is often called simultaneous equation bias. For example, consider the following two-equation system:

\[  y_{1} = a_{1} + b_{1} y_{2} + c_{1} x_{1} + {\epsilon }_{1}  \]
\[  y_{2} = a_{2} + b_{2} y_{1} + c_{2} x_{2} + {\epsilon }_{2}  \]

In the first equation, y$_{2}$ is a dependent, or endogenous, variable. As shown by the second equation, y$_{2}$ is a function of y$_{1}$, which by the first equation is a function of ${\epsilon }$$_{1}$, and therefore y$_{2}$ depends on ${\epsilon }$$_{1}$. Likewise, y$_{1}$ depends on ${\epsilon }$$_{2}$ and is a dependent regressor in the second equation. This is an example of a simultaneous equation system; y$_{1}$ and y$_{2}$ are a function of all the variables in the system.

Using the ordinary least squares (OLS) estimation method to estimate these equations produces biased estimates. One solution to this problem is to replace y$_{1}$ and y$_{2}$ on the right-hand side of the equations with predicted values, thus changing the regression problem to the following:

\[  y_{1} = a_{1} + b_{1} \hat{y}_{2} + c_{1} x_{1} + {\epsilon }_{1}  \]
\[  y_{2} = a_{2} + b_{2} \hat{y}_{1} + c_{2} x_{2} + {\epsilon }_{2}  \]

This method requires estimating the predicted values ${\hat{y}_{1}}$ and ${\hat{y}_{2}}$ through a preliminary, or first stage, instrumental regression. An instrumental regression is a regression of the dependent regressors on a set of instrumental variables, which can be any independent variables useful for predicting the dependent regressors. In this example, the equations are linear and the exogenous variables for the whole system are known. Thus, the best choice for instruments (of the variables in the model) are the variables x$_{1}$ and x$_{2}$.

This method is known as two-stage least squares or 2SLS, or more generally as the instrumental variables method. The 2SLS method for linear models is discussed in Pindyck (1981, p. 191–192). For nonlinear models this situation is more complex, but the idea is the same. In nonlinear 2SLS, the derivatives of the model with respect to the parameters are replaced with predicted values. See the section Choice of Instruments for further discussion of the use of instrumental variables in nonlinear regression.

To perform nonlinear 2SLS estimation with PROC MODEL, specify the instrumental variables with an INSTRUMENTS statement and specify the 2SLS or N2SLS option in the FIT statement. The following statements show how to estimate the first equation in the preceding example with PROC MODEL:

   proc model data=in;
      y1 = a1 + b1 * y2 + c1 * x1;
      fit y1 / 2sls;
      instruments x1 x2;

The 2SLS or instrumental variables estimator can be computed by using a first-stage regression on the instrumental variables as described previously. However, PROC MODEL actually uses the equivalent but computationally more appropriate technique of projecting the regression problem into the linear space defined by the instruments. Thus, PROC MODEL does not produce any first stage results when you use 2SLS. If you specify the FSRSQ option in the FIT statement, PROC MODEL prints First-Stage R$^{2}$ statistic for each parameter estimate.

Formally, the ${\hat{\btheta }}$ that minimizes

$\displaystyle  \hat{S}_{n} = \frac{1}{n} \left(\sum _{t=1}^{n}{(\Strong{q} (\Strong{y}_{t}, \Strong{x}_{t}, \theta ) {\otimes }\Strong{z}_{t})}\right)’ \left(\sum _{t=1}^{n} I {\otimes }\Strong{z}_{t}\Strong{z}_{t}’\right)^{-1} \left(\sum _{t=1}^{n}{(\Strong{q} (\Strong{y}_{t}, \Strong{x}_{t},{\btheta }) {\otimes } \Strong{z}_{t})}\right) \nonumber  $

is the N2SLS estimator of the parameters. The estimate of ${\bSigma }$ at the final iteration is used in the covariance of the parameters given in Table 19.1. See Amemiya (1985, p. 250) for details on the properties of nonlinear two-stage least squares.

Seemingly Unrelated Regression

If the regression equations are not simultaneous (so there are no dependent regressors), seemingly unrelated regression (SUR) can be used to estimate systems of equations with correlated random errors. The large-sample efficiency of an estimation can be improved if these cross-equation correlations are taken into account. SUR is also known as joint generalized least squares or Zellner regression. Formally, the ${\hat{\btheta }}$ that minimizes

\[  \hat{S}_{n} = \frac{1}{n} \sum _{t=1}^{n}{\mb {q} (\mb {y}_{t}, \mb {x}_{t}, \btheta )’ \hat{{\bSigma }}^{-1} \mb {q} (\mb {y}_{t}, \mb {x}_{t}, \btheta })  \]

is the SUR estimator of the parameters.

The SUR method requires an estimate of the cross-equation covariance matrix, ${\bSigma }$. PROC MODEL first performs an OLS estimation, computes an estimate, ${\hat{\bSigma }}$, from the OLS residuals, and then performs the SUR estimation based on ${\hat{\bSigma }}$. The OLS results are not printed unless you specify the OLS option in addition to the SUR option.

You can specify the ${\hat{\bSigma }}$ to use for SUR by storing the matrix in a SAS data set and naming that data set in the SDATA= option. You can also feed the ${\hat{\bSigma }}$ computed from the SUR residuals back into the SUR estimation process by specifying the ITSUR option. You can print the estimated covariance matrix $\hat{{\bSigma }}$ by using the COVS option in the FIT statement.

The SUR method requires estimation of the ${\bSigma }$ matrix, and this increases the sampling variability of the estimator for small sample sizes. The efficiency gain that SUR has over OLS is a large sample property, and you must have a reasonable amount of data to realize this gain. For a more detailed discussion of SUR, see Pindyck and Rubinfeld (1981, p. 331-333).

Three-Stage Least Squares Estimation

If the equation system is simultaneous, you can combine the 2SLS and SUR methods to take into account both dependent regressors and cross-equation correlation of the errors. This is called three-stage least squares (3SLS).

Formally, the ${\hat{\btheta }}$ that minimizes

$\displaystyle  \hat{S}_{n} = \frac{1}{n} \left(\sum _{t=1}^{n} (\Strong{q} (\Strong{y}_{t}, \Strong{x}_{t}, {\btheta }) {\otimes }\Strong{z}_{t})\right)^{} \left(\sum _{t=1}^{n}{(\hat{{\bSigma }} {\otimes }\Strong{z}_{t}\Strong{z}_{t}’)}\right)^{-1} \left(\sum _{t=1}^{n}{(\Strong{q} (\Strong{y}_{t}, \Strong{x}_{t},{\btheta }){\otimes }\Strong{z}_{t})}\right) \nonumber  $

is the 3SLS estimator of the parameters. For more details on 3SLS, see Gallant (1987, p. 435).

Residuals from the 2SLS method are used to estimate the ${\bSigma }$ matrix required for 3SLS. The results of the preliminary 2SLS step are not printed unless the 2SLS option is also specified.

To use the three-stage least squares method, specify an INSTRUMENTS statement and use the 3SLS or N3SLS option in either the PROC MODEL statement or a FIT statement.

Generalized Method of Moments (GMM)

For systems of equations with heteroscedastic errors, generalized method of moments (GMM) can be used to obtain efficient estimates of the parameters. See the section Heteroscedasticity for alternatives to GMM.

Consider the nonlinear model

$\displaystyle  \bepsilon _{t}  $
$\displaystyle = $
$\displaystyle  \Strong{q} (\Strong{y}_{t}, \Strong{x}_{t}, {\btheta })  $
$\displaystyle \Strong{z}_{t}  $
$\displaystyle = $
$\displaystyle  Z(\Strong{x}_{t}) \nonumber  $

where ${\mb {z}_{t}}$ is a vector of instruments and ${\bepsilon }_{t}$ is an unobservable disturbance vector that can be serially correlated and nonstationary.

In general, the following orthogonality condition is desired:

\[  E (\bepsilon _{t} {\otimes } \mb {z}_{t}) = 0  \]

This condition states that the expected crossproducts of the unobservable disturbances, ${{\bepsilon }_{t}}$, and functions of the observable variables are set to 0. The first moment of the crossproducts is

$\displaystyle  \Strong{m}_{n}  $
$\displaystyle = $
$\displaystyle  \frac{1}{n} \sum _{t=1}^{n}{\Strong{m} (\Strong{y}_{t}, \Strong{x}_{t}, \btheta )}  $
$\displaystyle \Strong{m} (\Strong{y}_{t}, \Strong{x}_{t}, \btheta )  $
$\displaystyle = $
$\displaystyle  \Strong{q} (\Strong{y}_{t}, \Strong{x}_{t}, \btheta ) {\otimes } \Strong{z}_{t} \nonumber  $

where ${\mb {m} (\mb {y}_{t}, \mb {x}_{t}, \btheta ){\in } R^{gk}}$.

The case where ${gk > p}$ is considered here, where p is the number of parameters.

Estimate the true parameter vector ${\theta ^{0}}$ by the value of ${\hat{\theta }}$ that minimizes

\[  S(\theta , {\bV }) = [n\mb {m}_{n}(\theta )]’ {\bV }^{-1}[n\mb {m}_{n}(\theta )]/n  \]


$\displaystyle  {\bV } = \textrm{Cov}\left([n\Strong{m}_{n}(\theta ^{0})], [n\Strong{m}_{n}(\theta ^{0})]’\right) \nonumber  $

The parameter vector that minimizes this objective function is the GMM estimator. GMM estimation is requested in the FIT statement with the GMM option.

The variance of the moment functions, ${\bV }$, can be expressed as

$\displaystyle  {\bV }  $
$\displaystyle = $
$\displaystyle  E \left(\sum _{t=1}^{n}{\bepsilon _{t} {\otimes } \Strong{z}_{t}}\right) \left(\sum _{s=1}^{n}{\bepsilon _{s} {\otimes } \Strong{z}_{s}}\right)’  $
$\displaystyle  $
$\displaystyle = $
$\displaystyle  \sum _{t=1}^{n} \sum _{s=1}^{n}{E \left[ (\bepsilon _{t} {\otimes } \Strong{z}_{t}) ( \bepsilon _{s} {\otimes } \Strong{z}_{s})’\right]}  $
$\displaystyle  $
$\displaystyle = $
$\displaystyle  n {\bS }_{n}^{0} \nonumber  $

where ${ {\bS }_{n}^{0}}$ is estimated as

\[  \hat{{\bS }}_{n} = \frac{1}{n} \sum _{t=1}^{n} \sum _{s=1}^{n}{(\mb {q} (\mb {y}_{t}, \mb {x}_{t}, \btheta ) {\otimes }\mb {z}_{t})(\mb {q} (\mb {y}_{s}, \mb {x}_{s}, \btheta ) {\otimes } \mb {z}_{s})’}  \]

Note that ${\hat{\bS }_{n}}$ is a ${gk{\times }gk}$ matrix. Because Var ${(\hat{\bS }_{n})}$ does not decrease with increasing n, you consider estimators of ${ {\bS }_{n}^{0}}$ of the form:

$\displaystyle  \hat{{\bS }}_{n}(l(n))  $
$\displaystyle = $
$\displaystyle  \sum _{{\tau } = -n + 1}^{n-1}{\hat{w} ( \frac{\tau }{l(n)} ) {\bD } \hat{{\bS }}_{n,{\tau }}} {\bD }  $
$\displaystyle \hat{{\bS }}_{n,{\tau }}  $
$\displaystyle = $
$\displaystyle \begin{cases}  \sum \limits _{t=1+{\tau }}^{n}{[\Strong{q} (\Strong{y}_{t}, \Strong{x}_{t},{\btheta }^{\# }) {\otimes }\Strong{z}_{t}][\Strong{q} (\Strong{y}_{t-{\tau }}, \Strong{x}_{t-{\tau }}, {\btheta }^{\# }){\otimes }\Strong{z}_{t-{\tau }}]’} &  {\tau }\ge 0 \\ (\hat{{\bS }}_{n,-{\tau }})’ &  {\tau }<0 \end{cases} $
$\displaystyle \hat{w}(\frac{\tau }{l(n)})  $
$\displaystyle = $
$\displaystyle \begin{cases}  w(\frac{\tau }{l(n)}) &  l(n) > 0 \\ \delta _{\tau ,0} &  l(n) = 0 \end{cases} \nonumber  $

where ${l(n)}$ is a scalar function that computes the bandwidth parameter, ${w({\cdot })}$ is a scalar valued kernel, and the Kronecker delta function, $\delta _{i,j}$, is 1 if $i=j$ and 0 otherwise. The diagonal matrix ${\bD }$ is used for a small sample degrees of freedom correction (Gallant 1987). The initial ${\theta ^{\# }}$ used for the estimation of ${\hat{{\bS }}_{n}}$ is obtained from a 2SLS estimation of the system. The degrees of freedom correction is handled by the VARDEF= option as it is for the S matrix estimation.

The following kernels are supported by PROC MODEL. They are listed with their default bandwidth functions.


$\displaystyle  w(x)  $
$\displaystyle = $
$\displaystyle \begin{cases}  1-|x| &  |x|\le 1 \\ 0 &  \text {otherwise} \end{cases} $
$\displaystyle l(n)  $
$\displaystyle = $
$\displaystyle  \frac{1}{2} n^{1 / 3} \nonumber  $


$\displaystyle  w(x)  $
$\displaystyle = $
$\displaystyle \begin{cases}  1-6|x|^{2}+6|x|^{3} &  0\le |x|\le \frac{1}{2} \\ 2(1-|x|)^{3} &  \frac{1}{2}\le |x|\le 1 \\ 0 &  \text {otherwise} \end{cases} $
$\displaystyle l(n)  $
$\displaystyle = $
$\displaystyle  n^{1 / 5} \nonumber  $

Quadratic spectral: KERNEL=QS

$\displaystyle  w(x)  $
$\displaystyle = $
$\displaystyle  \frac{25}{12{\pi }^{2} x^{2}} \left( \frac{\sin (6{\pi }x/5)}{6{\pi }x/5} - \cos (6{\pi }x/5) \right)  $
$\displaystyle l(n)  $
$\displaystyle = $
$\displaystyle  \frac{1}{2} n^{1 / 5} \nonumber  $

Figure 19.23: Kernels for Smoothing

Kernels for Smoothing

Details of the properties of these and other kernels are given in Andrews (1991). Kernels are selected with the KERNEL= option; KERNEL=PARZEN is the default. The general form of the KERNEL= option is

   KERNEL=( PARZEN | QS | BART, c, e )

where the ${e \ge 0}$ and ${c \ge 0}$ are used to compute the bandwidth parameter as

\[  l(n) = c n^{e}  \]

The bias of the standard error estimates increases for large bandwidth parameters. A warning message is produced for bandwidth parameters greater than ${n^{\frac{1}{3}}}$. For a discussion of the computation of the optimal ${l(n)}$, refer to Andrews (1991).

The Newey-West kernel (Newey and West 1987) corresponds to the Bartlett kernel with bandwidth parameter ${l(n) = L +1}$. That is, if the lag length for the Newey-West kernel is ${L}$, then the corresponding MODEL procedure syntax is KERNEL=(bart, L+1, 0).

Andrews and Monahan (1992) show that using prewhitening in combination with GMM can improve confidence interval coverage and reduce over rejection of t statistics at the cost of inflating the variance and MSE of the estimator. Prewhitening can be performed by using the %AR macros.

For the special case that the errors are not serially correlated—that is,

\[  E{(e_{t} {\otimes } \mb {z}_{t})(e_{s} {\otimes } \mb {z}_{s})} = 0 \hspace{1 cm} t {\ne } s  \]

the estimate for ${ {\bS }_{n}^{0}}$ reduces to

\[  \hat{{\bS }}_{n} = \frac{1}{n} \sum _{t=1}^{n}{[\mb {q} (\mb {y}_{t}, \mb {x}_{t}, \btheta ) {\otimes } \mb {z}_{t}] [\mb {q} (\mb {y}_{t}, \mb {x}_{t}, \btheta ) {\otimes } \mb {z}_{t}]’}  \]

The option KERNEL=(kernel,0,) is used to select this type of estimation when using GMM.

Covariance of GMM estimators

The covariance of GMM estimators, given a general weighting matrix ${\mb {V}}^{-1}_{\mr {G}}$, is

\[  { [(\mb {Y} \mb {X} )’{\mb {V}}^{-1}_{\mr {G}} (\mb {Y} \mb {X} )]^{-1}} (\mb {Y} \mb {X} )’{\mb {V}}^{-1}_{\mr {G}} \hat{\mb {V}}{\mb {V}} ^{-1}_{\mr {G}} (\mb {Y} \mb {X} ) { [(\mb {Y} \mb {X} )’{\mb {V}}^{-1}_{\mr {G}} (\mb {Y} \mb {X} )]^{-1}}  \]

By default or when GENGMMV is specified, this is the covariance of GMM estimators.

If the weighting matrix is the same as $\hat{\mb {V}}$, then the covariance of GMM estimators becomes

\[  { [(\mb {Y} \mb {X} )’ \hat{\mb {V}}^{-1} (\mb {Y} \mb {X} )]^{-1}}  \]

If NOGENGMMV is specified, this is used as the covariance estimators.

Testing Overidentifying Restrictions

Let r be the number of unique instruments times the number of equations. The value r represents the number of orthogonality conditions imposed by the GMM method. Under the assumptions of the GMM method, ${r-p}$ linearly independent combinations of the orthogonality should be close to zero. The GMM estimates are computed by setting these combinations to zero. When r exceeds the number of parameters to be estimated, the OBJECTIVE*N, reported at the end of the estimation, is an asymptotically valid statistic to test the null hypothesis that the overidentifying restrictions of the model are valid. The OBJECTIVE*N is distributed as a chi-square with ${r-p}$ degrees of freedom (Hansen 1982, p. 1049). When the GMM method is selected, the value of the overidentifying restrictions test statistic, also known as Hansen’s J test statistic, and its associated number of degrees of freedom are reported together with the probability under the null hypothesis.

Iterated Generalized Method of Moments (ITGMM)

Iterated generalized method of moments is similar to the iterated versions of 2SLS, SUR, and 3SLS. The variance matrix for GMM estimation is reestimated at each iteration with the parameters determined by the GMM estimation. The iteration terminates when the variance matrix for the equation errors change less than the CONVERGE= value. Iterated generalized method of moments is selected by the ITGMM option on the FIT statement. For some indication of the small sample properties of ITGMM, see Ferson and Foerster (1993).

Simulated Method of Moments (SMM)

The SMM method uses simulation techniques in model inference and estimation. It is appropriate for estimating models in which integrals appear in the objective function, and these integrals can be approximated by simulation. There might be various reasons for integrals to appear in an objective function (for example, transformation of a latent model into an observable model, missing data, random coefficients, heterogeneity, and so on).

This simulation method can be used with all the estimation methods except full information maximum likelihood (FIML) in PROC MODEL. SMM, also known as simulated generalized method of moments (SGMM), is the default estimation method because of its nice properties.

Estimation Details

A general nonlinear model can be described as

\[  {{\bepsilon }}_{t} = \mb {q} (\mb {y}_{t}, \mb {x}_{t}, \btheta )  \]

where q ${\in }{ R^{g}}$ is a real vector valued function of y $_{t}$${\in }{ R^{g}}$, x $_{t}$${\in } { R^{l}}$, $\btheta \in {R^{p}}$; g is the number of equations; l is the number of exogenous variables (lagged endogenous variables are considered exogenous here); p is the number of parameters; and t ranges from 1 to n. $\bepsilon _{t}$ is an unobservable disturbance vector with the following properties:

$\displaystyle  E(\bepsilon _{t})  $
$\displaystyle = $
$\displaystyle  0  $
$\displaystyle E(\bepsilon _{t} \bepsilon ^{}_{t} )  $
$\displaystyle = $
$\displaystyle  {\bSigma } \nonumber  $

In many cases, it is not possible to write $\mb {q} (\mb {y}_{t}, \mb {x}_{t}, \btheta )$ in a closed form. Instead $\mb {q} $ is expressed as an integral of a function $\mb {f} $; that is,

\[  \mb {q} (\mb {y}_{t}, \mb {x}_{t}, \btheta ) = \int \mb {f} (\mb {y}_{t}, \mb {x}_{t}, \btheta , \mb {u}_{t}) {d} P(\mb {u} )  \]

where f ${\in }{ R^{g}}$ is a real vector valued function of y $_{t}$${\in }{ R^{g}}$, x $_{t}$${\in } { R^{l}}$, $\btheta \in {R^{p}}$, and u $_{t}$${\in }{ R^{m}}$, $m$ is the number of stochastic variables with a known distribution $P(\mb {u} )$. Since the distribution of u is completely known, it is possible to simulate artificial draws from this distribution. Using such independent draws $\mb {u}_{ht}$, $h=1,\ldots , H$, and the strong law of large numbers, $\mb {q} $ can be approximated by

\[  {\frac{1}{H}}\sum _{h=1}^{H} \mb {f} (\mb {y}_{t}, \mb {x}_{t}, \btheta , \mb {u}_{ht}).  \]
Simulated Generalized Method of Moments (SGMM)

Generalized method of moments (GMM) is widely used to obtain efficient estimates for general model systems. When the moment conditions are not readily available in closed forms but can be approximated by simulation, simulated generalized method of moments (SGMM) can be used. The SGMM estimators have the nice property of being asymptotically consistent and normally distributed even if the number of draws $H$ is fixed (see McFadden 1989; Pakes and Pollard 1989).

Consider the nonlinear model

$\displaystyle  \bepsilon _{t}  $
$\displaystyle = $
$\displaystyle  \Strong{q} (\Strong{y}_{t}, \Strong{x}_{t}, \btheta ) = {\frac{1}{H}}\sum _{h=1}^{H} \Strong{f} (\Strong{y}_{t}, \Strong{x}_{t}, \btheta , \Strong{u}_{ht})  $
$\displaystyle \Strong{z}_{t}  $
$\displaystyle = $
$\displaystyle  Z(\Strong{x}_{t}) \nonumber  $

where ${\mb {z}_{t}} {\in }{ R^{k}}$ is a vector of $k$ instruments and $\bepsilon _{t}$ is an unobservable disturbance vector that can be serially correlated and nonstationary. In the case of no instrumental variables, ${\mb {z}_{t}}$ is 1. $\mb {q} (\mb {y}_{t}, \mb {x}_{t}, {\btheta })$ is the vector of moment conditions, and it is approximated by simulation.

In general, theory suggests the following orthogonality condition

\[  E (\bepsilon _{t} {\otimes } \mb {z}_{t}) = 0  \]

which states that the expected crossproducts of the unobservable disturbances, ${{\bepsilon }_{t}}$, and functions of the observable variables are set to 0. The sample means of the crossproducts are

$\displaystyle  \Strong{m}_{n}  $
$\displaystyle = $
$\displaystyle  \frac{1}{n} \sum _{t=1}^{n}{\Strong{m} (\Strong{y}_{t}, \Strong{x}_{t}, \btheta )}  $
$\displaystyle \Strong{m} (\Strong{y}_{t}, \Strong{x}_{t}, \btheta )  $
$\displaystyle = $
$\displaystyle  \Strong{q} (\Strong{y}_{t}, \Strong{x}_{t}, \btheta ) {\otimes } \Strong{z}_{t} \nonumber  $

where ${\mb {m} (\mb {y}_{t}, \mb {x}_{t}, \btheta ){\in } R^{gk}}$. The case where ${gk > p}$, where $p$ is the number of parameters, is considered here. An estimate of the true parameter vector ${\theta ^{0}}$ is the value of ${\hat{\theta }}$ that minimizes

\[  S(\theta , V) = [n\mb {m}_{n}(\theta )]’ {\bV }^{-1}[n\mb {m}_{n} (\theta )]/n  \]


\[  {\bV } = \mr {Cov}\left(\mb {m} (\theta ^{0}), \mb {m} (\theta ^{0})’\right).  \]

The steps for SGMM are as follows:

1. Start with a positive definite $\hat{{\bV }}$ matrix. This $\hat{{\bV }}$ matrix can be estimated from a consistent estimator of $\theta $. If $\hat{\theta }$ is a consistent estimator, then $\mb {u}_ t$ for $t=1,\ldots ,n$ can be simulated $H’$ number of times. A consistent estimator of ${\bV }$ is obtained as

\[  \hat{{\bV }} = \frac{1}{n} \sum _{t=1}^{n}[ {\frac{1}{H}}\sum _{h=1}^{H} \mb {f} (\mb {y}_{t}, \mb {x}_{t}, \hat{\btheta }, \mb {u}_{ht}) {\otimes } \mb {z}_ t] [ {\frac{1}{H}}\sum _{h=1}^{H} \mb {f} (\mb {y}_{t}, \mb {x}_{t}, \hat{\btheta }, \mb {u}_{ht}) {\otimes } \mb {z}_ t]’  \]

$H’$ must be large so that this is an consistent estimator of ${\bV }$.

2. Simulate $H$ number of $\mb {u}_ t$ for $t=1,\ldots ,n$. As shown by Gourieroux and Monfort (1993), the number of simulations $H$ does not need to be very large. For $H=10$, the SGMM estimator achieves 90% of the efficiency of the corresponding GMM estimator. Find $\hat{\theta }$ that minimizes the quadratic product of the moment conditions again with the weight matrix being $\hat{{\bV }}^{-1}$.

\[  \min _{\theta } [n\mb {m}_{n}(\theta )]’ \hat{{\bV }}^{-1} [n\mb {m}_{n}(\theta )]/n  \]

3. The covariance matrix of $\sqrt {n}\theta $ is given as (Gourieroux and Monfort 1993)

\[  {\bSigma }_1^{-1} {\bD } \hat{{\bV }}^{-1} {\bV }(\hat{\theta }) \hat{{\bV }}^{-1} {\bD }’ {\bSigma }_1^{-1} +{\frac{1}{H}}{\bSigma }_1^{-1} {\bD } \hat{{\bV }}^{-1} E[\mb {z} {\otimes } Var(\mb {f} |\mb {x} ) {\otimes } \mb {z} ] \hat{{\bV }}^{-1} {\bD }’ {\bSigma }_1^{-1}  \]

where ${\bSigma }_1={\bD } \hat{{\bV }}^{-1} {\bD }$, ${\bD }$ is the matrix of partial derivatives of the residuals with respect to the parameters, ${\bV }(\hat{\theta })$ is the covariance of moments from estimated parameters $\hat{\theta }$, and $Var(\mb {f} |\mb {x} )$ is the covariance of moments for each observation from simulation. The first term is the variance-covariance matrix of the exact GMM estimator, and the second term accounts for the variation contributed by simulating the moments.

Implementation in PROC MODEL

In PROC MODEL, if the user specifies the GMM and NDRAW options in the FIT statement, PROC MODEL first fits the model by using N2SLS and computes $\hat{{\bV }}$ by using the estimates from N2SLS and $H’$ simulation. If NO2SLS is specified in the FIT statement, $\hat{{\bV }}$ is read from VDATA= data set. If the user does not provide a $\hat{{\bV }}$ matrix, the initial starting value of $\theta $ is used as the estimator for computing the $\hat{{\bV }}$ matrix in step 1. If ITGMM option is specified instead of GMM, then PROC MODEL iterates from step 1 to step 3 until the ${\bV }$ matrix converges.

The consistency of the parameter estimates is not affected by the variance correction shown in the second term in step 3. The correction on the variance of parameter estimates is not computed by default. To add the adjustment, use ADJSMMV option on the FIT statement. This correction is of the order of ${\frac{1}{H}}$ and is small even for moderate $H$.

The following example illustrates how to use SMM to estimate a simple regression model. Suppose the model is

\[  y = a + b x + u, u \sim iid \,  N(0, s^2).  \]

First, consider the problem in a GMM context. The first two moments of $y$ are easily derived:

$\displaystyle  E(y)  $
$\displaystyle = $
$\displaystyle  a + b x  $
$\displaystyle E(y^2)  $
$\displaystyle = $
$\displaystyle  (a+bx)^2 + s^2 \nonumber  $

Rewrite the moment conditions in the form similar to the discussion above:

$\displaystyle  \epsilon _{1t}  $
$\displaystyle = $
$\displaystyle  y_ t - ( a + b x_ t)  $
$\displaystyle \epsilon _{2t}  $
$\displaystyle = $
$\displaystyle  y_ t^2 - ( a + b x_ t)^2 -s^2 \nonumber  $

Then you can estimate this model by using GMM with following statements:

   proc model data=a;
      parms a b s;
      instrument x;
      eq.m1 = y-(a+b*x);
      eq.m2 = y*y - (a+b*x)**2 - s*s;
      bound s > 0;
      fit m1 m2 / gmm;

Now suppose you do not have the closed form for the moment conditions. Instead you can simulate the moment conditions by generating $H$ number of simulated samples based on the parameters. Then the simulated moment conditions are

$\displaystyle  \epsilon _{1t}  $
$\displaystyle = $
$\displaystyle  \frac{1}{H} \sum _{h=1}^ H \{  y_ t - ( a + b x_ t + s u_{t,h}) \}   $
$\displaystyle \epsilon _{2t}  $
$\displaystyle = $
$\displaystyle  \frac{1}{H} \sum _{h=1}^ H \{  y_ t^2 - ( a + b x_ t + s u_{t,h})^2 \}  \nonumber  $

This model can be estimated by using SGMM with the following statements:

   proc model data=_tmpdata;
      parms a b s;
      instrument x;
      ysim = (a+b*x) + s * rannor( 98711 );
      eq.m1 = y-ysim;
      eq.m2 = y*y - ysim*ysim;
      bound s > 0;
      fit m1 m2 / gmm ndraw=10;

You can use the following MOMENT statement instead of specifying the two moment equations above:

   moment ysim=(1, 2);

In cases where you require a large number of moment equations, using the MOMENT statement to specify them is more efficient.

Note that the NDRAW= option tells PROC MODEL that this is a simulation-based estimation. Thus, the random number function RANNOR returns random numbers in estimation process. During the simulation, 10 draws of $m1$ and $m2$ are generated for each observation, and the averages enter the objective functions just as the equations specified previously.

Other Estimation Methods

The simulation method can be used not only with GMM and ITGMM, but also with OLS, ITOLS, SUR, ITSUR, N2SLS, IT2SLS, N3SLS, and IT3SLS. These simulation-based methods are similar to the corresponding methods in PROC MODEL; the only difference is that the objective functions include the average of the $H$ simulations.

Full Information Maximum Likelihood Estimation (FIML)

A different approach to the simultaneous equation bias problem is the full information maximum likelihood (FIML) estimation method (Amemiya 1977).

Compared to the instrumental variables methods (2SLS and 3SLS), the FIML method has these advantages and disadvantages:

  • FIML does not require instrumental variables.

  • FIML requires that the model include the full equation system, with as many equations as there are endogenous variables. With 2SLS or 3SLS, you can estimate some of the equations without specifying the complete system.

  • FIML assumes that the equations errors have a multivariate normal distribution. If the errors are not normally distributed, the FIML method might produce poor results. 2SLS and 3SLS do not assume a specific distribution for the errors.

  • The FIML method is computationally expensive.

The full information maximum likelihood estimators of ${\theta }$ and ${{\sigma }}$ are the ${\hat{\theta }}$ and ${\hat{{\sigma }}}$ that minimize the negative log-likelihood function:

$\displaystyle  \Strong{l}_{n}(\btheta , \bsigma ) =  $
$\displaystyle \frac{ng}{2} $
$\displaystyle  {\ln }(2{\pi }) - \sum _{t=1}^{n}{{\ln } \left( \left| \frac{{\partial }\Strong{q} (\Strong{y}_{t}, \Strong{x}_{t}, \btheta )}{{\partial } \Strong{y}_{t}^{}} \right| \right)} + \frac{n}{2} {\ln } \left( |{\bSigma }({\sigma })| \right)  $
$\displaystyle  $
$\displaystyle + $
$\displaystyle  \frac{1}{2} \textrm{tr} \left( {\bSigma }({\sigma })^{-1} \sum _{t=1}^{n}{\Strong{q} (\Strong{y}_{t}, \Strong{x}_{t}, \btheta ) \Strong{q} ’(\Strong{y}_{t}, \Strong{x}_{t}, \btheta )} \right) \nonumber  $

The option FIML requests full information maximum likelihood estimation. If the errors are distributed normally, FIML produces efficient estimators of the parameters. If instrumental variables are not provided, the starting values for the estimation are obtained from a SUR estimation. If instrumental variables are provided, then the starting values are obtained from a 3SLS estimation. The log-likelihood value and the l$_{2}$ norm of the gradient of the negative log-likelihood function are shown in the estimation summary.

FIML Details

To compute the minimum of ${\mb {l}_{n}(\btheta , {{\bsigma }})}$, this function is concentrated using the relation

\[  {\bSigma }(\theta ) = \frac{1}{n} \sum _{t=1}^{n}{\mb {q} (\mb {y}_{t}, \mb {x}_{t}, \btheta ) \mb {q} ’(\mb {y}_{t}, \mb {x}_{t}, \btheta )}  \]

This results in the concentrated negative log-likelihood function discussed in Davidson and MacKinnon (1993):

\[  \mb {l}_{n}(\btheta ) = \frac{ng}{2} (1 + {\ln }(2{\pi })) - \sum _{t=1}^{n}{{\ln }\biggl |\frac{{\partial }}{{\partial } \mb {y}_{t}^{}} \mb {q} (\mb {y}_{t}, \mb {x}_{t}, \btheta ) \biggr |} + \frac{n}{2} {\ln } |{\bSigma }(\theta )|  \]

The gradient of the negative log-likelihood function is

\[  \frac{{\partial }}{{\partial } \theta _{i}} \mb {l}_{n}(\btheta ) = \sum _{t=1}^{n}{{\nabla }_{i}(t)}  \]
$\displaystyle  {\nabla }_{i}(t)  $
$\displaystyle = $
$\displaystyle  -\textrm{tr} \left( \left( \frac{{\partial }\Strong{q} (\Strong{y}_{t}, \Strong{x}_{t}, \btheta )}{{\partial } \Strong{y}_{t}^{}} \right)^{-1} \frac{{\partial }^{2}\Strong{q} (\Strong{y}_{t}, \Strong{x}_{t},\btheta )}{{\partial } \Strong{y}_{t}^{} {\partial } \theta _{i}} \right)  $
$\displaystyle  $
$\displaystyle + $
$\displaystyle  \frac{1}{2} \textrm{tr} \left( \bSigma (\theta )^{-1} ~ \frac{{\partial }{\bSigma }(\theta )}{{\partial } \theta _{i}} \right.  $
$\displaystyle  $
$\displaystyle  $
$\displaystyle  \left. \left[ I - {\bSigma }(\theta )^{-1} \Strong{q} (\Strong{y}_{t}, \Strong{x}_{t}, \btheta ) \Strong{q} (\Strong{y}_{t}, \Strong{x}_{t}, \btheta )’ \right] \right)  $
$\displaystyle  $
$\displaystyle + $
$\displaystyle  \Strong{q} (\Strong{y}_{t}, \Strong{x}_{t}, \btheta ’) {\bSigma }(\theta )^{-1} \frac{{\partial }\Strong{q} (\Strong{y}_{t}, \Strong{x}_{t}, \btheta )}{{\partial } \theta _{i}}  $


\[  \frac{{\partial }{\bSigma }(\theta )}{{\partial } \theta _{i}} = \frac{2}{n} \sum _{t=1}^{n}{\mb {q} (\mb {y}_{t}, \mb {x}_{t}, \btheta ) \frac{{\partial }\mb {q} (\mb {y}_{t}, \mb {x}_{t}, \btheta )}{{\partial } \theta _{i}} }  \]

The estimator of the variance-covariance of ${\hat{\theta }}$ (COVB) for FIML can be selected with the COVBEST= option with the following arguments:


selects the crossproducts estimator of the covariance matrix (Gallant 1987, p. 473):

$\displaystyle  C = \left( \frac{1}{n} \sum _{t=1}^{n} {{\nabla }(t) {\nabla }’(t)} \right)^{-1} \nonumber  $

where ${{\nabla }(t) = [{\nabla }_{1}(t), {\nabla }_{2}(t), {\ldots },{\nabla }_{p}(t)]’}$. This is the default.


selects the generalized least squares estimator of the covariance matrix. This is computed as (Dagenais 1978)

\[  C = [ \hat{{\bZ }} ’ ( {\bSigma }(\theta )^{-1} {\otimes } I) \hat{{\bZ }}]^{-1}  \]

where ${\hat{{\bZ }} = ( \hat{Z}_{1}, \hat{Z}_{2}, {\ldots },\hat{Z}_{p} )}$ is ${ng \times p}$ and each ${\hat{Z}_{i}}$ column vector is obtained from stacking the columns of

$\displaystyle  \bU \frac{1}{n} \sum _{t=1}^{n}{\left( \frac{{\partial }\Strong{q} (\Strong{y}_{t}, \Strong{x}_{t}, \btheta )}{{\partial } y} \right)^{-1} \frac{{\partial }^{2}\Strong{q} (\Strong{y}_{t}, \Strong{x}_{t},\btheta )}{{\partial } \Strong{y}_{n}^{}{\partial } \theta _{i}} } - Q_{i} \nonumber  $

${\bU }$ is an ${n \times g}$ matrix of residuals and ${q_{i}}$ is an ${n \times g}$ matrix ${\frac{{\partial }\mb {Q}}{{\partial } \theta _{i}} }$.


selects the inverse of concentrated likelihood Hessian as an estimator of the covariance matrix. The Hessian is computed numerically, so for a large problem this is computationally expensive.

The HESSIAN= option controls which approximation to the Hessian is used in the minimization procedure. Alternate approximations are used to improve convergence and execution time. The choices are as follows:


The crossproducts approximation is used.


The generalized least squares approximation is used (default).


The Hessian is computed numerically by finite differences.

HESSIAN=GLS has better convergence properties in general, but COVBEST=CROSS produces the most pessimistic standard error bounds. When the HESSIAN= option is used, the default estimator of the variance-covariance of ${\hat{\theta }}$ is the inverse of the Hessian selected.

Multivariate t Distribution Estimation

The multivariate t distribution is specified by using the ERRORMODEL statement with the T option. Other method specifications (FIML and OLS, for example ) are ignored when the ERRORMODEL statement is used for a distribution other than normal.

The probability density function for the multivariate t distribution is

\[  P_ q = \frac{\Gamma ( \frac{df +m}{2} ) }{ (\pi * df)^{\frac{m}{2}} * \Gamma ( \frac{df}{2} ) \left| \bSigma (\sigma ) \right|^{\frac{1}{2}} } * \left( 1 + \frac{\mb {q} (\mb {y}_{t}, \mb {x}_{t} , \btheta ) \bSigma (\sigma )^{-1} \mb {q} (\mb {y}_{t} , \mb {x}_{t}, \btheta )}{df} \right)^{-\frac{df+m}{2} }  \]

where $m$ is the number of equations and $df$ is the degrees of freedom.

The maximum likelihood estimators of ${\theta }$ and ${{\sigma }}$ are the ${\hat{\theta }}$ and ${\hat{{\sigma }}}$ that minimize the negative log-likelihood function:

$\displaystyle  \Strong{l}_{n}(\btheta , \bsigma )  $
$\displaystyle = $
$\displaystyle  -\sum _{t=1}^ n \ln \left(\frac{ \Gamma ( \frac{df +m}{2} ) }{ ( \pi * df)^{\frac{m}{2}} * \Gamma ( \frac{df}{2} )} * \left( 1 + \frac{q_ t \bSigma ^{-1} q_ t}{df} \right)^{-\frac{df+m}{2} } \right) \nonumber  $
$\displaystyle  $
$\displaystyle  $
$\displaystyle  + \frac{n}{2}*\ln \left(\left|\bSigma \right|\right) - \sum _{t=1}^ n \ln \left(\left|\frac{\partial q_ t}{ \partial y_ t} \right| \right) \nonumber  $

The ERRORMODEL statement is used to request the t distribution maximum likelihood estimation. An OLS estimation is done to obtain initial parameter estimates and MSE.var estimates. Use NOOLS to turn off this initial estimation. If the errors are distributed normally, t distribution estimation produces results similar to FIML.

The multivariate model has a single shared degrees-of-freedom parameter, which is estimated. The degrees-of-freedom parameter can also be set to a fixed value. The log-likelihood value and the l$_{2}$ norm of the gradient of the negative log-likelihood function are shown in the estimation summary.

t Distribution Details

Since a variance term is explicitly specified by using the ERRORMODEL statement, ${\bSigma }(\theta )$ is estimated as a correlation matrix and $\mb {q} (\mb {y}_{t}, \mb {x}_{t}, \btheta ) $ is normalized by the variance. The gradient of the negative log-likelihood function with respect to the degrees of freedom is

$\displaystyle  \frac{\partial l_ n}{\partial df}  $
$\displaystyle = $
$\displaystyle  \frac{n m}{2 ~  df} - \frac{n}{2} \frac{\Gamma ( \frac{df +m}{2})}{\Gamma ( \frac{df +m}{2})} + \frac{n}{2} \frac{ \Gamma ( \frac{df}{2} )}{\Gamma ( \frac{df}{2} )} + \nonumber  $
$\displaystyle  $
$\displaystyle  $
$\displaystyle  0.5 \log (1+\frac{\Strong{q} \bSigma ^{-1} \Strong{q} }{df}) - \frac{0.5 (df + m)}{(1 +\frac{\Strong{q} \bSigma ^{-1} \Strong{q} }{df})} \frac{\Strong{q} \bSigma ^{-1} \Strong{q} }{df^2} \nonumber  $

The gradient of the negative log-likelihood function with respect to the parameters is

\[  \frac{\partial l_ n}{\partial \theta _ i} = \frac{0.5 (df + m)}{(1 +\mb {q} \bSigma ^{-1} \mb {q} /df)} \left[\frac{(2 ~  \mb {q} \bSigma ^{-1} \frac{\partial \mb {q} }{\partial \theta _ i} )}{df} + \mb {q} ’ \bSigma ^{-1} \frac{\partial \bSigma }{\partial \theta _ i} \bSigma ^{-1} \mb {q} \right] - \frac{n}{2} \mr {trace} ( \bSigma ^{-1} \frac{\partial \bSigma }{\partial \theta _ i})  \]


\[  \frac{{\partial }{\bSigma }(\theta )}{{\partial } \theta _{i}} = \frac{2}{n} \sum _{t=1}^{n}{\mb {q} (\mb {y}_{t}, \mb {x}_{t}, \btheta ) \frac{{\partial }\mb {q} (\mb {y}_{t}, \mb {x}_{t}, \btheta )}{{\partial } \theta _{i}} }  \]


\[  \mb {q} (\mb {y}_{t},\mb {x}_{t}, \btheta ) = \frac{ \epsilon (\theta ) }{\sqrt {h(\theta )}} \in R^{m \times n}  \]

The estimator of the variance-covariance of ${\hat{\theta }}$ (COVB) for the t distribution is the inverse of the likelihood Hessian. The gradient is computed analytically, and the Hessian is computed numerically.

Empirical Distribution Estimation and Simulation

The following SAS statements fit a model that uses least squares as the likelihood function, but represent the distribution of the residuals with an empirical cumulative distribution function (CDF). The plot of the empirical probability distribution is shown in Figure 19.24.

data t;  /* Sum of two normals  */
   format date monyy.;
   do t = 0 to 9.9 by 0.1;
      date = intnx( 'month', '1jun90'd, (t*10)-1 );
      y =  0.1 * (rannor(123)-10) +
            .5 * (rannor(123)+10);
ods select Model.Liklhood.ResidSummary

proc model data=t time=t itprint;
   dependent y;
   parm a 5;

   y = a;
   obj = resid.y * resid.y;
   errormodel y ~ general( obj )
   cdf=(empirical=(tails=( normal percent=10)));

   fit y / outsn=s out=r;
   id  date;

   solve y / data=t(where=(date='1aug98'd))
             residdata=r sdata=s
             random=200 seed=6789 out=monte ;

proc kde data=monte;
   univar y / plots=density;

Figure 19.24: Empirical PDF Plot

Empirical PDF Plot

For simulation, if the CDF for the model is not built in to the procedure, you can use the CDF=EMPIRICAL() option. This uses the sorted residual data to create an empirical CDF. For computing the inverse CDF, the program needs to know how to handle the tails. For continuous data, the tail distribution is generally poorly determined. To counter this, the PERCENT= option specifies the percentage of the observations to use in constructing each tail. The default for the PERCENT= option is 10.

A normal distribution or a t distribution is used to extrapolate the tails to infinity. The standard errors for this extrapolation are obtained from the data so that the empirical CDF is continuous.