The NLIN Procedure

Measures of Nonlinearity and Diagnostics

A close-to-linear nonlinear regression model, in the sense of Ratkowsky (1983, 1990), is a model in which parameter estimators have properties similar to those in a linear regression model. That is, the least squares estimators of the parameters are close to being unbiased and normally distributed, and they have minimum variance.

A nonlinear regression model sometimes fails to be close to linear due to the properties of one or several parameters. When this occurs, bias in the parameter estimates can render inferences that use the reported standard errors and confidence limits invalid.

PROC NLIN provides various measures of nonlinearity. To assess the nonlinearity of a model-data combination, you can use both of the following complementary sets of measures:

  • Box’s bias (Box, 1971) and Hougaard’s skewness (Hougaard, 1982, 1985) of the least squares parameter estimates

  • curvature measures of nonlinearity (Bates and Watts, 1980).

Furthermore, PROC NLIN provides residual, leverage, and local-influence diagnostics (St. Laurent and Cook, 1993).
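
For reference, the following sketch shows how you might request these measures in a single PROC NLIN run. The data set GROWTH, the exponential model, and the starting values are hypothetical; the BIAS, HOUGAARD, and NLINMEASURES options are described in the sections that follow.

   proc nlin data=growth bias hougaard nlinmeasures;
      parms b0=100 b1=0.1;                 /* hypothetical starting values   */
      model y = b0*(1 - exp(-b1*x));       /* hypothetical exponential model */
   run;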

The following sections discuss these nonlinearity measures and diagnostics. For this material, several basic definitions are required. Let $\mb{X}$ be the Jacobian matrix for the model, $\mb{X} = \frac{\partial \mb{f}}{\partial \bbeta}$, and let $\mb{Q}$ and $\mb{R}$ be the components of the QR decomposition $\mb{X} = \mb{Q}\mb{R}$, where $\mb{Q}$ is an $(n\times n)$ orthogonal matrix. Finally, let $\mb{B}$ be the inverse of the matrix constructed from the first p rows of the $(n \times p)$-dimensional matrix $\mb{R}$ (that is, $\mb{B} = \mb{R}^{-1}_{p}$). Next define

\[ \left[\mb{H}_j\right]_{kl} = \frac{\partial^2 \mb{f}_j}{\partial \bbeta_k\, \partial \bbeta_l} \]
\[ \left[\mb{U}_j\right]_{kl} = \sum_{mn} \mb{B}'_{km} \left[\mb{H}_j\right]_{mn} \mb{B}_{nl} \]
\[ \left[\mb{A}_j\right]_{kl} = \sqrt{p \times \mr{mse}}\, \sum_{m} \mb{Q}'_{jm} \left[\mb{U}_m\right]_{kl}\, , \]

where $\mb{H}$, $\mb{U}$, and the acceleration array $\mb{A}$ are three-dimensional $(n \times p \times p)$ arrays. The first p faces of the acceleration array constitute a $(p \times p \times p)$ parameter-effects array, and the last $(n-p)$ faces constitute the $((n-p) \times p \times p)$ intrinsic curvature array (Bates and Watts, 1980). The preceding and subsequent quantities are computed at the least squares parameter estimators.

Box’s Measure of Bias

The degree to which parameter estimators exhibit close-to-linear behavior can be assessed with Box’s bias (Box, 1971) and Hougaard’s measure of skewness (Hougaard, 1982, 1985). The bias and percentage bias measures are available through the BIAS option in the PROC NLIN statement. Box’s bias measure is defined as

\[ \widehat{\mr{E}}\left[\widehat{\theta} - \theta\right] = -\frac{\sigma^2}{2} \left(\mb{X}'\bW\mb{X}\right)^{-1} \sum_{i=1}^{n} w_i\, \mb{x}'_i\, \text{Tr}\left(\left(\mb{X}'\bW\mb{X}\right)^{-1} \left[\mb{H}_i\right]\right) \]

where $\sigma ^2 = \hbox{mse}$ if the SIGSQ option is not set. Otherwise, $\sigma ^2$ is the value you set with the SIGSQ option. $\mb {W}$ is the diagonal weight matrix specified with the _WEIGHT_ variable (or the identity matrix if _WEIGHT_ is not defined) and $\left[\mb {H}_{i}\right]$ is the $(p \times p)$ Hessian matrix at the ith observation. In the case of unweighted least squares, the bias formula can be expressed in terms of the acceleration array $\mb {A}$,

\[ \widehat{\mr{E}}\left[\widehat{\theta}_i - \theta_i\right] = -\frac{\sigma^2}{2\, p \times \hbox{mse}} \sum_{j,\,k=1}^{p} \mb{B}_{ij} \left[\mb{A}_j\right]_{kk} \]

As the preceding formulas illustrate, the bias depends solely on the parameter-effects array, thereby permitting its reduction through reparameterization. Example 63.4 shows how changing the parameterization of a four-parameter logistic model can reduce the bias. Ratkowsky (1983, p. 21) recommends that you consider reparameterization if the percentage bias exceeds 1%.
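
As a sketch, the following statements request Box's bias and percentage bias for the hypothetical model above. The SIGSQ= value shown is an arbitrary illustration of overriding the default $\sigma^2 = \hbox{mse}$.

   proc nlin data=growth bias sigsq=0.04;  /* BIAS requests Box's bias; 0.04 is a hypothetical sigma-square */
      parms b0=100 b1=0.1;                 /* hypothetical starting values */
      model y = b0*(1 - exp(-b1*x));       /* hypothetical model           */
   run;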

Hougaard’s Measure of Skewness

In addition to Box’s bias, Hougaard’s measure of skewness, $g_{1i}$ (Hougaard, 1982, 1985), is also provided in PROC NLIN to assess the close-to-linear behavior of parameter estimators. This measure is available through the HOUGAARD option in the PROC NLIN statement. Hougaard’s skewness measure for the ith parameter is based on the third central moment, defined as

\[  \mr {E} \left[ \widehat{\theta }_ i - \mr {E}\left(\widehat{\theta }_ i\right)\right]^3 = - \left(\sigma ^2\right)^2 \sum _{jkl} [\mb {L}]_{ij}[\mb {L}]_{ik}[\mb {L}]_{il} \left( \left[\mb {V}_ j\right]_{kl}+[\mb {V}_ k]_{jl}+[\mb {V}_ l]_{jk}\right)  \]

where the sum is a triple sum over the number of parameters and

\[  \mb {L} = \left(\mb {X}'\mb {X}\right)^{-1} = \left( \frac{\partial \mb {f}}{\partial \bbeta ^\prime } \frac{\partial \mb {f}}{\partial \bbeta } \right)^{-1}  \]

The term $[\mb {L}]_{ij}$ denotes the value in row i, column j of the matrix $\mb {L}$. (Hougaard (1985) uses superscript notation to denote elements in this inverse.) The matrix $\mb {V}$ is a three-dimensional $(p \times p \times p)$ array

\[ \left[\mb{V}_j\right]_{kl} = \sum_{m=1}^{n} \frac{\partial F_m}{\partial \beta_j}\, \frac{\partial^2 F_m}{\partial \beta_k\, \partial \beta_l} \]

The third central moment is then normalized using the standard error as

\[  G_{1i} = \mr {E}\left[ \widehat{\theta }_ i - \mr {E}(\widehat{\theta }_ i)\right]^3 / \left(\sigma ^2 \times \left[ \mb {L} \right]_{ii} \right)^{3/2}  \]

The previous expressions depend on the unknown values of the parameters and on the residual variance $\sigma ^2$. In order to evaluate the Hougaard measure in a particular data set, the NLIN procedure computes

\[ g_{1i} = \widehat{\mr{E}}\left[\widehat{\theta}_i - \mr{E}(\widehat{\theta}_i)\right]^3 \Big/ \left(\mr{mse} \times [\widehat{\mb{L}}]_{ii}\right)^{3/2} \]
\[ \widehat{\mr{E}}\left[\widehat{\theta}_i - \mr{E}(\widehat{\theta}_i)\right]^3 = -\mr{mse}^2 \sum_{jkl} [\widehat{\mb{L}}]_{ij}[\widehat{\mb{L}}]_{ik}[\widehat{\mb{L}}]_{il} \left([\widehat{\mb{V}}_j]_{kl} + [\widehat{\mb{V}}_k]_{jl} + [\widehat{\mb{V}}_l]_{jk}\right) \]

Following Ratkowsky (1990, p. 28), the parameter $\theta _ i$ is considered to be very close to linear, reasonably close, skewed, or quite nonlinear according to the absolute value of the Hougaard measure $|g_{1i}|$ being less than 0.1, between 0.1 and 0.25, between 0.25 and 1, or greater than 1, respectively.
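
A minimal sketch for requesting Hougaard's skewness measure, again with the hypothetical model used earlier:

   proc nlin data=growth hougaard;         /* request Hougaard's skewness measure */
      parms b0=100 b1=0.1;                 /* hypothetical starting values        */
      model y = b0*(1 - exp(-b1*x));       /* hypothetical model                  */
   run;

The reported values of $g_{1i}$ can then be judged against the Ratkowsky thresholds listed above.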

Relative Curvature Measures of Nonlinearity

Bates and Watts (1980) formulated the maximum parameter-effects and maximum intrinsic curvature measures of nonlinearity to assess the close-to-linear behavior of nonlinear models. Ratkowsky (1990) notes that of the two curvature components in a nonlinear model, the parameter-effects curvature is typically larger. It is this component that you can affect by changing the parameterization of a model. PROC NLIN provides these two measures of curvature both through the STATS plot-option and through the NLINMEASURES option in the PROC NLIN statement.

The maximum parameter-effects and intrinsic curvatures are defined, in a compact form, as

\[ \mb{T}^{\tau} = \max_{\theta} \left\| \theta'\, \mb{A}^{\tau}\, \theta \right\| \]
\[ \mb{T}^{\eta} = \max_{\theta} \left\| \theta'\, \mb{A}^{\eta}\, \theta \right\| \]

where $\mb{T}^{\tau}$ and $\mb{T}^{\eta}$ denote the maximum parameter-effects and intrinsic curvatures, while $\mb{A}^{\tau}$ and $\mb{A}^{\eta}$ stand for the parameter-effects and intrinsic curvature arrays. The maximization is carried out over unit-length vectors $\theta$ in the parameter space (Bates and Watts, 1980). In line with Bates and Watts (1980), PROC NLIN takes $10^{-4}$ as the convergence tolerance for the maximum intrinsic and parameter-effects curvatures. Note that the preceding matrix products involve contraction of the faces of the three-dimensional acceleration arrays with the normalized parameter vector, $\theta$. The corresponding expressions for the RMS (root mean square) parameter-effects and intrinsic curvatures can be found in Bates and Watts (1980).

The statistical significance of $\mb {T}^{\tau }$ and $\mb {T}^{\eta }$ and the corresponding RMS values can be assessed by comparing these values with $1/\sqrt {F}$, where F is the upper $\alpha \times 100\% $ quantile of an F distribution with p and $n-p$ degrees of freedom (Bates and Watts, 1980).
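
The cutoff $1/\sqrt{F}$ is easy to compute with the SAS FINV function. The following DATA step is a minimal sketch; the values of $\alpha$, p, and n are hypothetical.

   data _null_;
      alpha  = 0.05;                       /* significance level (hypothetical)     */
      p      = 4;                          /* number of parameters (hypothetical)   */
      n      = 24;                         /* number of observations (hypothetical) */
      F      = finv(1 - alpha, p, n - p);  /* upper alpha quantile of F(p, n-p)     */
      cutoff = 1 / sqrt(F);                /* compare the curvatures to this value  */
      put 'Curvature cutoff 1/sqrt(F) = ' cutoff;
   run;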

One motivation for fitting a nonlinear model in a different parameterization is to obtain a particular interpretation and to give parameter estimators more close-to-linear behavior. Example 63.4 shows how changing the parameterization of a four-parameter logistic model can reduce the parameter-effects curvature and can yield a useful parameter interpretation at the same time. In addition, Example 63.6 shows a nonlinear model with a high intrinsic curvature and the corresponding diagnostics.

Leverage in Nonlinear Regression

In contrast to linear regression, there are several measures of leverage in nonlinear regression. Furthermore, in nonlinear regression, the effect of a change in the ith response on the ith predicted value might depend on both the size of the change and the ith response itself (St. Laurent and Cook, 1992). As a result, some observations might show superleverage, namely leverages in excess of one (St. Laurent and Cook, 1992).

PROC NLIN provides two measures of leverage, tangential and Jacobian, through the PLOTS option in the PROC NLIN statement and the H= and J= options in the OUTPUT statement. Tangential leverage, $\mb{H}_{i}$, is based on approximating the nonlinear model with a linear model that parameterizes the tangent plane at the least squares parameter estimators. In contrast, Jacobian leverage, $\mb{J}_{i}$, is defined as the instantaneous rate of change in the ith predicted value with respect to the ith response (St. Laurent and Cook, 1992).

The mathematical formulas for tangential and Jacobian leverages are

\[ \mb{H}_i = w_i\, \mb{x}_i \left(\mb{X}'\mb{W}\mb{X}\right)^{-1} \mb{x}'_i \]
\[ \mb{J}_i = w_i\, \mb{x}_i \left(\mb{X}'\mb{W}\mb{X} - [\mb{W\,e}][\mb{H}]\right)^{-1} \mb{x}'_i\, , \]

where $\mb {e}$ is the vector of residuals, $\mb {W}$ is the diagonal weight matrix if you specify the special variable _WEIGHT_ and otherwise the identity matrix, and i indexes the corresponding quantities for the ith observation. The brackets $[.][.]$ indicate column multiplication as defined in Bates and Watts (1980). The preceding formula for tangential leverage holds if the gradient, Marquardt, or Gauss methods are used. For the Newton method, the tangential leverage is set equal to the Jacobian leverage.

In a model with a large intrinsic curvature, the Jacobian and tangential leverages can be very different. In fact, the two leverages are identical only if the model provides an exact fit to the data ($\mb {e} = 0$) or the model is intrinsically linear (St. Laurent and Cook, 1993). This is also illustrated by the leverage plot and nonlinearity measures provided in Example 63.6.
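
As a sketch, the following OUTPUT statement writes both leverages to a data set by using the H= and J= options; the data set, model, and variable names are hypothetical.

   proc nlin data=growth nlinmeasures;
      parms b0=100 b1=0.1;                 /* hypothetical starting values      */
      model y = b0*(1 - exp(-b1*x));       /* hypothetical model                */
      output out=lev h=tanlev j=jaclev;    /* tangential and Jacobian leverages */
   run;

Observations for which the two leverages differ markedly point to regions where the intrinsic curvature matters.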

Local Influence in Nonlinear Regression

St. Laurent and Cook (1993) suggest using $l_{\hbox{max}}$, the direction that yields the maximum normal curvature, to assess the local influence of an additive perturbation to the response variable on the estimation of the parameters and variance of a nonlinear model. Defining the normal curvature components

\[ \mb{C}_\theta = \max_l\, \frac{2}{\sigma^2}\, l'\, \mb{J}\, l \]
\[ \mb{C}_\sigma = \max_l\, \frac{4}{\sigma^2}\, l'\, \mb{P}_e\, l \]

where $\mb{J}$ is the Jacobian leverage matrix and $\mb{P}_e = \mb{e}\mb{e}'/(\mb{e}'\mb{e})$, you choose the $l_{\hbox{max}}$ that results in the maximum of the two curvature components (St. Laurent and Cook, 1993). PROC NLIN provides $l_{\hbox{max}}$ through the PLOTS option in the PROC NLIN statement and the LMAX= option in the OUTPUT statement. Example 63.6 shows a plot of $l_{\hbox{max}}$ for a model with high intrinsic curvature.
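
A minimal sketch that saves $l_{\hbox{max}}$ with the LMAX= option of the OUTPUT statement (data set, model, and variable names hypothetical):

   proc nlin data=growth;
      parms b0=100 b1=0.1;                 /* hypothetical starting values */
      model y = b0*(1 - exp(-b1*x));       /* hypothetical model           */
      output out=infl lmax=lmax_dir;       /* elements of the direction of maximum local influence */
   run;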

Residuals in Nonlinear Regression

If a nonlinear model is intrinsically nonlinear, using the residuals $\mb {e}=\mb {y} - \mb {\hat{y}}$ for diagnostics can be misleading (Cook and Tsai, 1985). This is due to the fact that in correctly specified intrinsically nonlinear models, the residuals have nonzero means and different variances, even when the original error terms have identical variances. Furthermore, the covariance between the residuals and the predicted values tends to be negative semidefinite, complicating the interpretation of plots based on $\mb {e}$ (Cook and Tsai, 1985).

Cook and Tsai (1985) propose projected residuals to overcome these shortcomings; the ordinary residuals are henceforth called raw residuals to differentiate them from their projected counterparts. Projected residuals have zero means and are uncorrelated with the predicted values. In fact, projected residuals are identical to the raw residuals in intrinsically linear models.

PROC NLIN provides raw and projected residuals, along with their standardized forms. In addition, the mean or expectation of the raw residuals is available. These can be accessed with the PLOTS option in the PROC NLIN statement and the OUTPUT statement options PROJRES=, PROJSTUDENT=, RESEXPEC=, RESIDUAL= and STUDENT=.
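
For example, the following OUTPUT statement saves the raw, projected, and standardized residuals along with the expectation of the raw residuals; the data set, model, and output variable names are hypothetical.

   proc nlin data=growth;
      parms b0=100 b1=0.1;                 /* hypothetical starting values */
      model y = b0*(1 - exp(-b1*x));       /* hypothetical model           */
      output out=resid residual=e_raw   student=e_raw_std
                       projres=e_proj   projstudent=e_proj_std
                       resexpec=e_mean;    /* expectation of the raw residuals */
   run;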

Denote the projected residuals by $\mb {e}_{p}$ and the expectation of the raw residuals by $\mr {E}[\mb {e}]$. Then

\[ \mb{e}_{p} = \left(\mb{I}_n - \mb{P}_{xh}\right)\mb{e} \]
\[ \mr{E}\left[\mb{e}_i\right] = -\frac{\sigma^2}{2}\, \sum_{j=1}^{n} \tilde{\mb{P}}_{x,\,ij}\, \text{Tr}\left(\left[\mb{H}_j\right]\left(\mb{X}'\mb{X}\right)^{-1}\right) \]

where $\mb{e}_i$ is the raw residual for the ith observation, $\mb{I}_n$ is an n-dimensional identity matrix, $\mb{P}_{xh}$ is the projector onto the column space of $\left(\mb{X}|\mb{H}\right)$, and $\tilde{\mb{P}}_x = \mb{I}_n - \mb{P}_x$. The preceding formulas are general, with the projectors defined to take any weighting into account. In unweighted least squares, $\mr{E}\left[\mb{e}\right]$ reduces to

\[ \mr{E}\left[\mb{e}\right] = -\frac{1}{2}\, \sigma^2\, \tilde{\mb{Q}}\, \mb{a} \]

with $\tilde{\mb {Q}}$ being the last $n-p$ columns of the $\mb {Q}$ matrix in the QR decomposition of $\mb {X}$ and the $(n-p)$ dimensional vector $\mb {a}$ being defined in terms of the intrinsic acceleration array

\[ \mb{a}_i = \sum_{j=1}^{p} \left[\mb{A}_{i+p}\right]_{jj} \]

Standardization of the projected residuals requires the variance of the projected residuals. This is estimated using the formula (Cook and Tsai, 1985)

\[ \sigma^2_{p} = \frac{\mb{e}'_{p}\, \mb{e}_{p}}{\text{Tr}\left(\mb{I}_n - \mb{P}_{xh}\right)} \]

The standardized raw and projected residuals, denoted by $\tilde{\mb {e}}$ and $\tilde{\mb {e}}_{p}$ respectively, are given by

\[ \tilde{\mb{e}} = \frac{\sqrt{w_i}\, \mb{e}}{\sigma \sqrt{1 - \mb{P}_{x,\,ii}}} \]
\[ \tilde{\mb{e}}_{p} = \frac{\sqrt{w_i}\, \mb{e}_{p}}{\sigma \sqrt{1 - \mb{P}_{xh,\,ii}}} \]

The use of raw and projected residuals for diagnostics in nonlinear regression is illustrated in Example 63.6.

Profiling Parameters and Assessing the Influence of Observations on Parameter Estimates

The global measures of nonlinearity, discussed in the preceding section, are very useful for assessing the overall nonlinearity of the model. However, the impact of global nonlinearity on inference regarding subsets of the parameter set cannot be easily determined (Cook and Tsai, 1985). The impact of the nonlinearity on the uncertainty of individual parameters can be efficiently described by profile t plots and confidence curves (Bates and Watts, 1988; Cook and Weisberg, 1990).

A profile t plot for parameter $\theta$ is a plot of the likelihood ratio pivotal statistic, $\tau(\theta)$, against the corresponding Wald pivotal statistic, $\sigma(\theta)$ (Bates and Watts, 1988). The likelihood ratio pivotal statistic is defined as

\[  \tau (\theta ) = \mbox{sign}(\theta - \hat{\theta })\,  \mbox{L}(\theta )  \]

with

\[  \mbox{L}(\theta ) = \left( \frac{\mbox{SSE}(\theta , \tilde{\Theta }) - \mbox{SSE}(\hat{\theta }, \hat{\Theta }) }{\mbox{mse}} \right)^{1/2}  \]

where $\theta$ is the profile parameter and $\Theta$ refers to the remaining parameters. $\mr{SSE}(\theta, \tilde{\Theta})$ is the sum of squared errors in which the profile parameter $\theta$ is constrained at a given value and $\tilde{\Theta}$ is the least squares estimate of $\Theta$ conditional on that value of $\theta$. In contrast, $\mr{SSE}(\hat{\theta}, \hat{\Theta})$ is the sum of squared errors for the full model. For linear models, $\tau(\theta)$ matches $\sigma(\theta)$, which is defined as

\[  \sigma (\theta ) = \frac{\theta - \hat{\theta }}{\mbox{stderr}_\theta }  \]

where $\theta $ is the constrained value, $\hat{\theta }$ is the estimated value for the parameter, and $\mbox{stderr}_\theta $ is the standard error for the parameter. Usually a profile t plot is overlaid with a reference line that passes through the origin and has a slope of 1. PROC NLIN follows this convention.

A confidence curve for a particular parameter is useful for validating Wald-based confidence intervals for the parameter against likelihood-based confidence intervals (Cook and Weisberg, 1990). A confidence curve contains a scatter plot of the constrained parameter value versus $\mr {L}(\theta )$. The Wald-based confidence intervals are overlaid as two straight lines that pass through $(0, \hat{\theta })$ with a slope of $\pm \mbox{stderr}_\theta $. Hence, for different levels of significance, you can easily compare the Wald-based confidence intervals against the corresponding confidence intervals that are based on the likelihood ratio. Cook and Weisberg (1990) recommend that you report a single Wald-based confidence interval only if there is a good agreement between the Wald and likelihood intervals up to at least the 99% confidence level.

Compared to local influence, the leave-one-out method performs a more complete analysis of the influence of observations on the values of parameter estimates. In this method, jackknife resampling removes each observation in turn and fits the model to the remaining data set. Hence, a data set with n observations will have n corresponding data sets with n–1 observations. The impact of each observation on a parameter is assessed by the absolute relative percentage change in the value of the parameter compared with the reference value from the full data.
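
The leave-one-out analysis can also be reproduced directly with BY-group processing, as in the following sketch. The data set, model, and starting values are hypothetical, and the ODS table name ParameterEstimates is assumed to be the table that holds the parameter estimates.

   data loo;                                  /* build n replicates, each omitting one observation */
      set growth nobs=n;
      do replicate = 1 to n;
         if replicate ne _n_ then output;
      end;
   run;

   proc sort data=loo;
      by replicate;
   run;

   proc nlin data=loo;
      by replicate;                           /* one leave-one-out fit per replicate */
      parms b0=100 b1=0.1;                    /* hypothetical starting values        */
      model y = b0*(1 - exp(-b1*x));          /* hypothetical model                  */
      ods output ParameterEstimates=loo_est;  /* collect the n sets of estimates (table name assumed) */
   run;

The estimates in the collected data set can then be compared with the full-data estimates to compute the absolute relative percentage change for each parameter.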