The theoretical foundations for the thin-plate smoothing spline are described in Duchon (1976, 1977) and Meinguet (1979). Further results and applications are given in Wahba and Wendelberger (1980), Hutchinson and Bischof (1983), and Seaman and Hutchinson (1985).

Suppose that $\mathcal{H}_m$ is a space of functions whose partial derivatives of total order $m$ are in $L_2(E^d)$, where $E^d$ is the domain of $\mathbf{x}$.

Now, consider the data model

\[ y_i = f(\mathbf{x}_i) + \epsilon_i, \qquad i = 1, \ldots, n \]

where $f \in \mathcal{H}_m$.

Using the notation from the section Penalized Least Squares Estimation, for a fixed $\lambda$, estimate $f$ by minimizing the penalized least squares function

\[ \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - f(\mathbf{x}_i) - \mathbf{z}_i'\boldsymbol{\beta}\bigr)^2 + \lambda J_m(f) \]

$\lambda J_m(f)$ is the penalty term that enforces smoothness on $f$. There are several ways to define $J_m(f)$. For the thin-plate smoothing spline, with $\mathbf{x}$ of dimension $d$, define $J_m(f)$ as

\[ J_m(f) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \sum \frac{m!}{\alpha_1! \cdots \alpha_d!} \left( \frac{\partial^m f}{\partial x_1^{\alpha_1} \cdots \partial x_d^{\alpha_d}} \right)^2 dx_1 \cdots dx_d \]

where $\sum_{i=1}^{d} \alpha_i = m$. Under this definition, $J_m(f)$ gives zero penalty to some functions. The space that is spanned by the set of polynomials that contribute zero penalty is called the polynomial space. The dimension $M$ of the polynomial space is a function of the dimension $d$ and the order $m$ of the smoothing penalty, $M = \binom{m+d-1}{d}$.
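For quick reference, the dimension formula can be evaluated directly. The following Python sketch (the function name is illustrative) uses only the standard library:

```python
from math import comb

def polynomial_space_dim(d, m):
    """Dimension M of the polynomial (zero-penalty) space of the thin-plate
    penalty J_m for inputs of dimension d: M = C(m + d - 1, d)."""
    return comb(m + d - 1, d)

# For the common thin-plate spline case d = 2, m = 2:
print(polynomial_space_dim(2, 2))  # prints: 3 (spanned by 1, x1, x2)
```

For $d = 1$, $m = 2$ (the cubic smoothing spline), the formula gives $M = 2$, corresponding to the constant and linear terms.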

Given the condition that $2m - d > 0$, the function that minimizes the penalized least squares criterion has the form

\[ \hat{f}(\mathbf{x}) = \sum_{j=1}^{M} \theta_j \phi_j(\mathbf{x}) + \sum_{i=1}^{n} \delta_i E_m(\mathbf{x} - \mathbf{x}_i) \]

where $\boldsymbol{\theta}$ and $\boldsymbol{\delta}$ are vectors of coefficients to be estimated. The $M$ functions $\phi_j$ are linearly independent polynomials that span the space of functions for which $J_m(f)$ is zero. The basis functions $E_m$ are defined as

\[ E_m(\mathbf{s}) = \begin{cases} \dfrac{(-1)^{m+1+d/2}}{2^{2m-1}\pi^{d/2}(m-1)!\,(m-d/2)!} \,\|\mathbf{s}\|^{2m-d} \ln(\|\mathbf{s}\|) & \text{for } d \text{ even} \\[2ex] \dfrac{\Gamma(d/2-m)}{2^{2m}\pi^{d/2}(m-1)!} \,\|\mathbf{s}\|^{2m-d} & \text{for } d \text{ odd} \end{cases} \]

When $d = 2$ and $m = 2$, then $M = 3$, $\phi_1(\mathbf{x}) = 1$, $\phi_2(\mathbf{x}) = x_1$, and $\phi_3(\mathbf{x}) = x_2$. $E_m$ is as follows:

\[ E_2(\mathbf{s}) = \frac{1}{2^3 \pi} \,\|\mathbf{s}\|^2 \ln(\|\mathbf{s}\|) \]

For the sake of simplicity, the formulas and equations that follow assume m = 2. See Wahba (1990) and Bates et al. (1987) for more details.
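For $d = 2$ and $m = 2$, the basis function $E_2$ is easy to compute directly. The following NumPy sketch (the function name is illustrative) uses the limiting value $E_2(\mathbf{0}) = 0$ for coincident points:

```python
import numpy as np

def E2(s):
    """Thin-plate spline basis for d = 2, m = 2:
    E2(s) = (1 / (2**3 * pi)) * ||s||**2 * ln(||s||),
    with the limiting value 0 at s = 0."""
    r = np.linalg.norm(s)
    if r == 0.0:
        return 0.0
    return r**2 * np.log(r) / (8.0 * np.pi)

print(E2(np.array([1.0, 0.0])))  # prints: 0.0, since ln(1) = 0
```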

Duchon (1976) showed that $f_\lambda$ can be represented as

\[ f_\lambda(\mathbf{x}_i) = \theta_0 + \sum_{j=1}^{d} \theta_j x_{ij} + \sum_{j=1}^{n} \delta_j E_2(\mathbf{x}_i - \mathbf{x}_j) \]

where $E_2(\mathbf{s}) = \frac{1}{2^3\pi}\|\mathbf{s}\|^2 \ln(\|\mathbf{s}\|)$ for $d = 2$. For derivations of $E_m(\mathbf{s})$ for other values of $d$, see Villalobos and Wahba (1987).

If you define the $n \times n$ matrix $\mathbf{K}$ with elements $\mathbf{K}_{ij} = E_2(\mathbf{x}_i - \mathbf{x}_j)$ and the $n \times M$ matrix $\mathbf{T}$ with elements $\mathbf{T}_{ij} = \phi_j(\mathbf{x}_i)$, the goal is to find vectors of coefficients $\boldsymbol{\beta}$, $\boldsymbol{\theta}$, and $\boldsymbol{\delta}$ that minimize

\[ S_\lambda(\boldsymbol{\beta}, \boldsymbol{\theta}, \boldsymbol{\delta}) = \frac{1}{n} \|\mathbf{y} - \mathbf{T}\boldsymbol{\theta} - \mathbf{K}\boldsymbol{\delta} - \mathbf{Z}\boldsymbol{\beta}\|^2 + \lambda\, \boldsymbol{\delta}'\mathbf{K}\boldsymbol{\delta} \]

A unique solution is guaranteed if the matrix $\mathbf{T}$ is of full rank and $\boldsymbol{\delta}'\mathbf{K}\boldsymbol{\delta} \ge 0$.

If $\boldsymbol{\alpha} = (\boldsymbol{\theta}', \boldsymbol{\beta}')'$ and $\mathbf{X} = (\mathbf{T}\ \mathbf{Z})$, the expression for $S_\lambda$ becomes

\[ \frac{1}{n} \|\mathbf{y} - \mathbf{X}\boldsymbol{\alpha} - \mathbf{K}\boldsymbol{\delta}\|^2 + \lambda\, \boldsymbol{\delta}'\mathbf{K}\boldsymbol{\delta} \]

The coefficients $\boldsymbol{\alpha}$ and $\boldsymbol{\delta}$ can be obtained by solving

\[ (\mathbf{K} + n\lambda \mathbf{I}_n)\boldsymbol{\delta} + \mathbf{X}\boldsymbol{\alpha} = \mathbf{y}, \qquad \mathbf{X}'\boldsymbol{\delta} = \mathbf{0} \]

To compute $\boldsymbol{\alpha}$ and $\boldsymbol{\delta}$, let the QR decomposition of $\mathbf{X}$ be

\[ \mathbf{X} = (\mathbf{Q}_1\ \mathbf{Q}_2) \begin{pmatrix} \mathbf{R} \\ \mathbf{0} \end{pmatrix} \]

where $(\mathbf{Q}_1\ \mathbf{Q}_2)$ is an orthogonal matrix and $\mathbf{R}$ is an upper triangular matrix, with $\mathbf{X}'\mathbf{Q}_2 = \mathbf{0}$ (Dongarra et al., 1979).

Since $\mathbf{X}'\boldsymbol{\delta} = \mathbf{0}$, $\boldsymbol{\delta}$ must be in the column space of $\mathbf{Q}_2$. Therefore, $\boldsymbol{\delta}$ can be expressed as $\boldsymbol{\delta} = \mathbf{Q}_2\boldsymbol{\gamma}$ for a vector $\boldsymbol{\gamma}$. Substituting $\boldsymbol{\delta} = \mathbf{Q}_2\boldsymbol{\gamma}$ into the preceding equation and multiplying through by $\mathbf{Q}_2'$ gives

\[ \mathbf{Q}_2' (\mathbf{K} + n\lambda \mathbf{I}_n) \mathbf{Q}_2 \boldsymbol{\gamma} = \mathbf{Q}_2' \mathbf{y} \]

or

\[ \boldsymbol{\delta} = \mathbf{Q}_2 \boldsymbol{\gamma} = \mathbf{Q}_2 \bigl[ \mathbf{Q}_2' (\mathbf{K} + n\lambda \mathbf{I}_n) \mathbf{Q}_2 \bigr]^{-1} \mathbf{Q}_2' \mathbf{y} \]

The coefficient $\boldsymbol{\alpha}$ can be obtained by solving

\[ \mathbf{R}\boldsymbol{\alpha} = \mathbf{Q}_1' \bigl[ \mathbf{y} - (\mathbf{K} + n\lambda \mathbf{I}_n)\boldsymbol{\delta} \bigr] \]

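The QR-based solve can be sketched in NumPy for $d = 2$ and $m = 2$. This sketch omits the semiparametric covariates $\mathbf{Z}$ (so $\mathbf{X} = \mathbf{T}$), and the data are synthetic, for illustration only; the final check confirms that the solution satisfies both estimating equations, $\mathbf{X}'\boldsymbol{\delta} = \mathbf{0}$ and $(\mathbf{K} + n\lambda\mathbf{I}_n)\boldsymbol{\delta} + \mathbf{X}\boldsymbol{\alpha} = \mathbf{y}$:

```python
import numpy as np

def tps_coefficients(x, y, lam):
    """Solve for delta and alpha of a thin-plate smoothing spline (d = 2, m = 2)
    by the QR approach above. Sketch without semiparametric covariates, so X = T."""
    n = len(y)
    # K[i, j] = E2(x_i - x_j), with the limiting value 0 on the diagonal
    r = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)
    K = np.where(r > 0, r**2 * np.log(np.where(r > 0, r, 1.0)), 0.0) / (8 * np.pi)
    X = np.column_stack([np.ones(n), x])        # T: spanned by 1, x1, x2
    Q, R = np.linalg.qr(X, mode="complete")     # X = (Q1 Q2)(R; 0)
    M = X.shape[1]
    Q1, Q2 = Q[:, :M], Q[:, M:]
    Mmat = K + n * lam * np.eye(n)
    gamma = np.linalg.solve(Q2.T @ Mmat @ Q2, Q2.T @ y)
    delta = Q2 @ gamma                          # delta = Q2 gamma
    alpha = np.linalg.solve(R[:M], Q1.T @ (y - Mmat @ delta))  # R alpha = Q1'(y - M delta)
    return alpha, delta, K, X

# Toy data, for illustration only
rng = np.random.default_rng(0)
x = rng.uniform(size=(20, 2))
y = np.sin(2 * np.pi * x[:, 0]) + rng.normal(scale=0.1, size=20)
alpha, delta, K, X = tps_coefficients(x, y, lam=1e-3)
# The solution satisfies both estimating equations
print(np.allclose(X.T @ delta, 0),
      np.allclose((K + 20 * 1e-3 * np.eye(20)) @ delta + X @ alpha, y))  # prints: True True
```

Working in the $\mathbf{Q}_2$ basis keeps the inner system symmetric and enforces $\mathbf{X}'\boldsymbol{\delta} = \mathbf{0}$ automatically.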
The influence matrix $\mathbf{A}(\lambda)$ is defined as

\[ \hat{\mathbf{y}} = \mathbf{A}(\lambda)\mathbf{y} \]

and has the form

\[ \mathbf{A}(\lambda) = \mathbf{I}_n - n\lambda\, \mathbf{Q}_2 \bigl[ \mathbf{Q}_2' (\mathbf{K} + n\lambda \mathbf{I}_n) \mathbf{Q}_2 \bigr]^{-1} \mathbf{Q}_2' \]

Similar to the regression case, if you consider the trace of $\mathbf{A}(\lambda)$ as the degrees of freedom for the model and the trace of $(\mathbf{I}_n - \mathbf{A}(\lambda))$ as the degrees of freedom for the error, the estimate $\hat{\sigma}^2$ can be represented as

\[ \hat{\sigma}^2 = \frac{\mathrm{RSS}(\lambda)}{\mathrm{tr}(\mathbf{I}_n - \mathbf{A}(\lambda))} \]

where $\mathrm{RSS}(\lambda) = \|(\mathbf{I}_n - \mathbf{A}(\lambda))\mathbf{y}\|^2$ is the residual sum of squares. Theoretical properties of these estimates have not yet been published. However, good numerical results in simulation studies have been described by several authors. For more information, see O'Sullivan and Wong (1987); Nychka (1986a, 1986b, 1988); and Hall and Titterington (1987).

Viewing the spline model as a Bayesian model, Wahba (1983) proposed Bayesian confidence intervals for smoothing spline estimates as

\[ \hat{f}_\lambda(\mathbf{x}_i) \pm z_{\alpha/2} \sqrt{\hat{\sigma}^2 a_{ii}(\lambda)} \]

where $a_{ii}(\lambda)$ is the $i$th diagonal element of the $\mathbf{A}(\lambda)$ matrix and $z_{\alpha/2}$ is the $1 - \alpha/2$ quantile of the standard normal distribution. The confidence intervals are interpreted as intervals "across the function," as opposed to pointwise intervals.
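A sketch of these intervals at the data points, assuming NumPy, $d = 2$, $m = 2$, $\mathbf{X} = \mathbf{T}$, synthetic data, and the fixed 95% normal quantile in place of a quantile function:

```python
import numpy as np

def tps_influence(x, lam):
    """Influence matrix of the d = 2, m = 2 thin-plate spline (sketch, X = T)."""
    n = x.shape[0]
    r = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)
    K = np.where(r > 0, r**2 * np.log(np.where(r > 0, r, 1.0)), 0.0) / (8 * np.pi)
    Q, _ = np.linalg.qr(np.column_stack([np.ones(n), x]), mode="complete")
    Q2 = Q[:, 3:]                               # M = 3 when d = 2, m = 2
    return np.eye(n) - n * lam * Q2 @ np.linalg.inv(
        Q2.T @ (K + n * lam * np.eye(n)) @ Q2) @ Q2.T

rng = np.random.default_rng(2)
x = rng.uniform(size=(25, 2))                   # toy data, for illustration only
y = np.cos(3 * x[:, 0]) + rng.normal(scale=0.1, size=25)
A = tps_influence(x, lam=1e-3)
fhat = A @ y                                    # fitted values at the data points
sigma2 = np.sum((y - fhat) ** 2) / np.trace(np.eye(25) - A)
z = 1.959963985                                 # z_{alpha/2} for 95% intervals
half = z * np.sqrt(sigma2 * np.diag(A))         # z * sqrt(sigma2_hat * a_ii)
lower, upper = fhat - half, fhat + half
print(np.all(lower < upper))                    # prints: True
```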

For SCORE data sets, the hat matrix $\mathbf{A}(\lambda)$ is not available. To compute the Bayesian confidence interval for a new point $\mathbf{x}^{\text{new}}$, let

\[ \boldsymbol{\phi}^{\text{new}} = \bigl( \phi_1(\mathbf{x}^{\text{new}}), \ldots, \phi_M(\mathbf{x}^{\text{new}}) \bigr)' \]

and let $\boldsymbol{\xi}^{\text{new}}$ be an $n \times 1$ vector with $i$th entry

\[ \xi_i^{\text{new}} = E_2(\mathbf{x}^{\text{new}} - \mathbf{x}_i) \]

When $d = 2$ and $m = 2$, $E_2$ is computed with

\[ E_2(\mathbf{s}) = \frac{1}{2^3\pi} \,\|\mathbf{s}\|^2 \ln(\|\mathbf{s}\|) \]

$\boldsymbol{\phi}^{\text{new}}$ is a vector of evaluations of $\mathbf{x}^{\text{new}}$ by the polynomials $\phi_j$ that span the functional space where $J_m(f)$ is zero. The details for $\phi_j$, $E_2$, and $M$ are discussed in the previous section. Wahba (1983) showed that the Bayesian posterior variance of $\hat{f}_\lambda(\mathbf{x}^{\text{new}})$ can be expressed in terms of $\boldsymbol{\phi}^{\text{new}}$, $\boldsymbol{\xi}^{\text{new}}$, and the matrices $\mathbf{T}$, $\mathbf{K}$, and $(\mathbf{K} + n\lambda \mathbf{I}_n)$ from the fit; the confidence interval for the new point is then

\[ \hat{f}_\lambda(\mathbf{x}^{\text{new}}) \pm z_{\alpha/2} \sqrt{\mathrm{Var}\bigl( \hat{f}_\lambda(\mathbf{x}^{\text{new}}) \bigr)} \]
Suppose that you fit a spline estimate, which consists of a true function $f$ and a random error term $\epsilon_i$, to experimental data. In repeated experiments, it is likely that about $100(1-\alpha)\%$ of the confidence intervals cover the corresponding true values, although some values are covered every time and other values are not covered by the confidence intervals most of the time. This effect is more pronounced when the true curve or surface has small regions of particularly rapid change.

The quantity $\lambda$ is called the smoothing parameter; it controls the balance between the goodness of fit and the smoothness of the final estimate.

A large $\lambda$ heavily penalizes the $m$th derivative of the function, thus forcing $\hat{f}^{(m)}$ close to 0. A small $\lambda$ places less of a penalty on rapid change in $\hat{f}^{(m)}(\mathbf{x})$, resulting in an estimate that tends to interpolate the data points.

The smoothing parameter greatly affects the analysis, and it should be selected with care. One method is to perform several analyses with different values of $\lambda$ and compare the resulting final estimates.

A more objective way to select the smoothing parameter is to use the "leave-out-one" cross validation function, which is an approximation of the predicted mean squared error. A generalized version of the leave-out-one cross validation function was proposed by Wahba (1990) and is easy to calculate. This generalized cross validation (GCV) function is defined as

\[ \mathrm{GCV}(\lambda) = \frac{\frac{1}{n} \|(\mathbf{I}_n - \mathbf{A}(\lambda)) \mathbf{y}\|^2}{\bigl[ \frac{1}{n} \mathrm{tr}(\mathbf{I}_n - \mathbf{A}(\lambda)) \bigr]^2} \]

The justification for using the GCV function to select $\lambda$ relies on asymptotic theory. Thus, you cannot expect good results for very small sample sizes or when there is not enough information in the data to separate the model from the error component. Simulation studies suggest that for independent and identically distributed Gaussian noise, you can obtain reliable estimates of $\lambda$ for $n$ greater than 25 or 30. Note that, even for large values of $n$, in extreme Monte Carlo simulations there might be a small percentage of unwarranted extreme estimates in which $\hat{\lambda} \rightarrow 0$ or $\hat{\lambda} \rightarrow \infty$ (Wahba, 1983). Generally, if $\sigma^2$ is known to within an order of magnitude, the occasional extreme case can be readily identified. As $n$ gets larger, the effect becomes weaker.
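The GCV function can be evaluated and minimized over a grid of $\lambda$ values. The following NumPy sketch again assumes $d = 2$, $m = 2$, $\mathbf{X} = \mathbf{T}$, and synthetic data; it is an illustration of the formula rather than a production implementation:

```python
import numpy as np

def gcv(x, y, lam):
    """GCV(lambda) = (1/n)||(I - A)y||^2 / [(1/n) tr(I - A)]^2 for the
    d = 2, m = 2 thin-plate spline (sketch; no semiparametric covariates)."""
    n = x.shape[0]
    r = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)
    K = np.where(r > 0, r**2 * np.log(np.where(r > 0, r, 1.0)), 0.0) / (8 * np.pi)
    Q, _ = np.linalg.qr(np.column_stack([np.ones(n), x]), mode="complete")
    Q2 = Q[:, 3:]                               # M = 3 when d = 2, m = 2
    A = np.eye(n) - n * lam * Q2 @ np.linalg.inv(
        Q2.T @ (K + n * lam * np.eye(n)) @ Q2) @ Q2.T
    resid = y - A @ y
    return float(resid @ resid / n) / (np.trace(np.eye(n) - A) / n) ** 2

rng = np.random.default_rng(3)
x = rng.uniform(size=(40, 2))                   # toy data, for illustration only
y = np.sin(2 * np.pi * x[:, 0]) * x[:, 1] + rng.normal(scale=0.1, size=40)
grid = 10.0 ** np.arange(-8, 1)                 # candidate lambda values
scores = [gcv(x, y, lam) for lam in grid]
best = grid[int(np.argmin(scores))]             # lambda minimizing GCV on the grid
print(best)
```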

The GCV function is fairly robust against nonhomogeneity of variances and non-Gaussian errors (Villalobos and Wahba, 1987). Andrews (1988) has provided favorable theoretical results when variances are unequal. However, this selection method is likely to give unsatisfactory results when the errors are highly correlated.

The GCV value might be suspect when $\lambda$ is extremely small, because computed values might become indistinguishable from zero. In practice, calculations with $\lambda$ at or near 0 can cause numerical instabilities that result in an unsatisfactory solution. Simulation studies have shown that a $\lambda$ with $\log_{10}(n\lambda) < -8$ is small enough that the final estimate based on this $\lambda$ almost interpolates the data points. A GCV value based on such a $\lambda$ might not be accurate.