The TPSPLINE Procedure

Penalized Least Squares Estimation

Penalized least squares estimation provides a way to balance fitting the data closely and avoiding excessive roughness or rapid variation. A penalized least squares estimate is a surface that minimizes the penalized squared error over the class of all surfaces that satisfy sufficient regularity conditions.

Define ${\mb{x}}_ i$ as a d-dimensional covariate vector from an $n\times d$ matrix $\mb{X}$, ${\mb{z}}_ i$ as a p-dimensional covariate vector, and $y_ i$ as the observation associated with $({\mb{x}}_ i, {\mb{z}}_ i)$. Assuming that the relation between ${\mb{z}}_ i$ and $y_ i$ is linear but the relation between ${\mb{x}}_ i$ and $y_ i$ is unknown, you can fit the data by using a semiparametric model as follows:

\[  y_ i=f({\mb{x}}_ i)+{\mb{z}}_ i{\bbeta } +\epsilon _ i  \]

where f is an unknown function that is assumed to be reasonably smooth, $\epsilon _ i, i=1,\cdots ,n$, are independent, zero-mean random errors, and ${\bbeta }$ is a p-dimensional unknown parameter vector.
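
To make the roles of the variables concrete, the following sketch simulates data from this semiparametric model with $d=2$ smoothing variables and $p=1$ regression variable. The variable names and the particular smooth function f are hypothetical choices used only for illustration in the Python sketches that follow.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100
    x = rng.uniform(0, 1, size=(n, 2))              # smoothing variables x_i (d = 2)
    z = rng.normal(size=(n, 1))                     # regression variable z_i  (p = 1)
    beta = np.array([0.5])                          # unknown parametric coefficient
    f_true = np.sin(2 * np.pi * x[:, 0]) * x[:, 1]  # an arbitrary smooth f, for illustration
    y = f_true + z @ beta + rng.normal(scale=0.1, size=n)   # y_i = f(x_i) + z_i*beta + eps_i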

This model consists of two parts. The ${\mb{z}}_ i{\bbeta }$ is the parametric part of the model, and the ${\mb{z}}_ i$ are the regression variables. The $ f({\mb{x}}_ i)$ is the nonparametric part of the model, and the ${\mb{x}}_ i$ are the smoothing variables. The ordinary least squares method estimates $f(\mb{x}_ i)$ and ${\bbeta }$ by minimizing the quantity:

\[  \frac{1}{n} \sum ^ n_{i=1}(y_ i-f({\mb{x}}_ i)-{\mb{z}}_ i{\bbeta })^2  \]

However, the functional space of $f(\mb{x})$ is so large that you can always find a function f that interpolates the data points. In order to obtain an estimate that fits the data well and has some degree of smoothness, you can use the penalized least squares method.

The penalized least squares function is defined as

\[  S_\lambda (f)=\frac{1}{n} \sum ^ n_{i=1} \left(y_ i-f({\mb{x}}_ i)-{\mb{z}}_ i{\bbeta }\right)^2 + \lambda J_2(f)  \]

where $J_2(f)$ is the penalty on the roughness of f and is defined, in most cases, as the integral of the square of the second derivative of f.
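
For example, for a single smoothing variable ($d=1$) and for two smoothing variables ($d=2$), the standard thin-plate forms of this penalty are

\[  J_2(f)=\int \left(\frac{d^2 f}{dx^2}\right)^2 \, dx \quad \mbox{and} \quad J_2(f)=\int \! \! \int \left[\left(\frac{\partial ^2 f}{\partial x_1^2}\right)^2 + 2\left(\frac{\partial ^2 f}{\partial x_1 \partial x_2}\right)^2 + \left(\frac{\partial ^2 f}{\partial x_2^2}\right)^2 \right] \, dx_1 \, dx_2  \]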

The first term measures the goodness of fit, and the second term measures the smoothness associated with f. The $\lambda $ term is the smoothing parameter, which governs the tradeoff between smoothness and goodness of fit. A large value of $\lambda $ penalizes rough fits heavily and pulls the estimate toward a smoother surface; conversely, a small value of $\lambda $ puts more emphasis on the goodness of fit.
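
As a minimal sketch of this criterion (reusing the simulated y and z from above, and assuming candidate values f_hat for $f({\mb{x}}_ i)$, beta_hat for ${\bbeta }$, and an already computed scalar value of $J_2(f)$), $S_\lambda (f)$ is simply the sum of the two terms:

    def penalized_criterion(y, f_hat, z, beta_hat, lam, j2_value):
        """S_lambda(f): mean squared residual plus lambda times the roughness penalty."""
        resid = y - f_hat - z @ beta_hat      # y_i - f(x_i) - z_i*beta
        return resid @ resid / len(y) + lam * j2_value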

The estimate $f_\lambda $ is selected from a reproducing kernel Hilbert space, and it can be represented as a linear combination of a sequence of basis functions. Hence, the final estimates of f can be written as

\[  \hat{f}_\lambda ({\mb{x}}_ i)=\theta _0+\sum _{j=1}^ d \theta _ j x_{ij}+\sum _{j=1}^ p \delta _ j B_ j({\mb{x}}_ i)  \]

where the $B_ j$ are basis functions whose form depends on where the data points ${\mb{x}}_ i$ are located, and $\btheta =\{ \theta _0,\dots ,\theta _ d\} $ and $\bdelta =\{ \delta _1,\dots ,\delta _ p\} $ are the coefficients that need to be estimated.
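
For instance, in the classical thin-plate spline setting with $d=2$ and the second-derivative penalty above, the basis functions centered at the data points take the radial form (up to a normalizing constant)

\[  B_ j({\mb{x}})=\| {\mb{x}}-{\mb{x}}_ j\| ^2 \log \| {\mb{x}}-{\mb{x}}_ j\|   \]

The exact parameterization used internally by the procedure may differ, but this form conveys how each $B_ j$ is tied to the location of a data point.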

For a fixed $\lambda $, the coefficients $(\btheta , \bdelta , \bbeta )$ can be estimated by solving an $n\times n $ system.
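
The sketch below illustrates this step for the simulated data, under two stated assumptions: it uses the classical thin-plate radial basis from the previous display, and for clarity it solves the full augmented system in $(\bdelta , \btheta , \bbeta )$ directly rather than the reduced $n\times n$ system. It is a minimal illustration of the computation, not the procedure's actual algorithm.

    def tps_basis(x1, x2):
        """Thin-plate radial basis eta(r) = r^2 log(r), with eta(0) taken as 0."""
        r2 = np.sum((x1[:, None, :] - x2[None, :, :]) ** 2, axis=-1)
        with np.errstate(divide="ignore", invalid="ignore"):
            e = 0.5 * r2 * np.log(r2)                # r^2 log(r) = 0.5 * r^2 * log(r^2)
        e[r2 == 0.0] = 0.0
        return e

    def fit_fixed_lambda(x, z, y, lam):
        """Estimate (delta, theta, beta) for a fixed smoothing parameter lam.

        x: (n, d) smoothing variables, z: (n, p) regression variables, y: (n,) responses.
        Returns (delta, coef), where coef stacks the intercept, theta, and beta.
        """
        n = len(y)
        E = tps_basis(x, x)                          # n x n matrix of basis values B_j(x_i)
        T = np.column_stack([np.ones(n), x, z])      # unpenalized columns: 1, x, z
        k = T.shape[1]
        # Stationarity conditions of the penalized criterion with this basis:
        #   (E + n*lam*I) delta + T coef = y,   T' delta = 0
        A = np.block([[E + n * lam * np.eye(n), T],
                      [T.T, np.zeros((k, k))]])
        sol = np.linalg.solve(A, np.concatenate([y, np.zeros(k)]))
        return sol[:n], sol[n:]

    delta_hat, coef_hat = fit_fixed_lambda(x, z, y, lam=1e-4)   # hypothetical lambda value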

The smoothing parameter can be chosen by minimizing the generalized cross validation (GCV) function.

If you write

\[  \hat{\mb{y}}={\mb{A}}(\lambda ) {\mb{y}}  \]

then $\mb{A}(\lambda )$ is referred to as the hat or smoothing matrix, and the GCV function $\mbox{GCV}(\lambda )$ is defined as

\[  \mbox{GCV}(\lambda )=\frac{(1/n)\| ({\mb{I}}-{\mb{A}}(\lambda )){\mb{y}}\| ^2}{[(1/n)\mr{tr}({\mb{I}}-{\mb{A}}(\lambda ))]^2}  \]
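
This definition translates directly into code. In the sketch below, the hat matrix is taken as given; because the map from ${\mb{y}}$ to $\hat{\mb{y}}$ is linear for a fixed $\lambda $, one simple (if inefficient) way to form ${\mb{A}}(\lambda )$ with the earlier sketch is to apply fit_fixed_lambda to each column of the identity matrix and collect the resulting fitted values.

    def gcv(A_lam, y):
        """GCV(lambda) for a given hat matrix A(lambda) and response vector y."""
        n = len(y)
        resid = y - A_lam @ y                            # (I - A(lambda)) y
        numer = resid @ resid / n                        # (1/n) ||(I - A(lambda)) y||^2
        denom = (np.trace(np.eye(n) - A_lam) / n) ** 2   # [(1/n) tr(I - A(lambda))]^2
        return numer / denom

Evaluating this score over a grid of $\lambda $ values and taking the minimizer is one way to carry out the GCV-based choice of the smoothing parameter described above.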