Penalized least squares estimation provides a way to balance fitting the data closely and avoiding excessive roughness or rapid variation. A penalized least squares estimate is a surface that minimizes the penalized squared error over the class of all surfaces that satisfy sufficient regularity conditions.
Define $\mathbf{x}_i$ as a d-dimensional covariate vector from an $n \times d$ matrix $\mathbf{X}$, $\mathbf{z}_i$ as a p-dimensional covariate vector, and $y_i$ as the observation associated with $(\mathbf{x}_i, \mathbf{z}_i)$. Assuming that the relation between $\mathbf{z}_i$ and $y_i$ is linear but the relation between $\mathbf{x}_i$ and $y_i$ is unknown, you can fit the data by using a semiparametric model as follows:
            
![\[  y_ i=f({\mb{x}}_ i)+{\mb{z}}_ i{\bbeta } +\epsilon _ i  \]](images/statug_tpspline0007.png)
where f is an unknown function that is assumed to be reasonably smooth, $\epsilon_i$, $i = 1, \ldots, n$, are independent, zero-mean random errors, and $\boldsymbol{\beta}$ is a p-dimensional unknown parameter vector.
            
This model consists of two parts. The $\mathbf{z}_i \boldsymbol{\beta}$ term is the parametric part of the model, and the $\mathbf{z}_i$ are the regression variables. The $f(\mathbf{x}_i)$ term is the nonparametric part of the model, and the $\mathbf{x}_i$ are the smoothing variables. The ordinary least squares method estimates $f(\mathbf{x}_i)$ and $\boldsymbol{\beta}$ by minimizing the quantity:
            
![\[  \frac{1}{n} \sum ^ n_{i=1}(y_ i-f({\mb{x}}_ i)-{\mb{z}}_ i{\bbeta })^2  \]](images/statug_tpspline0013.png)
However, the functional space of $f(\mathbf{x})$ is so large that you can always find a function f that interpolates the data points. In order to obtain an estimate that fits the data well and has some degree of smoothness, you can use the penalized least squares method.
            
The penalized least squares function is defined as
![\[  S_\lambda (f)=\frac{1}{n} \sum ^ n_{i=1} \left(y_ i-f({\mb{x}}_ i)-{\mb{z}}_ i{\bbeta }\right)^2 + \lambda J_2(f)  \]](images/statug_tpspline0015.png)
where $J_2(f)$ is the penalty on the roughness of f and is defined, in most cases, as the integral of the square of the second derivative of f.
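To make the penalty concrete, the following sketch writes it out for one and two smoothing variables. The integration limits and the two-dimensional form are assumptions based on the standard thin-plate smoothing penalty, not something stated in the text above.

```latex
% One smoothing variable (d = 1): integrated squared second derivative.
J_2(f) = \int_{-\infty}^{\infty} \left( f''(x) \right)^2 \, dx

% Two smoothing variables (d = 2): the usual thin-plate form (assumed here).
J_2(f) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty}
  \left[
    \left( \frac{\partial^2 f}{\partial x_1^2} \right)^2
    + 2 \left( \frac{\partial^2 f}{\partial x_1 \, \partial x_2} \right)^2
    + \left( \frac{\partial^2 f}{\partial x_2^2} \right)^2
  \right] dx_1 \, dx_2
```

In both cases the penalty is zero for any linear function of the smoothing variables, so linear trends are never penalized.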
            
The first term measures the goodness of fit and the second term measures the smoothness associated with f. The $\lambda$ term is the smoothing parameter, which governs the tradeoff between smoothness and goodness of fit. When $\lambda$ is large, it more heavily penalizes rougher fits. Conversely, a small value of $\lambda$ puts more emphasis on the goodness of fit.
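The tradeoff can be seen in a small numerical sketch. The code below is not the thin-plate spline estimator. It is a simplified discrete analogue, assumed here for illustration only, in which f is represented by its values at equally spaced points, there is no parametric part, and the roughness penalty is the sum of squared second differences. The function names and the sample data are hypothetical.

```python
def second_differences(f):
    """Return the second differences f[i-1] - 2*f[i] + f[i+1]."""
    return [f[i - 1] - 2.0 * f[i] + f[i + 1] for i in range(1, len(f) - 1)]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def penalized_fit(y, lam):
    """Minimize (1/n) * sum((y_i - f_i)^2) + lam * sum(second differences of f squared).

    Setting the gradient to zero gives (I + n * lam * D^T D) f = y,
    where D is the second-difference operator.
    """
    n = len(y)
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for k in range(n - 2):
        stencil = {k: 1.0, k + 1: -2.0, k + 2: 1.0}  # one row of D
        for i, di in stencil.items():
            for j, dj in stencil.items():
                a[i][j] += n * lam * di * dj
    return solve(a, y)


def fit_error(y, f):
    """Mean squared residual: the goodness-of-fit term."""
    return sum((yi - fi) ** 2 for yi, fi in zip(y, f)) / len(y)


def roughness(f):
    """Sum of squared second differences: the discrete roughness penalty."""
    return sum(d * d for d in second_differences(f))


# Hypothetical noisy observations around an increasing trend.
y = [0.0, 1.2, 0.7, 2.1, 1.6, 3.2, 2.4, 4.1]
for lam in (0.0, 0.01, 10.0):
    f = penalized_fit(y, lam)
    print(lam, fit_error(y, f), roughness(f))
```

With `lam = 0.0` the fit reproduces the observations exactly. As `lam` grows, the printed roughness shrinks while the fit error grows, which is the tradeoff that the smoothing parameter controls.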
            
The estimate $\hat{f}_\lambda$ is selected from a reproducing kernel Hilbert space, and it can be represented as a linear combination of a sequence of basis functions. Hence, the final estimate of f can be written as
            
![\[  \hat{f}_\lambda ({\mb{x}}_ i)=\theta _0+\sum _{j=1}^ d \theta _ j {\mb{x}_{i}}_ j+\sum _{j=1}^ p \delta _ j B_ j({\mb{x}}_{j})  \]](images/statug_tpspline0019.png)
where $B_j$ is the basis function, which depends on where the data $\mathbf{x}_j$ are located, and $\theta_j$ and $\delta_j$ are the coefficients that need to be estimated.
            
For a fixed $\lambda$, the coefficients $(\boldsymbol{\theta}, \boldsymbol{\delta}, \boldsymbol{\beta})$ can be estimated by solving an $n \times n$ system.
            
The smoothing parameter can be chosen by minimizing the generalized cross validation (GCV) function.
If you write
![\[  \hat{\mb{y}}={\mb{A}}(\lambda ) {\mb{y}}  \]](images/statug_tpspline0026.png)
then $\mathbf{A}(\lambda)$ is referred to as the hat or smoothing matrix, and the GCV function $\mathrm{GCV}(\lambda)$ is defined as
            
![\[  \mbox{GCV}(\lambda )=\frac{(1/n)\| ({\mb{I}}-{\mb{A}}(\lambda )){\mb{y}}\| ^2}{[(1/n)\mr{tr}({\mb{I}}-{\mb{A}}(\lambda ))]^2}  \]](images/statug_tpspline0029.png)
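The GCV formula can be evaluated directly once a hat matrix is available. The sketch below computes it for any linear smoother and then picks the smoothing parameter from a small grid. The hat matrix used here is an assumption made for illustration: it comes from a simple discrete second-difference smoother, not from the thin-plate spline fit, and the function names, grid, and data are hypothetical.

```python
def gcv(hat, y):
    """GCV = (1/n) * ||(I - A) y||^2 / [(1/n) * tr(I - A)]^2 for hat matrix A."""
    n = len(y)
    fitted = [sum(hat[i][j] * y[j] for j in range(n)) for i in range(n)]
    rss = sum((y[i] - fitted[i]) ** 2 for i in range(n))
    trace_resid = sum(1.0 - hat[i][i] for i in range(n))
    if trace_resid == 0.0:
        raise ValueError("tr(I - A) is zero; GCV is undefined for an interpolating fit")
    return (rss / n) / (trace_resid / n) ** 2


def invert(a):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [1.0 if i == j else 0.0 for j in range(n)] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [v / p for v in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [row[n:] for row in m]


def smoother_matrix(n, lam):
    """Hat matrix A(lam) = (I + n * lam * D^T D)^(-1) of a discrete smoother.

    D is the second-difference operator; this stands in for the spline hat matrix.
    """
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for k in range(n - 2):
        stencil = {k: 1.0, k + 1: -2.0, k + 2: 1.0}  # one row of D
        for i, di in stencil.items():
            for j, dj in stencil.items():
                a[i][j] += n * lam * di * dj
    return invert(a)


# Hypothetical observations and a coarse grid of candidate smoothing parameters.
y = [0.0, 1.2, 0.7, 2.1, 1.6, 3.2, 2.4, 4.1]
candidates = [10.0 ** k for k in range(-4, 3)]
best_lam = min(candidates, key=lambda lam: gcv(smoother_matrix(len(y), lam), y))
print("lambda chosen by GCV on the grid:", best_lam)
```

A grid search is used only to keep the sketch short. Any one-dimensional minimizer over the smoothing parameter would serve the same purpose, provided it avoids the value zero, where the fit interpolates the data and the denominator vanishes.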