The PLS procedure in SAS/STAT software fits models using any one of a number of linear predictive methods, including partial least squares (PLS). Ordinary least squares regression, as implemented in SAS/STAT procedures such as PROC GLM and PROC REG, has the single goal of minimizing sample response prediction error, seeking linear functions of the predictors that explain as much variation in each response as possible. The techniques implemented in the PLS procedure have the additional goal of accounting for variation in the predictors, under the assumption that directions in the predictor space that are well sampled should provide better prediction for new observations when the predictors are highly correlated.
All of the techniques implemented in the PLS procedure work by extracting successive linear combinations of the predictors, called factors (also called components or latent vectors), which optimally address one or both of these two goals: explaining response variation and explaining predictor variation. In particular, the method of partial least squares balances the two objectives, seeking factors that explain both response and predictor variation.
The techniques implemented by the PLS procedure are:
principal components regression, which extracts factors to explain as much predictor sample variation as possible.
reduced rank regression, which extracts factors to explain as much response variation as possible. This technique, also known as (maximum) redundancy analysis, differs from multivariate linear regression only when there are multiple responses.
partial least squares regression, which balances the two objectives of explaining response variation and explaining predictor variation.
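The difference between the first two objectives can be seen in how the first factor is chosen. The following is a minimal Python sketch, not the PLS procedure's own implementation: the function names (`pls_first_weights`, `pcr_first_weights`) and the toy data are hypothetical, and the principal component is found by simple power iteration. It assumes a single response, in which case the first partial least squares weight vector is proportional to the covariance of each predictor with the response.

```python
import math

def center(columns):
    """Subtract each column's mean (input is one list per variable)."""
    centered = []
    for col in columns:
        mean = sum(col) / len(col)
        centered.append([v - mean for v in col])
    return centered

def normalize(w):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in w))
    return [v / norm for v in w]

def pls_first_weights(x_cols, y):
    """First PLS factor: weights proportional to X'y, so the response matters."""
    return normalize([sum(a * b for a, b in zip(col, y)) for col in x_cols])

def pcr_first_weights(x_cols, iterations=200):
    """First PCR factor: leading eigenvector of X'X by power iteration.

    The response is not used at all; only predictor variation matters.
    """
    p = len(x_cols)
    gram = [[sum(a * b for a, b in zip(x_cols[i], x_cols[j])) for j in range(p)]
            for i in range(p)]
    w = normalize([1.0] * p)
    for _ in range(iterations):
        w = normalize([sum(gram[i][j] * w[j] for j in range(p)) for i in range(p)])
    return w

# Hypothetical toy data: x1 varies a lot but is unrelated to y,
# while x2 varies little but determines y exactly.
x1 = [-4.0, 4.0, -4.0, 4.0]
x2 = [-1.0, -1.0, 1.0, 1.0]
y = [-1.0, -1.0, 1.0, 1.0]

x_cols = center([x1, x2])
y_centered = center([y])[0]

pls_w = pls_first_weights(x_cols, y_centered)
pcr_w = pcr_first_weights(x_cols)
print("PLS first-factor weights:", pls_w)
print("PCR first-factor weights:", pcr_w)
```

On this toy data the partial least squares weights fall entirely on x2, the predictor that drives the response, while the principal components weights fall almost entirely on x1, the predictor with the most sample variation. With a single response, reduced rank regression coincides with ordinary multivariate linear regression, so it is not shown.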
Two different formulations for partial least squares are available: the original method of Wold (1966) and the SIMPLS method of de Jong (1993).
The number of factors to extract depends on the data. Basing the model on more extracted factors improves the model fit to the observed data, but extracting too many factors can cause over-fitting, that is, tailoring the model too much to the current data, to the detriment of future predictions. The PLS procedure enables you to choose the number of extracted factors by cross validation, that is, fitting the model to part of the data and minimizing the prediction error for the unfitted part. Various methods of cross validation are available, including one-at-a-time validation, splitting the data into blocks, and test set validation.
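To make the idea of choosing the number of factors by one-at-a-time cross validation concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the procedure's algorithm: it handles a single response only, uses a basic deflation scheme in the style of the original (non-SIMPLS) formulation, and all function names and the data are hypothetical. Each observation is held out in turn, the model is fit to the rest, and the squared prediction errors are summed into a predicted residual sum of squares (PRESS) for each candidate number of factors.

```python
import math

def fit_pls1(rows, y, n_factors):
    """Fit a single-response PLS model by successive deflation."""
    n, p = len(rows), len(rows[0])
    x_mean = [sum(r[j] for r in rows) / n for j in range(p)]
    y_mean = sum(y) / n
    X = [[r[j] - x_mean[j] for j in range(p)] for r in rows]
    resid = [v - y_mean for v in y]
    factors = []
    for _ in range(n_factors):
        # Weights: covariance of each (deflated) predictor with the residual.
        w = [sum(X[i][j] * resid[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(v * v for v in w))
        if norm < 1e-12:  # nothing left to explain
            break
        w = [v / norm for v in w]
        t = [sum(X[i][j] * w[j] for j in range(p)) for i in range(n)]  # scores
        tt = sum(v * v for v in t)
        load = [sum(X[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        q = sum(resid[i] * t[i] for i in range(n)) / tt
        # Remove the part of X and y explained by this factor.
        X = [[X[i][j] - t[i] * load[j] for j in range(p)] for i in range(n)]
        resid = [resid[i] - q * t[i] for i in range(n)]
        factors.append((w, load, q))
    return x_mean, y_mean, factors

def predictions_by_factor_count(model, x_new, max_factors):
    """Predict one observation using 0, 1, ..., max_factors factors."""
    x_mean, y_mean, factors = model
    x = [v - m for v, m in zip(x_new, x_mean)]
    preds = [y_mean]  # zero factors: predict the training mean
    for w, load, q in factors:
        score = sum(a * b for a, b in zip(x, w))
        preds.append(preds[-1] + q * score)
        x = [a - score * b for a, b in zip(x, load)]
    while len(preds) < max_factors + 1:  # pad if extraction stopped early
        preds.append(preds[-1])
    return preds

def loo_press(rows, y, max_factors):
    """One-at-a-time cross validation: PRESS for each number of factors."""
    press = [0.0] * (max_factors + 1)
    for i in range(len(rows)):
        model = fit_pls1(rows[:i] + rows[i + 1:], y[:i] + y[i + 1:], max_factors)
        preds = predictions_by_factor_count(model, rows[i], max_factors)
        for k in range(max_factors + 1):
            press[k] += (y[i] - preds[k]) ** 2
    return press

# Hypothetical data: two highly correlated predictors and one noisy one.
rows = [[1, 1.1, 0.5], [2, 1.9, -0.5], [3, 3.2, 0.3], [4, 3.8, -0.2],
        [5, 5.1, 0.1], [6, 6.2, -0.4], [7, 6.9, 0.6], [8, 8.1, -0.3]]
y = [2.0, 4.1, 6.1, 7.9, 10.2, 12.1, 13.8, 16.2]

press = loo_press(rows, y, max_factors=3)
best = min(range(len(press)), key=press.__getitem__)
print("PRESS by number of factors:", press)
print("Number of factors with minimum PRESS:", best)
```

Taking the minimum PRESS is the simplest selection rule. Splitting the data into blocks or using a separate test set would change only how the held-out observations are chosen in `loo_press`.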
You can use the general linear modeling approach of the GLM procedure to specify a model for your design, allowing for general polynomial effects as well as classification or ANOVA effects. You can save the model fit by the PLS procedure in a data set and apply it to new data by using the SCORE procedure.
Note that the name "partial least squares" also applies to a more general statistical method that is not implemented in this procedure. The partial least squares method was originally developed in the 1960s by the econometrician Herman Wold (1966) for modeling "paths" of causal relation between any number of "blocks" of variables. However, the PLS procedure fits only predictive partial least squares models, with one "block" of predictors and one "block" of responses. If you are interested in fitting more general path models, you should consider using the CALIS procedure.
The document "An Introduction to Partial Least Squares" describes the methodology and includes an appendix with the procedure's syntax. "Examples Using the PLS Procedure" describes several applications, and macros for plotting various statistics produced by the PLS procedure are available for download from Tech Support.