The HPPLS Procedure

PROC HPPLS Statement

  • PROC HPPLS <options>;

The PROC HPPLS statement invokes the HPPLS procedure. Table 57.1 summarizes the options available in the PROC HPPLS statement.

Table 57.1: PROC HPPLS Statement Options

Option          Description

Basic Options
DATA=           Specifies the input data set
NAMELEN=        Limits the length of effect names

Model Fitting Options
CVTEST          Requests that van der Voet’s (1994) randomization-based model comparison test be performed
METHOD=         Specifies the general factor extraction method to be used
NFAC=           Specifies the number of factors to extract
NOCENTER        Suppresses centering of the responses and predictors before fitting
NOCVSTDIZE      Suppresses re-centering and rescaling of the responses and predictors when cross-validating
NOSCALE         Suppresses scaling of the responses and predictors before fitting

Output Options
CENSCALE        Displays the centering and scaling information
DETAILS         Displays the details of the fitted model
NOCLPRINT       Limits or suppresses the display of class levels
NOPRINT         Suppresses ODS output
VARSS           Displays the amount of variation accounted for in each response and predictor


The following list provides details about these options.

CENSCALE

lists the centering and scaling information for each response and predictor.

CVTEST <(cvtest-options)>

requests that van der Voet’s (1994) randomization-based model comparison test be performed to test models that have different numbers of extracted factors against the model that minimizes the predicted residual sum of squares. For more information, see the section Test Set Validation. You can also specify the following cvtest-options in parentheses after the CVTEST option:

PVAL=n

specifies the cutoff probability for declaring an insignificant difference. By default, PVAL=0.10.

STAT=PRESS | T2

specifies the test statistic for the model comparison. You can specify either T2, for Hotelling’s $T^2$ statistic, or PRESS, for the predicted residual sum of squares. By default, STAT=T2.

NSAMP=n

specifies the number of randomizations to perform. By default, NSAMP=1000.

SEED=n

specifies the seed value for the random number stream. If you do not specify a seed, or if you specify a value less than or equal to 0, the seed is generated from the time of day on the computer’s clock.

Analyses that use the same (positive) seed are not completely reproducible if they are executed on a different number of threads, because the random number streams in separate threads are independent. You can control the number of threads on which the HPPLS procedure executes by using SAS system options or by using the PERFORMANCE statement in the HPPLS procedure.
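The following sketch shows one way the CVTEST option and its cvtest-options might be combined. The data set Work.Sample, its variables, and the Role variable are hypothetical and are generated here only so that the step can run. The PARTITION and PERFORMANCE statement forms are assumptions that you should check against their own sections.

```sas
/* Hypothetical data: one response, three predictors, and a role variable */
data work.sample;
   call streaminit(1);
   length Role $5;
   do i = 1 to 200;
      x1 = rand('normal');
      x2 = rand('normal');
      x3 = rand('normal');
      y  = x1 - 2*x2 + rand('normal', 0, 0.5);
      /* Assign roughly 30% of the observations to the test set */
      if rand('uniform') < 0.3 then Role = 'TEST';
      else Role = 'TRAIN';
      output;
   end;
   drop i;
run;

/* Compare models by using the PRESS statistic, 500 randomizations,  */
/* a cutoff probability of 0.15, and a fixed positive seed           */
proc hppls data=work.sample cvtest(stat=press pval=0.15 nsamp=500 seed=12345);
   model y = x1-x3;
   partition rolevar=Role(train='TRAIN' test='TEST');
   /* Fixing the thread count keeps the random number streams comparable */
   performance nthreads=4;
run;
```

Fixing both the seed and the number of threads is the combination that the note above suggests is needed for repeatable results.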

DATA=SAS-data-set

names the input SAS data set to be used by PROC HPPLS. The default is the most recently created data set.

If PROC HPPLS executes in distributed mode, the input data are distributed to memory on the appliance nodes and analyzed in parallel, unless the data are already distributed in the appliance database. In that case, PROC HPPLS reads the data alongside the distributed database. For more information about the various execution modes, see the section Processing Modes in SAS/STAT 14.1 User's Guide: High-Performance Procedures. For more information about the alongside-the-database model, see the section Alongside-the-Database Execution in SAS/STAT 14.1 User's Guide: High-Performance Procedures.
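As a minimal sketch of naming the input data set explicitly, the following statements generate a small hypothetical data set (Work.Sample, with responses y1 and y2 and predictors x1 through x5; none of these names come from this chapter) and pass it to the procedure with the DATA= option. All other options are left at their defaults.

```sas
/* Hypothetical data: 100 observations, 5 predictors, 2 responses */
data work.sample;
   call streaminit(2024);
   array x{5} x1-x5;
   do i = 1 to 100;
      do j = 1 to 5;
         x{j} = rand('normal');
      end;
      y1 = x1 + 0.5*x2 + rand('normal', 0, 0.1);
      y2 = x3 - x4     + rand('normal', 0, 0.1);
      output;
   end;
   drop i j;
run;

/* Name the input data set explicitly instead of relying on */
/* the most recently created data set                        */
proc hppls data=work.sample;
   model y1 y2 = x1-x5;
run;
```

Specifying DATA= explicitly avoids surprises when another step creates a data set between the DATA step and the procedure call.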

DETAILS

lists the details of the fitted model for each successive factor. The listed details are different for different extraction methods. For more information, see the section Displayed Output.

METHOD=PLS<(PLS-options)> | SIMPLS | PCR | RRR

specifies the general factor extraction method to be used. You can specify the following values:

PCR

requests principal components regression.

PLS<(PLS-options)>

requests partial least squares. You can also specify the following optional PLS-options in parentheses after METHOD=PLS:

ALGORITHM=NIPALS | SVD | EIG

names the specific algorithm used to compute extracted PLS factors. NIPALS requests the usual iterative NIPALS algorithm, SVD bases the extraction on the singular value decomposition of $\mathbf{X}'\mathbf{Y}$, and EIG bases the extraction on the eigenvalue decomposition of $\mathbf{Y}'\mathbf{X}\mathbf{X}'\mathbf{Y}$. ALGORITHM=SVD is the most accurate but least efficient approach. By default, ALGORITHM=NIPALS.

EPSILON=n

specifies the convergence criterion for the NIPALS algorithm. By default, EPSILON=$10^{-12}$.

MAXITER=n

specifies the maximum number of iterations for the NIPALS algorithm. By default, MAXITER=200.

RRR

requests reduced rank regression.

SIMPLS

requests the straightforward implementation of a statistically inspired modification of the partial least squares (SIMPLS) method of De Jong (1993).

By default, METHOD=PLS.
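The following sketch illustrates how the METHOD= values and PLS-options described above might be written. The data set and variable names are hypothetical, and the EPSILON= and MAXITER= values are arbitrary illustrations rather than recommendations.

```sas
/* Hypothetical data: 100 observations, 5 predictors, 2 responses */
data work.sample;
   call streaminit(99);
   array x{5} x1-x5;
   do i = 1 to 100;
      do j = 1 to 5;
         x{j} = rand('normal');
      end;
      y1 = x1 + x2 + rand('normal', 0, 0.2);
      y2 = x4 - x5 + rand('normal', 0, 0.2);
      output;
   end;
   drop i j;
run;

/* PLS with the default NIPALS algorithm, a looser convergence */
/* criterion, and a higher iteration limit                      */
proc hppls data=work.sample method=pls(algorithm=nipals epsilon=1e-10 maxiter=500) nfac=2;
   model y1 y2 = x1-x5;
run;

/* PLS with extraction based on the singular value decomposition */
proc hppls data=work.sample method=pls(algorithm=svd) nfac=2;
   model y1 y2 = x1-x5;
run;

/* SIMPLS takes no suboptions */
proc hppls data=work.sample method=simpls nfac=2;
   model y1 y2 = x1-x5;
run;
```

EPSILON= and MAXITER= are described above as settings for the NIPALS algorithm, so this sketch specifies them only together with ALGORITHM=NIPALS.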

NAMELEN=number

specifies the length to which long effect names are shortened. By default, NAMELEN=20. If you specify a value less than 20 for number, the default is used.

NFAC=n

specifies the number of factors to extract. The default is $\min\{15, p, N\}$, where $p$ is the number of predictors (or the number of dependent variables when METHOD=RRR) and $N$ is the number of runs (observations). Most applications do not need this many factors. Extracting too many factors can lead to an overfit model (one that matches the training data too well), which sacrifices predictive ability. Thus, if you use the default, you should either specify the PARTITION statement to select the appropriate number of factors for the final model, or treat the analysis as preliminary and examine the results to determine the appropriate number of factors for a subsequent analysis.
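As a worked illustration of the default (the values of p and N here are hypothetical), a model that has p = 27 predictors and is fit to N = 32 observations extracts

$$\min\{15, 27, 32\} = 15$$

factors, whereas a model that has p = 8 predictors and the same N = 32 observations extracts

$$\min\{15, 8, 32\} = 8$$

factors. When METHOD=RRR, the number of dependent variables takes the place of p in the same calculation.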

NOCENTER

suppresses centering of the responses and predictors before fitting. This option is useful if the analysis variables are already centered and scaled. For more information, see the section Centering and Scaling.

NOCLPRINT<=number>

suppresses the display of the "Class Level Information" table if you do not specify number. If you specify number, the values of the classification variables are displayed only for variables whose number of levels is less than number. Specifying a number helps to reduce the size of the "Class Level Information" table if some classification variables have a large number of levels.
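The following sketch shows how NOCLPRINT=number might be used when one classification variable has many levels. The data set Work.Sales and its variables are hypothetical: Region has 4 levels and Store has 50.

```sas
/* Hypothetical data: 50 stores observed in each of 4 regions */
data work.sales;
   call streaminit(7);
   length Region $5;
   do Store = 1 to 50;
      do k = 1 to 4;
         Region = choosec(k, 'North', 'South', 'East', 'West');
         x1 = rand('normal');
         y  = 2*x1 + 0.1*Store + rand('normal');
         output;
      end;
   end;
   drop k;
run;

/* Display level values only for classification variables */
/* that have fewer than 10 levels                          */
proc hppls data=work.sales noclprint=10;
   class Region Store;
   model y = Region Store x1;
run;
```

Under the rule described above, the level values of Region (4 levels) would be listed and those of Store (50 levels) would not.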

NOCVSTDIZE

suppresses re-centering and rescaling of the responses and predictors before each model is fit in the cross validation. For more information, see the section Centering and Scaling.

NOPRINT

suppresses the normal display of results. This option is useful when you want only the output statistics saved in a data set. This option temporarily disables the Output Delivery System (ODS). For more information, see Chapter 20: Using the Output Delivery System.

NOSCALE

suppresses scaling of the responses and predictors before fitting. This option is useful if the analysis variables are already centered and scaled. For more information, see the section Centering and Scaling.
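One situation in which NOCENTER and NOSCALE might be combined is when the analysis variables have been standardized in an earlier step. The following sketch assumes that PROC STDIZE with METHOD=STD is an acceptable way to do that standardization. The data set and variable names are hypothetical.

```sas
/* Hypothetical raw data with predictors on very different scales */
data work.raw;
   call streaminit(11);
   do i = 1 to 100;
      x1 = rand('normal', 50, 10);
      x2 = rand('normal', 5, 2);
      x3 = rand('normal', 0, 1);
      y  = 0.2*x1 - x2 + x3 + rand('normal');
      output;
   end;
   drop i;
run;

/* Center each variable to mean 0 and scale it to standard deviation 1 */
proc stdize data=work.raw out=work.std method=std;
   var y x1-x3;
run;

/* The variables are already centered and scaled, so ask the */
/* procedure not to repeat either step                        */
proc hppls data=work.std nocenter noscale;
   model y = x1-x3;
run;
```

Whether to standardize before the procedure or let the procedure do it is a design choice. Doing it beforehand makes the transformation explicit and reusable in other steps.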

VARSS

lists, in addition to the average response and predictor sum of squares accounted for by each successive factor, the amount of variation accounted for in each response and predictor.