Linear Regression Task

About the Linear Regression Task

Linear regression analysis fits a linear function to your data by using the least squares method. Using the Linear Regression task, you can perform linear regression analysis on multiple dependent and independent variables.
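To make the least squares idea concrete, here is a minimal Python sketch for a single regressor. It is only an illustration of the method, not the code that the task generates. The function name `least_squares_line` and the height and weight values are hypothetical (they are not taken from SASHELP.CLASS).

```python
def least_squares_line(x, y):
    """Return (intercept, slope) that minimize the sum of squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Corrected sum of squares of x and corrected cross-product of x and y.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical heights (inches) and weights (pounds), for illustration only.
heights = [56, 60, 64, 68]
weights = [84, 98, 112, 130]

intercept, slope = least_squares_line(heights, weights)
# Predicted weight for a height of 62 inches under the fitted line.
predicted = intercept + slope * 62
```

The fitted line always passes through the point of means, so the prediction at the mean height (62) equals the mean weight (106).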

Example: Predicting Weight Based on a Student’s Height

In this example, you want to use regression analysis to find out how well you can predict a child's weight if you know the child's height.

To create this example:

In the Tasks section, expand the Statistics folder and double-click Linear Regression. The user interface for the Linear Regression task opens.
On the Data tab, select the SASHELP.CLASS data set.

Assign columns to these roles:

Role	Column Name
Dependent variable	Weight
Classification variables	Sex
Continuous variables	Height Age

Click the Model tab and create these models:
1. Select the Height variable, and then press Ctrl and select the Age variable. Click Add.
2. Select the Height variable, and then press Ctrl and select the Sex variable. Click Cross.
To run the task, click the Run icon.

Here is a subset of the results:

Tabular Results for Linear Regression Example

Graph of the Observed Values by the Predicted Values for Weight

Fit Diagnostics for Weight in Linear Regression Example

Assigning Data to Roles

To run the Linear Regression task, you must assign a column to the Dependent variable role and a column to the Classification variables role or the Continuous variables role.

Role	Description
Roles
Dependent variable	specifies the numeric variable to use as the dependent variable for the regression analysis. You must assign a numeric variable to this role.
Classification variables	specifies categorical variables that enter the regression model through the design matrix coding.
Parameterization of Effects
Coding	specifies the parameterization method for the classification variable. Design matrix columns are created from the classification variables according to the selected coding scheme. You can select from these coding schemes: Effects coding, which specifies effect coding; GLM coding, which specifies less-than-full-rank, reference-cell coding (this coding scheme is the default); and Reference coding, which specifies reference-cell coding.
Treatment of Missing Values
An observation is excluded from the analysis when either of these conditions is met: any variable in the model contains a missing value, or any classification variable contains a missing value (regardless of whether the classification variable is used in the model).
Continuous variables	specifies the numeric covariates (regressors) for the regression model.
Additional Roles
Frequency count	lists a numeric variable whose value represents the frequency of the observation. If you assign a variable to this role, the task assumes that each observation represents n observations, where n is the value of the frequency variable. If n is not an integer, SAS truncates it. If n is less than 1 or is missing, the observation is excluded from the analysis. The sum of the frequency variable represents the total number of observations.
Weight	specifies the variable to use as a weight to perform a weighted analysis of the data.
Group analysis by	specifies to create a separate analysis for each group of observations.
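The missing-value and frequency rules in the table above can be sketched in Python as follows. This is an illustrative reading of those rules, not the task's implementation. The function `usable_rows`, the dictionary-based rows, and the column name `Count` are hypothetical.

```python
import math


def usable_rows(rows, model_vars, class_vars, freq_var=None):
    """Return (row, frequency) pairs for the observations that are kept.

    Missing values are represented by None in this sketch.
    """
    # Classification variables count even if they are not used in the model.
    needed = set(model_vars) | set(class_vars)
    kept = []
    for row in rows:
        if any(row.get(name) is None for name in needed):
            continue
        count = 1
        if freq_var is not None:
            freq = row.get(freq_var)
            # A missing frequency or a frequency below 1 excludes the row.
            if freq is None or freq < 1:
                continue
            # Non-integer frequencies are truncated.
            count = math.trunc(freq)
        kept.append((row, count))
    return kept


rows = [
    {"Height": 60, "Weight": 100, "Sex": "F", "Count": 2.7},    # kept, n = 2
    {"Height": None, "Weight": 90, "Sex": "M", "Count": 1},     # missing model variable
    {"Height": 65, "Weight": 120, "Sex": None, "Count": 3},     # missing classification variable
    {"Height": 62, "Weight": 110, "Sex": "M", "Count": 0.5},    # frequency below 1
    {"Height": 70, "Weight": 140, "Sex": "F", "Count": None},   # missing frequency
    {"Height": 58, "Weight": 95, "Sex": "M", "Count": 1},       # kept, n = 1
]

kept = usable_rows(rows, ["Weight", "Height"], ["Sex"], freq_var="Count")
total_observations = sum(count for _, count in kept)
```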

Building a Model

Requirements for Building a Model

To specify an effect, you must assign at least one column to the Classification variables role or the Continuous variables role. You can select combinations of variables to create crossed, nested, factorial, or polynomial effects. You can also specify whether to include the intercept in the model.

To create the model, use the model builder on the Model tab.

Create a Main Effect

Select the variable name in the Variables box.
Click Add to add the variable to the Model effects box.

Create Crossed Effects (Interactions)

Select two or more variables in the Variables box. To select more than one variable, press Ctrl.
Click Cross.

Create a Nested Effect

Nested effects are specified by following a main effect or crossed effect with a classification variable or list of classification variables enclosed in parentheses. The main effect or crossed effect is nested within the effects listed in parentheses. Here are examples of nested effects: B(A), C(B*A), D*E(C*B*A). In this example, B(A) is read "B nested within A."

Select the effect name in the Model effects box.
Click Nest. The Nested window opens.
Select the variable to use in the nested effect. Click Outer or Nested within Outer to specify how to create the nested effect.

Note: The Nested within Outer button is available only when a classification variable is selected.
Click Add.

Create a Full Factorial Model

Select two or more variables in the Variables box.
Click Full Factorial.

For example, if you select the Height, Weight, and Age variables and then click Full Factorial, these model effects are created: Age, Height, Weight, Age*Height, Age*Weight, Height*Weight, and Age*Height*Weight.

Create N-Way Factorial

Select two or more variables in the Variables box.
Click N-way Factorial to add these effects to the Model effects box.

For example, if you select the Height, Weight, and Age variables and then specify the value of N as 2, when you click N-way Factorial, these model effects are created: Age, Height, Weight, Age*Height, Age*Weight, and Height*Weight. If N is set to a value greater than the number of variables in the model, N is effectively set to the number of variables.
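The effects that Full Factorial and N-way Factorial produce can be enumerated with a short Python sketch. The function name `n_way_factorial` is hypothetical, and the alphabetical ordering is an assumption chosen to match the examples above. A full factorial model is the case where N equals the number of selected variables.

```python
from itertools import combinations


def n_way_factorial(variables, n):
    """List main effects and all crossings of up to n variables."""
    names = sorted(variables)
    # N larger than the number of variables is capped at that number.
    n = min(n, len(names))
    effects = []
    for order in range(1, n + 1):
        effects.extend("*".join(combo) for combo in combinations(names, order))
    return effects


two_way = n_way_factorial(["Height", "Weight", "Age"], 2)
full = n_way_factorial(["Height", "Weight", "Age"], 3)
```

Here `two_way` holds the six effects listed for N=2, and `full` adds Age*Height*Weight.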

Create Polynomial Effects of the Nth Order

Select one or more variables in the Variables box.
Specify higher-degree crossings by adjusting the number in the N field.
Click Polynomial Order=N to add the polynomial effects to the Model effects box.

For example, if you select the Age and Height variables and then you specify 3 in the N field, when you click Polynomial Order=N, these model effects are created: Age, Age*Age, Age*Age*Age, Height, Height*Height, and Height*Height*Height.
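As an illustration, the polynomial effects in this example can be generated with a few lines of Python. The function name `polynomial_effects` is hypothetical, and the alphabetical ordering is an assumption chosen to match the example.

```python
def polynomial_effects(variables, n):
    """List each variable crossed with itself up to order n."""
    effects = []
    for name in sorted(variables):
        for degree in range(1, n + 1):
            effects.append("*".join([name] * degree))
    return effects


cubic = polynomial_effects(["Age", "Height"], 3)
```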

Setting the Model Options

Option Name	Description
Methods
Confidence level	specifies the confidence level to use for the construction of confidence intervals. For example, a confidence level of 95% corresponds to a significance level of 0.05.
Statistics
You can choose to include the default statistics in the results or choose to include additional statistics.
Additional available statistics
Parameter Estimates
Standardized regression coefficients	displays the standardized regression coefficients. A standardized regression coefficient is computed by dividing a parameter estimate by the ratio of the sample standard deviation of the dependent variable to the sample standard deviation of the regressor.
Confidence limits for estimates	displays the upper and lower confidence limits for the parameter estimates.
Sums of Squares
Sequential sum of squares (Type I)	displays the sequential sums of squares (Type I SS) along with the parameter estimates for each term in the model.
Partial sum of squares (Type II)	displays the partial sums of squares (Type II SS) along with the parameter estimates for each term in the model.
Partial and Semipartial Correlations
Squared partial correlations	displays the squared partial correlation coefficients computed by using Type I and Type II sums of squares.
Squared semipartial correlations	displays the squared semipartial correlation coefficients computed by using Type I and Type II sums of squares. This value is calculated as the sum of squares divided by the corrected total sum of squares.
Diagnostics
Analysis of influence	requests a detailed analysis of the influence of each observation on the estimates and the predicted values.
Analysis of residuals	requests an analysis of the residuals. The results include the predicted values from the input data and the estimated model, the standard errors of the mean predicted and residual values, the studentized residual, and Cook’s D statistic to measure the influence of each observation on the parameter estimates.
Predicted values	calculates predicted values from the input data and the estimated model.
Multiple Comparisons
Perform multiple comparisons	specifies whether to compute and compare the least squares means of fixed effects.
Select the effects to test	specifies the effects that you want to compare. You specified these effects on the Model tab.
Method	requests a multiple comparison adjustment for the p-values and confidence limits for the differences of the least squares means. Here are the valid methods: Bonferroni, Nelson, Scheffé, Sidak, and Tukey.
Significance level	requests that a t-type confidence interval be constructed for each of the least squares means with a confidence level of 1 – number, where number is the specified significance level. The value of number must be between 0 and 1. The default value is 0.05.
Collinearity
Collinearity analysis	requests a detailed analysis of collinearity among the regressors. This includes eigenvalues, condition indices, and decomposition of the variances of the estimates with respect to each eigenvalue.
Tolerance values for estimates	produces tolerance values for the estimates. Tolerance for a variable is defined as $1 - R^2$, where $R^2$ is obtained from the regression of the variable on all other regressors in the model.
Variance inflation factors	produces variance inflation factors with the parameter estimates. Variance inflation is the reciprocal of tolerance.
Heteroscedasticity
Heteroscedasticity analysis	performs a test to confirm that the first and second moments of the model are correctly specified.
Asymptotic covariance matrix	displays the estimated asymptotic covariance matrix of the estimates under the hypothesis of heteroscedasticity and heteroscedasticity-consistent standard errors of parameter estimates.
Plots
Diagnostic and Residual Plots
By default, several diagnostic plots are included in the results. You can also specify whether to include plots of the residuals for each explanatory variable.
More Diagnostic Plots
Rstudent statistic by predicted values	plots studentized residuals by predicted values. If you select the Label extreme points option, observations with studentized residuals that lie outside the band between the reference lines are deemed outliers.
DFFITS statistic by observations	plots the DFFITS statistic by observation number. If you select the Label extreme points option, observations with a DFFITS statistic greater in magnitude than $2\sqrt{p/n}$ are deemed influential. The number of observations used is n, and the number of regressors is p.
DFBETAS statistic by observation number for each explanatory variable	produces panels of DFBETAS by observation number for the regressors in the model. You can view these plots as a panel or as individual plots. If you select the Label extreme points option, observations with a DFBETAS statistic greater in magnitude than $\frac{2}{\sqrt{n}}$ are deemed influential for that regressor. The number of observations used is n.
Label extreme points	identifies the extreme values on each different type of plot.
Scatter Plots
Fit plot for a single explanatory variable	produces a scatter plot of the data overlaid with the regression line, confidence band, and prediction band for models that depend on at most one regressor. The intercept is excluded. When the number of points exceeds the value for the Maximum number of plot points option, a heat map is displayed instead of a scatter plot.
Observed values by predicted values	produces a scatter plot of the observed values versus the predicted values.
Partial regression plots for each explanatory variable	produces partial regression plots for each regressor. If you display these plots in a panel, there is a maximum of six regressors per panel.
Maximum number of plot points	specifies the maximum number of points to include in each plot.
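Several options in the table above are defined by short formulas. The following Python sketch restates them, assuming the usual definitions: tolerance is 1 - R-square, the DFFITS cutoff is 2*sqrt(p/n), and the DFBETAS cutoff is 2/sqrt(n). The function names and the input values are hypothetical and are not part of the task.

```python
import math


def standardized_coefficient(estimate, sd_regressor, sd_dependent):
    """Divide the estimate by the ratio sd(dependent) / sd(regressor)."""
    return estimate / (sd_dependent / sd_regressor)


def tolerance(r_square):
    """r_square comes from regressing one regressor on all the others."""
    return 1.0 - r_square


def variance_inflation(r_square):
    """Variance inflation is the reciprocal of tolerance."""
    return 1.0 / tolerance(r_square)


def dffits_cutoff(n, p):
    """Assumed cutoff for labeling influential observations: 2 * sqrt(p / n)."""
    return 2.0 * math.sqrt(p / n)


def dfbetas_cutoff(n):
    """Assumed cutoff for labeling influential observations: 2 / sqrt(n)."""
    return 2.0 / math.sqrt(n)


# Hypothetical inputs, chosen so that the arithmetic is easy to check by hand.
beta_std = standardized_coefficient(3.0, sd_regressor=5.0, sd_dependent=20.0)
tol = tolerance(0.75)
vif = variance_inflation(0.75)
dffits_limit = dffits_cutoff(n=16, p=4)
dfbetas_limit = dfbetas_cutoff(n=16)
```

With these inputs, a regressor whose R-square against the other regressors is 0.75 has a tolerance of 0.25 and a variance inflation factor of 4.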

Setting the Model Selection Options

Option	Description
Model Selection
Selection method	specifies the model selection method for the model. The task performs model selection by examining whether effects should be added to or removed from the model according to the rules that are defined by the selection method. Here are the valid values for the selection methods: None, which fits the full model; Forward selection, which starts with no effects in the model and adds effects based on the value of the specified criterion; Backward elimination, which starts with all the effects in the model and deletes effects based on the value of the specified criterion; and Stepwise selection, which is similar to the forward selection method. However, in stepwise selection, effects that are already in the model do not necessarily stay there. Effects are added to the model based on the values of the specified criteria.
Criterion to add or remove effects	specifies the criterion to use to add or remove effects from the model.
Criterion to stop adding or removing effects	specifies the criterion to use to stop adding or removing effects from the model.
Select best model by	specifies the criterion to use to identify the best fitting model.
Selection Statistics
Model fit statistics	specifies which model fit statistics are displayed in the fit summary table and the fit statistics tables. If you select Default fit statistics, the default set of statistics that are displayed in these tables includes all the criteria used in model selection. Here are the additional fit statistics that you can include in the results: Adjusted R-square; Akaike’s information criterion; Akaike’s information criterion corrected for small-sample bias; Bayesian information criterion; Mallows’ Cp; PRESS statistic, which specifies the predicted residual sum of squares statistic; R-square; and Schwarz’s Bayesian information criterion.
Selection Plots
Criterion plots	displays plots for these criteria: adjusted R-square, Akaike’s information criterion, Akaike’s information criterion corrected for small-sample bias, and the criterion used to select the best fitting model.
Coefficient plots	displays these plots: a plot that shows the progression of the parameter values as the selection process proceeds, and a plot that shows the progression of the criterion used to select the best fitting model.
Details
Selection process details	specifies how much information about the selection process to include in the results. You can display a summary, details for each step of the selection process, or all of the information about the selection process.
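To illustrate how forward selection proceeds, here is a small Python sketch that uses only the standard library. It is not the task's algorithm. It assumes adjusted R-square both as the criterion to add effects and as the stopping rule, purely for illustration, and all function names and data values are hypothetical.

```python
def fit_sse(columns, y):
    """Residual sum of squares of a least squares fit with an intercept."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Augmented normal equations: (X'X | X'y).
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         + [sum(X[r][i] * y[r] for r in range(n))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(k):
            if r != c:
                factor = A[r][c] / A[c][c]
                A[r] = [a - factor * b for a, b in zip(A[r], A[c])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    return sum((y[r] - sum(b * x for b, x in zip(beta, X[r]))) ** 2
               for r in range(n))


def adjusted_r2(sse, sst, n, p):
    """p is the number of parameters, including the intercept."""
    return 1.0 - (sse / (n - p)) / (sst / (n - 1))


def forward_select(candidates, y):
    """Add the effect that most improves adjusted R-square until none does."""
    n = len(y)
    mean_y = sum(y) / n
    sst = sum((v - mean_y) ** 2 for v in y)
    selected, best_score = [], 0.0  # intercept-only model scores 0
    remaining = dict(candidates)
    while remaining:
        scores = {}
        for name in remaining:
            cols = [candidates[s] for s in selected] + [candidates[name]]
            scores[name] = adjusted_r2(fit_sse(cols, y), sst, n, len(cols) + 1)
        best = max(scores, key=scores.get)
        if scores[best] <= best_score:
            break  # no candidate improves the criterion
        selected.append(best)
        best_score = scores[best]
        del remaining[best]
    return selected, best_score


# Hypothetical data: weight is nearly linear in height.
candidates = {
    "height": [51, 56, 57, 59, 62, 63, 65, 66, 69, 72],
    "age": [11, 12, 12, 13, 13, 14, 14, 15, 15, 16],
    "noise": [3, 1, 4, 1, 5, 9, 2, 6, 5, 3],
}
weight = [50, 72, 77, 85, 99, 104, 112, 117, 131, 144]

selected, score = forward_select(candidates, weight)
```

With this data, height enters the model first because it explains almost all of the variation in weight. Whether age or noise enters afterward depends on whether it raises the adjusted R-square.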

Creating Output Data Sets

You can specify whether to create an observationwise statistics data set. This data set contains the sum of squares and cross-products. By default, this data set is saved in the Work library.

You can also choose to include these statistics in the output data set:

Cook’s D influence
the standard influence of an observation on the covariance of betas
the standard influence of an observation on predicted value (called DFFITS)
leverage
predicted values
PRESS statistic, which is the ith residual divided by $1 - h$, where h is the leverage, and where the model has been refit without the ith observation
residual
studentized residuals, which are the residuals divided by their standard errors
studentized residual with current observation removed
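The leverage and PRESS entries in the list above can be illustrated for a model with one regressor. This Python sketch assumes the usual definition in which the ith residual is divided by 1 - h, and it checks that the result equals the prediction error from refitting the model without the ith observation. The function names and data values are hypothetical.

```python
def simple_fit(x, y):
    """Least squares (intercept, slope) for a single regressor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    return mean_y - slope * mean_x, slope


def press_residuals(x, y):
    """Residual divided by (1 - leverage) for each observation."""
    n = len(x)
    intercept, slope = simple_fit(x, y)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    values = []
    for xi, yi in zip(x, y):
        residual = yi - (intercept + slope * xi)
        # Leverage of an observation in simple linear regression.
        leverage = 1.0 / n + (xi - mean_x) ** 2 / sxx
        values.append(residual / (1.0 - leverage))
    return values


def leave_one_out_errors(x, y):
    """Prediction error for each observation after refitting without it."""
    errors = []
    for i in range(len(x)):
        intercept, slope = simple_fit(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        errors.append(y[i] - (intercept + slope * x[i]))
    return errors


# Hypothetical heights and weights, for illustration only.
heights = [51, 56, 57, 59, 62, 63, 65, 66, 69, 72]
weights = [52, 70, 80, 84, 101, 100, 115, 112, 135, 140]

press = press_residuals(heights, weights)
refit = leave_one_out_errors(heights, weights)
# Sum of squared PRESS residuals: the predicted residual sum of squares.
press_statistic = sum(value ** 2 for value in press)
```

The two lists agree up to floating-point error, which is why the statistic can be computed without actually refitting the model n times.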