IMSTAT Procedure (Analytics)

GLM Statement

The GLM statement is used to fit models that are similar to those handled by the GLM procedure. There are some important differences in syntax and functionality between the GLM procedure and the GLM statement in IMSTAT.

Syntax

GLM dependent-variable <(class-variables)> = model-effects </ options>;

Required Arguments

dependent-variable

specifies the variable to model. This variable is also referred to as the response variable.

model-effects

specifies a list of variables to use for modeling the dependent variable.

Optional Argument

class-variables

specifies a list of variables to use as classification variables. The variables in this list take the place of the CLASS statement in traditional SAS procedures.

GLM Statement Options

ALLIDVARS

requests that all variables in the input table are treated as ID variables when a scoring table is produced. In other words, if this option is specified, all variables from the input table, including computed columns, are transferred to the scoring table. This option has no effect unless you specify the SCORE option.

ALPHA=number

specifies a number between 0 and 1 from which to determine the confidence level for approximate confidence intervals of the parameter estimates. The default is α = 0.05, which leads to 100 × (1 − α)% = 95% confidence limits for the parameter estimates.

Default

0.05

CHISQ

requests that p-values in the table of parameter estimates and Type III tests are determined as probabilities under a χ² distribution. This means that instead of two-sided p-values based on the t distribution, the p-values are computed as two-sided probabilities under a standard normal distribution. Similarly, the assumption of F distributions with finite denominator degrees of freedom is ignored in lieu of assuming infinite degrees of freedom.

CI

specifies to add confidence intervals to the table of parameter estimates. The confidence level is 100 × (1 − α)%, where α is determined by the ALPHA= option. The default value is α = 0.05, which corresponds to 95% confidence limits.

Default

0.05
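
As a minimal sketch of how the CI and ALPHA= options combine, the following statements assume that a copy of Sashelp.Class is loaded in memory and is reachable through a libref named lasr. The libref and table name are assumptions for illustration.

```sas
proc imstat;
   /* lasr.class is an assumed in-memory copy of Sashelp.Class */
   table lasr.class;
   /* CI adds confidence limits to the parameter estimates;      */
   /* ALPHA=0.1 requests 100 x (1 - 0.1)% = 90% limits           */
   glm height = weight / ci alpha=0.1;
run;
quit;
```

If ALPHA= is omitted, the CI option alone produces 95% limits.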

CLASSFORMATS=("format-name1"<, "format-name2" ...>)

specifies the formats for the classification variables in the model. If you do not specify the CLASSFORMATS= option, the default format is applied for the classification variable. That default format was determined when the table was originally loaded into the server. In the following example, the CLASSFORMATS= values apply to variables x1 and x2.

Alias	CLASSFMT=
Example	glm y (x1 x2) = x3-x7 / classformats=("YN.", "F8.");

CODE <(code-generation-options)>

requests that the server produce SAS scoring code based on the actions that it performed during the analysis. The server generates DATA step code. By default, the code is replayed as an ODS table by the procedure as part of the output of the statement. More frequently, you might want to write the scoring code to an external file by specifying options.

The scoring code computes the predicted value of the response variable on the data scale (the inverse link scale) and prefixes the name with "P_". For example, if the response variable is Y, the generated code stores the predicted value as P_Y. The name of the variable is truncated to fit within the SAS name length requirements.

COMMENT

specifies to add comments to the code in addition to the header block. The header block is added by default.

FILENAME='path'

specifies the name of the external file to which the scoring code is written. This suboption applies only to the scoring code itself. If you request that the server generate IMSTAT programming statements with the IMSTAT suboption, then these statements are saved as an ODS table.

Alias

FILE=

FORMATWIDTH=k

specifies the width to use in formatting derived numbers such as parameter estimates in the scoring code. The server applies the BEST format, and the default format for code generation is BEST20.

Alias	FMTW=
Range	4 to 32

IMSTAT

specifies to generate IMSTAT programming statements that reproduce the analysis in addition to the scoring code. For example, this option is helpful when you perform variable selection and you want to capture the modeling code that reflects only the selected variables.

IMSTATONLY

specifies to generate the IMSTAT programming statements only. No scoring code is produced.

LABELID=id

specifies a group identifier for group processing. The identifier is an integer and is used to create array names and statement labels in the generated code.

LINESIZE=n

specifies the line size for the generated code.

Alias	LS=
Default	72
Range	64 to 256

NOTRIM

requests that the comparison of the formatted values for class variables and group-by variables is based on the full format width with padding. By default, the leading and trailing blanks are removed from the formatted values.

REPLACE

specifies to overwrite the external file with the new contents if the file already exists. This option has no effect unless you specify the FILENAME= option.
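
The CODE suboptions described above can be combined in a single request. The following sketch assumes an in-memory table lasr.class (a copy of Sashelp.Class) and a hypothetical output path; neither name comes from the documentation itself.

```sas
proc imstat;
   /* lasr.class is an assumed in-memory copy of Sashelp.Class */
   table lasr.class;
   /* Write commented scoring code to a hypothetical file,      */
   /* overwrite the file if it exists, and use 100-column lines */
   glm height = weight sex /
       code(filename='/tmp/glm_score.sas' replace comment linesize=100);
run;
quit;
```

Based on the description of the CODE option, the generated DATA step code would store the predicted value in a variable named P_Height.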

EXCLUDE=(list-of-ODS-tables)

specifies the result tables that you want to exclude from being generated on the server and from being sent to the SAS session. The GLM statement can generate the following tables:

Table Name	Table Alias	Description	Condition
ModelInfo		Information about the model—constant across groups or partitions.	This table is shown by default.
ClassLevels	Class	Information about the classification variables, such as the number of levels and their values.	This table is shown when classification variables are present in the model.
Dimensions	Dim	Model dimensions	This table is shown by default.
FitStatistics	Fit	Fit statistics customary for generalized linear models	This table is shown when it is requested with the SELECT= option.
OverallAnova	GlobalAnova	Model, source, and error decomposition of variation	This table is shown when classification variables are present in the model.
ModelAnova	ANOVA	Variance decomposition with significance tests for all model effects	This table is shown when classification variables are present in the model.
ParameterEstimates	ParmEstimates Pest	The solutions for the linear model coefficients	This table is shown when there are no classification variables in the model.
Tests3		Type III tests of model effects	This table is shown when it is requested with the SELECT= option.

Whether a table is shown by default or not, you can request any table with the SELECT= option in the GLM statement. The Condition column in the table identifies when a table is produced by default. For example, if the model contains classification variables, the statement shows an OverallAnova table and a ModelAnova table. If there are no classification variables, the statement shows a table of parameter estimates and no ANOVA tables.

FORMATS=("format-specification"<,...>)

specifies the formats for the GROUPBY variables. If you do not specify the FORMATS= option, or if you omit the entry for a GROUPBY variable, the default format is applied for that variable.

Enclose each format specification in quotation marks and separate each format specification with a comma.

Example

proc imstat data=lasr1.table1;
   statement / groupby=(a b) formats=("8.3", "$10");
quit;

FREQ=variable-name

specifies the numeric variable that provides frequencies for the analysis. For example, if the FREQ= variable has the value 5, then it implies that the record represents five such observations with identical values for the modeling variables. If you specify a FREQ= variable, then only the observations with a value that is not missing and greater than zero for the variable are used in the analysis.

GROUPBY=(variable-list)

specifies a list of variable names, or a single variable name, to use as GROUPBY variables in the order of the grouping hierarchy. If you do not specify any GROUPBY variable names, then the calculation is performed across the entire table—possibly subject to a WHERE clause.

GROUPBYMODE= DATA | MODEL | LASR

specifies the parallelization technique for group-by processing. The default is GROUPBYMODE=MODEL in which threads solve separate models following a lateral reconciliation of cross-product matrices. This mode is appropriate in situations with many groups and relatively small cross-product matrices. Model-parallel processing minimizes passes through the data.

Specify GROUPBYMODE=DATA to form the cross-product matrices in parallel across the data and one group at a time. This data-parallel technique is appropriate in situations with few groups and many observations per group or in applications with large cross-product matrices. Data-parallel processing consumes fewer resources than model-parallel processing but passes through the data more often.

If you specify GROUPBYMODE=LASR, then the server examines the data structure of the groups to select the parallelization mode.

Default

MODEL
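
A brief sketch of a group-by request that overrides the default parallelization mode follows. It assumes an in-memory table lasr.class (a copy of Sashelp.Class), which has few groups for the Sex variable; the table name is an assumption.

```sas
proc imstat;
   /* lasr.class is an assumed in-memory copy of Sashelp.Class */
   table lasr.class;
   /* Fit one model per level of Sex; form the cross-product   */
   /* matrices in parallel across the data, one group at a time */
   glm height = weight / groupby=(sex) groupbymode=data;
run;
quit;
```

Specifying groupbymode=lasr instead would leave the choice of mode to the server.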

GROUPFILTER=(filter-options)

specifies a section of the group-by hierarchy to be included in the computation. With this option, you can request that the server performs the analysis for only a subset of all possible groupings. The subset is determined by applying the group filter to a temporary table that you generate with the GROUPBY statement.

You can specify the following suboptions in the GROUPFILTER option:

DESCENDING

specifies the top section or the bottom section of the groupings to be collected. If the DESCENDING option is specified, the top LIMIT=n (where n > 0) groupings are collected. Otherwise, the bottom LIMIT=n groupings are collected.

Alias

DESC

LIMIT=n

specifies the maximum number of distinct groupings to be collected, where integer n >= 0. If n is zero, then all distinct groupings (up to 2³¹–1) that satisfy the boundary constraints, such as SCOREGT=f, are collected.

CAUTION:

High Cardinality Data Sets

Setting n to zero with high-cardinality data sets can significantly delay the response of the server.

SCOREGT=f

specifies the exclusive lower bound for the numeric scores of the distinct groupings to collect.

Alias

SGT=

SCORELT=f

specifies the exclusive upper bound for the numeric scores of the distinct groupings to collect.

Alias

SLT=

VALUEGT=("format-name1" <, "format-name2" ...>)

specifies the exclusive lower bound of the group-by variable’s formatted values for the distinct groupings to collect.

Alias

VGT=

VALUELT=("format-name1" <, "format-name2" ...>)

specifies the exclusive upper bound of the group-by variable’s formatted values for the distinct groupings to collect.

Alias

VLT=

TABLE=table-with-groupby-results

specifies the in-memory table from which to load the group-by hierarchy. If the TABLE= option is not specified, then all other GROUPFILTER= options are ignored.

The following program requests all the groupings of State, City, and then Trade_In_Model in the Cars_Program_All table. The groupings are ordered by the maximum value of New_Vehicle_Msrp for each grouping:

proc imstat;
    table example.cars_program_all;
    groupby state city trade_in_model / temptable 
                 weight=new_vehicle_msrp 
                 agg   =(max) 
                 order =weight;  
run;

The TEMPTABLE option in the GROUPBY statement directs the server to save all the groupings in a temporary in-memory table. The following DISTINCT statement requests the count of the distinct unformatted values of Sales_Type for each of the selected groupings of State, City, and Trade_In_Model.

    table example.cars_program_all;
    distinct sales_type / groupfilter=(
                 table  =mylasr.&_TEMPLAST_
                 scoregt=40000
                 valuelt=("FL","Ft Myers","")
                 limit  =20
                 descending);
run;

This example considers only groupings that have maximum values of New_Vehicle_Msrp above 40,000 and formatted values that are less than State="FL" and City="Ft Myers". The empty quotation marks result in no restriction on Trade_In_Model values. These groupings are ordered according to the maximum values of New_Vehicle_Msrp. Because of the DESCENDING option, this example collects the 20 top groupings within the specified group-by range for the DISTINCT analysis.

Interaction

If you specify the GROUPFILTER= option, then the GROUPBY= and FORMATS= options have no effect.

IDVARS=(variable-list)

IDVARS=variable-name

specifies the variables from the active table to transfer to the temporary table that is created by scoring the input table. This option has no effect unless the SCORE option is also specified. (See the SCORE option for details about which variables are added to the temporary table by default.) The IDVARS= option should be used to transfer additional columns from the input table to the scoring table.

Alias	ID=
Tip	Instead of this option, you can specify the ALLIDVARS option to transfer all variables from the input table to the scoring table.

INCLUDEMISS

specifies to treat missing values for classification variables as valid levels. If the INCLUDEMISS option is not specified, observations with missing values in the classification variables are not used in the analysis.

INFORMATIVE

requests that missing values are handled by modeling them through extra model effects. These effects consist of dummy variables that take on the value 1 when the value of a continuous model variable that is involved in the effect is missing. Otherwise, they are assigned the value 0. The missing value in the original model effect is replaced with the average value for the effect for the nonmissing values.

For continuous-by-class effects, such as A*x, where A is a classification variable and x is a continuous variable, informative missingness creates multiple dummy columns and substitutes the effect mean of x that corresponds to the respective level of A.

Specifying the INFORMATIVE option implies the INCLUDEMISS option. That is, when you choose to model informative missingness, then missing values for classification variables are treated as valid levels. For more information, see Informative Missingness.

Alias

INFORMMISS
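
As a hedged illustration, the following statements request informative missingness for a model with one classification variable and two continuous variables. The table lasr.heart is assumed to be an in-memory copy of Sashelp.Heart in which some continuous values, such as Cholesterol, are missing; the libref and table name are assumptions.

```sas
proc imstat;
   /* lasr.heart is an assumed in-memory copy of Sashelp.Heart */
   table lasr.heart;
   /* INFORMATIVE adds dummy effects for missing continuous     */
   /* values and implies INCLUDEMISS for the class variable Sex */
   glm systolic(sex) = cholesterol weight sex / informative;
run;
quit;
```

Without the INFORMATIVE option, observations with a missing Cholesterol or Weight value would not be used in the analysis.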

KEYORDER

requests that the results for a partitioned analysis are displayed in the order of the partition keys. If this option is not specified, then results are displayed by using the partitions on the first worker node followed by the partitions on the second node, and so on. Without this option, the results are likely to have random ordering of the partitions. The KEYORDER option makes result collection less efficient but produces a natural, predictable order.

MAXTESTLEV=n

specifies the maximum number of levels in an effect for which the server generates Type III tests. The idea behind the MAXTESTLEV= option is that testing effects for significance that have a large number of levels is typically not meaningful. The effects tend to be highly significant anyway, but determining the exact significance level is computationally intensive. The default value is 300 and implies that no test statistics are produced for any effect that has more than 300 levels.

Default

300

NAME=SAS-name

specifies the name to use for identifying the model in the server output and in the temporary table of results generated by the TEMPTABLE option. SAS name rules apply. For example, the following statements add the 'Model' entry to the ModelInformation table.

proc imstat;
   table hps.iris;
   glm sepalwidth = sepallength / name = FirstModel;
run;

NOCLPRINT <=n>

specifies the number of levels for each classification variable to show in the Class Level Information ODS table. If you do not specify the NOCLPRINT option, all unique values are shown in the order of the class variable levelization. If you specify NOCLPRINT=n, then the values are shown only for those classification variables that have fewer than n levels. The value for n must be at least 1.

If you specify the NOCLPRINT option without specifying a value for n, then n = 0 is assumed. This enables you to get a listing of the classification variables in the model. This might be useful if you did not identify classification variables explicitly—without listing their (possibly many) levels.

For example, the following Class Level Information table is displayed with NOCLPRINT=4. Because the number of levels for variable Smoking_Status exceeds 4, the values are not displayed.

NOINT

suppresses the inclusion of an intercept in the model. By default, the GLM statement adds an intercept as the first model effect to the model. Exclusion of the intercept is useful in certain models to achieve a desired interpretation of the model effects.

For example, the following code sample shows a cell-means model where the coefficients in the β vector estimate the means of Y in the groups associated with levels of A.

glm y (A) = A / noint;

NOPREPARSE

prevents the procedure from preparsing and pregenerating code for temporary expressions, scoring programs, and other user-written SAS statements.

When this option is specified, the user-written statements are sent to the server "as is" and then the server attempts to generate code from them. If the server detects problems with the code, the error messages might not be as detailed as the messages that are generated by the SAS client. If you are debugging your user-written program, then you might want to preparse and pregenerate code in the procedure. However, if your SAS statements compile and run as you want them to, then you can specify this option to avoid the work of parsing and generating code on the SAS client.

When you specify this option in the PROC IMSTAT statement, the option applies to all statements that can generate code. You can also exclude specific statements from preparsing by using the NOPREPARSE option in statements that allow temporary columns or the SCORE statement.

Alias

NOPREP

PARTITION<=partition-key>

specifies to fit the model separately for each value of the partition key. In other words, the partition variables function as automatic group-by variables for the request.

If you do not specify a value for partition-key, then the analysis is performed for all partitions. If you do specify a value, then the analysis is performed for the specified key value only. You can use the PARTITIONINFO statement to retrieve the valid partition-key values for a table.

Alias

PART=
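
The following sketch fits the model once per partition and returns the results in key order. It assumes a hypothetical in-memory table lasr.class_part that was partitioned by Sex when it was loaded; the table name and its partitioning are assumptions.

```sas
proc imstat;
   /* lasr.class_part is a hypothetical table partitioned by Sex */
   table lasr.class_part;
   /* One model per partition key value; KEYORDER displays the   */
   /* results in the order of the partition keys                 */
   glm height = weight / partition keyorder;
run;
quit;
```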

ROLEVAR=variable-name

specifies a variable in the in-memory table that defines whether an observation belongs to the training set, the validation set, or is to be excluded from the analysis. The role variable can have a numeric or character type, and it can be a temporary computed variable.

If the role variable data type is numeric, the values of variable-name are interpreted as follows:

value = 1: this observation is in the training set
value = 2: this observation is in the validation set
any other value: this observation is to be excluded from the analysis

If the role variable data type is character, the values of variable-name are interpreted as follows:

If the first non-blank character is 't' or 'T', then the observation is in the training set.
If the first non-blank character is 'v' or 'V', then the observation is in the validation set.
Any other value for the first non-blank character, including an all blank entry, leads to the exclusion of the observation from the analysis.

Alias	ROLE=
Interactions	You can divide the data at random into training and validation sets by providing the VALIDATE= and SEED= options.
Interactions	If you specify both the ROLEVAR= option and the VALIDATE= options, then the ROLEVAR= setting supersedes the VALIDATE= option.

SCORE <(score-statistic1 score-statistic2 ...)>

requests that the active table be scored after the model is fit and the results be stored in a temporary table. The server automatically adds all model variables to the temporary table with the score results. These results include the response variable, the class variables, all explanatory variables from which effects are formed, and the WEIGHT= and FREQ= variables.

In addition, if the active table is partitioned or ordered, the partition variables and order-by variables are transferred from the input table to the temporary table. The temporary table is partitioned and ordered in the same way as the active table.

If the analysis uses the GROUPBY= option, the variables in the group-by list are also transferred to the scoring table. If you want to transfer additional variables, you can specify them with the IDVARS= option.

If you do not specify the list of score statistics, default statistics are computed. These statistics are identified with Yes in the Default column in the table below. You can request that the following statistics be computed for each observation:

Keyword and Aliases	Column Name	Description	Default
PRED, PREDICTED, MEAN	_PRED_	Predicted value	Yes
RESID, RESIDUAL, R	_RESID_	Raw residual (observed - predicted)	Yes
STUDENT	_STUDENT_	Studentized residual	Yes
RSTUDENT	_RSTUDENT_	Studentized residual with the current observation removed	Yes
LEVERAGE, H	_LEVERAGE_	Leverage value of the observation	Yes
STDP	_STDP_	Standard error of the mean predicted value	No
STDR	_STDR_	Standard error of the residual	No
STDI	_STDI_	Standard error of the (individual) predicted value	No
LCLM, LOWERMEAN	_LCLM_	Lower confidence limit for the mean of the predicted value	No
UCLM, UPPERMEAN	_UCLM_	Upper confidence limit for the mean of the predicted value	No
LCL, LOWERPRED	_LCL_	Lower confidence limit for the predicted value	No
UCL, UPPERPRED	_UCL_	Upper confidence limit for the predicted value	No
COOKD, COOKSD	_COOKD_	Cook's D influence measure	No
DFFITS	_DFFITS_	Standardized influence of the observation on predicted value	No
COVRATIO	_COVRATIO_	Standardized influence of the observation on the covariance matrix of the parameter estimates	No
LIKEDIST, LD	_LIKEDIST_	Displacement (distance) of log-likelihood when the observation is removed (assuming normal distribution)	No

If you specify SCORE(_ALL_), then the server calculates and adds all the possible output statistics to the temporary table. The confidence levels for the LCLM, LCL, UCLM, and UCL confidence bounds are determined from the significance level specified in the ALPHA= option as 100 × (1 − α)%. The default is α = 0.05.

The server determines the column names for the output statistics. This differs from many SAS procedures where you can specify the name for the statistic.
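
A minimal sketch of a scoring request follows. It assumes an in-memory table lasr.class (a copy of Sashelp.Class); the table name is an assumption, and the keywords come from the table of score statistics above.

```sas
proc imstat;
   /* lasr.class is an assumed in-memory copy of Sashelp.Class */
   table lasr.class;
   /* Request predicted values, residuals, and 90% confidence   */
   /* limits for the mean; carry Name into the scoring table    */
   glm height = weight sex /
       score(pred resid lclm uclm) idvars=(name) alpha=0.1;
run;
quit;
```

The temporary table would then contain the columns _PRED_, _RESID_, _LCLM_, and _UCLM_ in addition to the model variables and Name.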

SEED=number

specifies the random number seed for generating random numbers. The random number is used to determine whether an observation belongs to the training or validation data set. The SEED= option has no effect unless you specify the VALPROP= option. If the specified number is negative or zero, the random number generation is based on the computer clock of the server—this generates a non-reproducible random number sequence.

SELECT=(list-of-ODS-tables)

specifies the list of ODS tables that you want to display for the analysis. The specified list replaces the default tables that are generated by the server and displayed. See the EXCLUDE= option for the list of default tables and the table names that you can display.
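
For example, the following sketch restricts the output to the parameter estimates in a model that contains a classification variable, a case in which that table is not among the defaults. The table lasr.class is an assumed in-memory copy of Sashelp.Class.

```sas
proc imstat;
   /* lasr.class is an assumed in-memory copy of Sashelp.Class */
   table lasr.class;
   /* Display only the ParameterEstimates table */
   glm height(sex) = weight sex / select=(ParameterEstimates);
run;
quit;
```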

SHOWSELECTED

requests that the server perform variable selection for the model. A backward selection method is used, where the significance level for an effect to remain in the model is determined by the SLSTAY= option. This option performs variable selection like the VARSEL option, but in contrast to the latter option, it displays output only for the selected effects.

Alias

SHOWSEL

SLSTAY=α

specifies the significance level used in determining whether effects should stay in the model during variable selection.

Default	0.1
Range	0 to 1
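
As a sketch of backward selection with a stricter stay level than the default, the following statements assume an in-memory table lasr.class (a copy of Sashelp.Class); the table name is an assumption.

```sas
proc imstat;
   /* lasr.class is an assumed in-memory copy of Sashelp.Class */
   table lasr.class;
   /* Backward selection: effects must be significant at the    */
   /* 0.05 level to stay; output is limited to selected effects */
   glm height(sex) = weight age sex / showselected slstay=0.05;
run;
quit;
```

Replacing showselected with varselection would run the same selection but report all effects.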

TEMPEXPRESS="SAS-expressions"

TEMPEXPRESS=file-reference

specifies either a quoted string that contains the SAS expression that defines the temporary variables or a file reference to an external file with the SAS statements.

Alias

TE=

TEMPNAMES=variable-name

TEMPNAMES=(variable-list)

specifies the list of temporary variables for the request. Each temporary variable must be defined through SAS statements that you supply with the TEMPEXPRESS= option.

Alias

TN=

TEMPTABLE

generates an in-memory temporary table from the result set. The IMSTAT procedure displays the name of the table and stores it in the &_TEMPLAST_ macro variable, provided that the statement executed successfully.

When the IMSTAT procedure exits, all temporary tables created during the IMSTAT session are removed. Temporary tables are not displayed on a TABLEINFO request, unless the temporary table is the active table for the request.

Interaction

For information about the interaction between the TEMPTABLE, CODE, and SCORE options, see Temporary Tables, Generated Code, and Scoring.

VALIDATE=f

specifies the proportion f in the validation data set.

Alias	VALPROP=
Range	0 to 1
Interaction	If you specify both the ROLEVAR= option and the VALIDATE= option, then the ROLEVAR= setting supersedes the VALIDATE= option.
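
The following sketch splits the data at random into training and validation sets. It assumes an in-memory table lasr.class (a copy of Sashelp.Class); the table name and the seed value are arbitrary choices for illustration.

```sas
proc imstat;
   /* lasr.class is an assumed in-memory copy of Sashelp.Class */
   table lasr.class;
   /* Assign about 30% of the observations to validation;       */
   /* a positive seed makes the random split reproducible       */
   glm height = weight age / validate=0.3 seed=12345;
run;
quit;
```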

VARSELECTION

specifies that the server perform variable selection for the model. A backward selection method is used, where the significance level for an effect to remain in the model is determined by the SLSTAY= option. In contrast to the SHOWSEL option, all effects are reported in the IMSTAT output.

Alias

VARSEL

VIF

produces variance inflation factors and tolerances, the reciprocal of the VIF, for the parameter estimates.

WEIGHT=variable-name

specifies the numeric variable to use as a weighting variable in solving the linear model.

When you specify a WEIGHT= variable, the normal equations

$$X'X\beta = X'Y$$

are replaced by

$$X'WX\beta = X'WY$$

where W is a diagonal matrix with the values of the variable specified in the WEIGHT= option on the diagonal. Only the observations with a weight value that is not missing and greater than zero are used in the analysis.
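
As a worked form of the weighted solution, the following sketch assumes that X'WX is nonsingular (an assumption made here for illustration; the source does not state how singular cases are handled). It writes w_i for the i-th diagonal element of W and x_i' for the i-th row of X:

```latex
\hat{\beta}_W = (X'WX)^{-1} X'WY
\qquad\text{which minimizes}\qquad
\sum_{i=1}^{n} w_i \,(y_i - x_i'\beta)^2
```

When every weight equals 1, W is the identity matrix and the expression reduces to the unweighted least squares solution.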

Details

Basic Syntax

The basic syntax of the GLM statement requires that you specify the response variable (the dependent variable), an equal sign (=) and then the model effects. The dependent variable must be numeric. In contrast to the GLM procedure, you can specify only one dependent variable in the GLM statement of the IMSTAT procedure.

The underlying statistical model of the GLM statement is as follows:

$$Y = X\beta + \epsilon$$

where Y is an (n × 1) random vector of the dependent variable, X is an (n × p) design matrix, β is a (p × 1) vector of coefficients, and ε is an (n × 1) vector of random disturbances (errors). The key assumptions of a GLM type of model are that the errors ε are uncorrelated, homoscedastic (have the same variance σ²), and have zero mean. If these assumptions are met, then the model is correct and the elements of the vector Y – Xβ are stochastically unrelated. The goals of the GLM analysis are as follows:

to estimate the unknowns β and σ²
to diagnose the appropriateness of the specified model
to select appropriate variables and terms for the X matrix
to predict the average response and unobserved values of the response variable with confidence
to test hypotheses about the elements of β provided that the model is acceptable
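
For the first goal, the standard least squares estimators can be sketched as follows, assuming that X has full column rank p (an assumption for illustration; the source does not describe the rank-deficient case):

```latex
\hat{\beta} = (X'X)^{-1} X'Y,
\qquad
\hat{\sigma}^2 = \frac{(Y - X\hat{\beta})'(Y - X\hat{\beta})}{n - p}
```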

A model effect is a syntactic expression of how one or more variables act together to define columns in the design matrix X of the linear statistical model. In other words, how you specify the model-effects on the right-hand side of the GLM statement affects how the server constructs the X matrix and how you interpret the results of the analysis pertaining to the contributing variables.

There are a few basic types of effects:

the intercept is included by default in every model. It is then the leading effect in X and simply adds a column of ones to this matrix. The intercept can be approximately interpreted as an adjustment for the mean of the response variable.
a continuous effect consists of only numeric non-class variables. The simplest continuous effect contains only one variable. For example, if you add the numeric variable Age to the model, you are adding a continuous effect. If the variable Height is not a classification variable, and you add the term Age*Height to the model, you are adding a continuous interaction effect.
a classification variable is a variable that is used in the model not through its raw values, but through an encoding of its unique (formatted) values. For example, if variable Gender is used as a classification variable in the model, then it represents two levels.
a classification effect is a model effect that contains one or more classification variables. A "pure" classification effect comprises only classification variables; a continuous classification effect also involves some continuous variables.
- If A and B are classification variables, and X and Z are non-classification variables, then the effects A, B, and A*B are pure classification effects, termed the A and B main effect and the interaction of A and B, respectively. The effect A*Z would be a continuous classification effect.
- Effect A is said to be nested within effect B if levels of A within one level of B do not mean the same thing for other levels of B. Nested effects are expressed with parenthetical notation. For example, if City and State are classification variables, then City(State) represents the nested effect of cities within states. One example of appropriate nesting is when city #1 in Alaska refers to a different city than city #1 in Colorado.
- Two effects are said to be crossed if the levels of one effect retain their interpretation across the levels of the other effect. If Married is a classification variable that groups individuals into married and unmarried status, and Gender is a two-level variable, the Gender*Married effect is crossed, because a man in the unmarried group is also a man in the married group.
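
The crossed and nested effects described above can be sketched in a single GLM statement. The table lasr.survey and the variables Income, Gender, Married, City, and State are hypothetical names chosen to match the examples in the text.

```sas
proc imstat;
   /* lasr.survey is a hypothetical in-memory table */
   table lasr.survey;
   /* Gender*Married is a crossed effect;            */
   /* City(State) nests cities within states         */
   glm income(gender married city state) =
       gender married gender*married city(state);
run;
quit;
```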

Deciding which variables to involve in a statistical model and how the variables should act and interact is key in modeling. The following rules apply for the GLM statement in the IMSTAT procedure:

A character variable that is used in a model effect is treated as a classification variable.
A numerical variable that is used in a model effect is treated as a non-classification variable.
All variables explicitly listed in the optional variable list that follows the specification of the dependent variable in the GLM statement are classification variables.
The role of temporary computed variables is determined by the data type.
- If the computed column is of character type, then it is automatically added to the model as a classification variable if it appears in a model effect.
- If the computed column is of numeric type, then it is treated as a classification variable only if it is specified in the list of class-variables.
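
As an illustration of the third rule, the following sketch treats a numeric variable as a classification variable by listing it after the dependent variable. It assumes an in-memory table lasr.class (a copy of Sashelp.Class), in which Age is numeric; the table name is an assumption.

```sas
proc imstat;
   /* lasr.class is an assumed in-memory copy of Sashelp.Class */
   table lasr.class;
   /* Age is numeric but is listed as a class variable, so it   */
   /* enters the model through its levels, not its raw values   */
   glm height(age) = weight age;
run;
quit;
```

If Age were omitted from the parenthesized list, it would enter the model as a continuous effect.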

The following example of modeling the Sashelp.Class data set shows how variables act and interact. The following GLM statement models a student's height as a function of his or her weight and gender:

glm height = weight sex;

The Sex variable, because it has character type, is treated as a classification variable. The following GLM statement is equivalent, but it expresses the classification variables explicitly:

glm height(sex) = weight sex;

Computed columns can also be used. The following example uses the same variables that were used in the previous examples, but specifies them as computed columns to demonstrate the syntax:

table lasr.class(tempnames=(t1 t2));
glm height = t1 t2 / tn=(t1 t2) te="t1=weight; t2=sex;";

The analysis that uses the computed columns (T1 and T2) is identical to the previous GLM statement. This is because T2 would be discovered by the server to be of character type and would be added automatically to the list of classification variables.

Informative Missingness

The concept of informative missingness is one way to account for missing values in statistical analyses and, in particular, statistical modeling. Missing values are a problem because they reduce the amount of available data. When working with classification variables (factors, which are levelized variables), a missing value can be treated as an actual level of the variable and can participate in the analysis.

When continuous variables have missing values, however, the observation is removed from the analysis. In data with many missing values, this can reduce the amount of available data greatly, and the sets of observations used in one model versus another model can vary based on which variables are included in the model.

Of course, there are many reasons for missing values and substituting values for missing values has to be done with caution. For example, the famous Framingham Heart study data set contains 5,209 observations on subjects in a longitudinal study that helped understand the relationship between smoking, cholesterol, and coronary heart disease. One of the variables in the data set is AgeCHDdiag. This variable represents the age at which a patient was diagnosed with coronary heart disease (CHD). If you include this variable in a statistical model, only 1,449 observations are available, since the value cannot be observed unless a patient has experienced CHD. Including this variable acts as a filter that reduces the analysis set to the subjects with CHD. We cannot impute the value for subjects where the variable has a missing value, because we cannot impute an age at which someone who has not had CHD would have contracted coronary heart disease.

With informative missingness, we are not so much substituting imputed values for the missing values as we are modeling the missingness. Consider a simple linear regression model:

$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$

Suppose that some of the values for the regressor variable x are missing. The fitted model uses only observations for which y and x have been observed.

In order to predict the outcome y for an observation with missing x, we either assume that y is missing or substitute a value for the missing x, such as the average value, $\bar{x}$. Because the estimate for the intercept is $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$ in the simple linear regression model, the predicted value would be the average response of the nonmissing values, $\hat{y} = \bar{y}$.

With informative missingness, we extend the model by adding extra effects for each effect that contains at least one continuous variable. In the simple linear regression model, we add one column to the model and slightly change the content of the x variable:

$$y_i = \beta_0 + \beta_1 x_i^{*} + \beta_2 x_i^{**} + \epsilon_i$$

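The variable x^* contains the original values of x if these are not missing, and the average of x otherwise:

$$x_i^{*} = \begin{cases} x_i & \text{if } x_i \text{ is not missing} \\ \bar{x} & \text{otherwise} \end{cases}$$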

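The variable x^** is a dummy variable with value 1 when x is missing, and zero otherwise:

$$x_i^{**} = \begin{cases} 1 & \text{if } x_i \text{ is missing} \\ 0 & \text{otherwise} \end{cases}$$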

The fitted model is not the same model that results from substituting $\bar{x}$ for the missing values during training. This can be seen because the model that simply substitutes $\bar{x}$ for the missing values is as follows:

$$y_i = \beta_0 + \beta_1 x_i^{*} + \epsilon_i$$

The informative missing model has an extra parameter, and unless all values of x^** are zero—in which case there are no missing values—the informative missing model has a higher R² value, because it picks up more variation.

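The parameter estimate for $\beta_2$, the coefficient of the dummy variable x^**, measures the amount by which the predicted value for an observation with missing x differs from a predicted value at $\bar{x}$.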
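
To make the role of the extra parameter concrete, the predictions from the informative missing model can be sketched as follows. The notation is an assumption for illustration: the hatted β values are the fitted intercept, the slope on x^*, and the coefficient of the dummy variable x^**, and $\bar{x}$ is the average of the nonmissing x values.

```latex
\hat{y}_i =
\begin{cases}
\hat{\beta}_0 + \hat{\beta}_1 x_i & \text{if } x_i \text{ is observed} \\
\hat{\beta}_0 + \hat{\beta}_1 \bar{x} + \hat{\beta}_2 & \text{if } x_i \text{ is missing}
\end{cases}
```

Under unweighted least squares, the normal equation for the dummy column forces the residuals of the observations with missing x to sum to zero. The intercept and slope therefore match the fit to the observations with observed x, and the dummy coefficient equals the mean response among observations with missing x minus the mean response among observations with observed x.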

ODS Table Names

The ODS tables that can be generated with the GLM statement are described in the EXCLUDE= option.

For information about using the ODS table with the SAVE= option, see the Details section of the STORE statement.