The GENESELECT Procedure (Experimental)

Overview: GENESELECT Procedure

The GENESELECT procedure identifies influential genetic and environmental variables and their interactions by fitting a model to predict a trait and then evaluating the influence that the predictor variables and their interactions have on the model.

The GENESELECT procedure does the following:

  • allows qualitative and quantitative variables

  • allows a large number of predictor variables

  • minimizes bias from missing values

  • outputs an influence measure for single variables and interactions

  • outputs an influence measure of a variable for each observation

  • outputs predictions, including probabilities for qualitative traits

  • outputs a matrix of similarities between predictor variables

The procedure can handle a large number of variables because only one variable enters the model at a time. The procedure usually stops before using all the available variables.

If two variables are highly correlated, the model might use only one of the variables. The other variable would therefore not appear influential, although it might actually be the biologically relevant variable. The GENESELECT procedure estimates how similar the model would be if a variable in the model were replaced by another, and then outputs the set of estimates as a similarity matrix. The matrix includes variables not in the model. Thus, when the model identifies one variable, the similarity matrix identifies the set of similar variables, one of which might be biologically important. Note that similar variables need not be correlated.
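One way to picture such a similarity estimate is to ask how well a split on a second variable can reproduce the partition made by a split on the first. The following is a minimal sketch under that assumption; it is not the measure that the GENESELECT procedure computes. The function name split_agreement and the threshold-style splitting rule are hypothetical and chosen only for illustration.

```python
def split_agreement(a, b, threshold_a):
    """Illustrative similarity (an assumption, not the procedure's measure).

    Returns the best fraction of observations that some threshold split on
    variable b assigns to the same segment as the split a <= threshold_a.
    Observations missing either value (None) are ignored.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return 0.0
    left_a = [x <= threshold_a for x, _ in pairs]
    best = 0.0
    for t in sorted({y for _, y in pairs}):
        same = sum((y <= t) == side for (_, y), side in zip(pairs, left_a))
        frac = same / len(pairs)
        # A split that sends every observation to the opposite segment
        # reproduces the same partition, so count either orientation.
        best = max(best, frac, 1.0 - frac)
    return best


a = [1, 2, 3, 4]
print(split_agreement(a, [10, 20, 30, 40], threshold_a=2))  # same partition
print(split_agreement(a, [1, 2, 1, 2], threshold_a=2))      # weak agreement
```

Evaluating this quantity for every pair of variables, including variables that never entered the model, would give a square matrix of the kind described above.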

If two variables are highly correlated and the model uses both of them, the credit for influence is shared between them. Neither receives the credit it would receive if it alone represented the pair in the model. To estimate the influence of a variable correctly, avoid using similar variables in the model.

A predictive model is likely to find fewer influential variables than a set of association tests would. For example, the GENESELECT procedure is likely to find fewer influential variables than the CASECONTROL procedure. The reason is simple: once a predictive model fits the data sufficiently well, it needs no further variables, not even ones that are associated with the trait.

The GENESELECT procedure minimizes bias from missing values in two ways. When entering a variable into the fit of the model, the procedure uses all observations with nonmissing values for that variable and ignores all observations in which that variable is missing. When applying the model for prediction, the procedure replaces a missing value with the distribution of nonmissing values among all observations used during the fit.
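The two steps can be illustrated with a single splitting rule. The following is a minimal sketch, assuming a numeric predictor, a numeric trait, and a fixed threshold; the names Stump, fit_stump, and predict are hypothetical and do not correspond to any part of the procedure. A missing value is represented by None. At prediction time, the sketch interprets "the distribution of nonmissing values" as weighting each segment's prediction by the share of training observations that fell into that segment.

```python
from dataclasses import dataclass


@dataclass
class Stump:
    threshold: float
    left_mean: float      # mean trait where x <= threshold
    right_mean: float     # mean trait where x > threshold
    left_fraction: float  # share of nonmissing training rows sent left


def fit_stump(x, y, threshold):
    """Fit one split, using only observations whose x is not missing."""
    rows = [(xi, yi) for xi, yi in zip(x, y) if xi is not None]
    left = [yi for xi, yi in rows if xi <= threshold]
    right = [yi for xi, yi in rows if xi > threshold]
    if not left or not right:
        raise ValueError("threshold leaves one segment empty")
    return Stump(threshold,
                 sum(left) / len(left),
                 sum(right) / len(right),
                 len(left) / len(rows))


def predict(stump, xi):
    """Predict the trait; a missing x is spread over both segments."""
    if xi is None:
        # Assumption: weight each segment by its training share.
        return (stump.left_fraction * stump.left_mean
                + (1.0 - stump.left_fraction) * stump.right_mean)
    return stump.left_mean if xi <= stump.threshold else stump.right_mean


# The third observation is missing x, so its trait value (100) is ignored.
stump = fit_stump([1, 2, None, 4, 5], [0, 0, 100, 1, 1], threshold=2)
print(predict(stump, 1), predict(stump, 5), predict(stump, None))
```

Because the observation with the missing predictor is left out of the fit, its extreme trait value does not distort either segment mean.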

The GENESELECT procedure fits a model by recursively partitioning the data. A partitioning procedure is one that searches for an optimal partition of the data. The partitioning or splitting rule is defined in terms of the values of a single variable. Optimality depends on the distribution of the trait across the partition segments. The more similar the trait values are within the segments, the better the partition. Decision trees are the most common type of recursive partitioning model. The GENESELECT procedure fits a boosted series of decision trees by default.
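The search for a partition on a single variable can be sketched as follows. This is a minimal illustration, assuming a numeric predictor, a quantitative trait, two segments, and the within-segment sum of squared deviations as the measure of how similar the trait values are. The criterion and the names sse and best_split are assumptions made for the example, not the procedure's documented method.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, score) for the two-segment split on x with the
    smallest total within-segment SSE of y, or None if x has fewer than
    two distinct nonmissing values."""
    rows = [(xi, yi) for xi, yi in zip(x, y) if xi is not None]
    distinct = sorted({xi for xi, _ in rows})
    best = None
    # Candidate thresholds lie midway between adjacent distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [yi for xi, yi in rows if xi <= t]
        right = [yi for xi, yi in rows if xi > t]
        score = sse(left) + sse(right)
        if best is None or score < best[1]:
            best = (t, score)
    return best


print(best_split([1, 2, 3, 10, 11, 12], [5, 5, 5, 9, 9, 9]))
```

In this example the trait is constant within each segment at the chosen threshold, so the score is zero and no other split can do better.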

A recursive partitioning procedure partitions the data into subsets and then partitions each of the subsets, and so on. In the terminology of the tree metaphor, the partition subsets are nodes, the original data set is the root node, and the final, unpartitioned subsets are terminal nodes or leaves. Nodes that are not terminal nodes are sometimes called internal nodes. The subsets of a single partition are commonly called child nodes, thereby mixing the metaphor with genealogy, which also provides the terms descendant and ancestor nodes. A branch of a node consists of a child node and its descendants.
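The terminology maps directly onto a small tree data structure. The following sketch is for illustration only; the Node class and the branch function are hypothetical names and say nothing about how the procedure stores its trees.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)

    def is_leaf(self):
        """A terminal node (leaf) has not been partitioned further."""
        return not self.children

    def descendants(self):
        """All nodes below this one: children, their children, and so on."""
        found = []
        for child in self.children:
            found.append(child)
            found.extend(child.descendants())
        return found

    def leaves(self):
        """Terminal nodes at or below this node."""
        if self.is_leaf():
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]

    def internal_nodes(self):
        """Nodes at or below this node that are not terminal."""
        if self.is_leaf():
            return []
        return [self] + [n for c in self.children for n in c.internal_nodes()]


def branch(child):
    """A branch consists of a child node and its descendants."""
    return [child] + child.descendants()


# The root node is the original data set, partitioned into A and B;
# A is partitioned again into A1 and A2.
a = Node("A", [Node("A1"), Node("A2")])
root = Node("root", [a, Node("B")])
print([n.name for n in root.leaves()])
print([n.name for n in root.internal_nodes()])
print([n.name for n in branch(a)])
```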