Multivariate Analysis: Principal Component Analysis
A biplot is a display that attempts to represent both the observations and variables of multivariate data in the same plot. SAS/IML Studio provides biplots as part of the Principal Component analysis.
The computation of biplots in SAS/IML Studio follows the presentation given in Friendly (1991) and Jackson (1991). Detailed discussions of how to compute and interpret biplots are available in Gabriel (1971) and Gower and Hand (1996).
The computation of a biplot begins with the data matrix. If you choose to compute principal components from the covariance matrix (on the Method tab; see Figure 26.3), then the data matrix is centered by subtracting the mean of each column. Otherwise, it is standardized so that each variable has zero mean and unit standard deviation.
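The two preprocessing options can be sketched in a few lines of NumPy. This is only an illustration, not SAS/IML Studio code: the function name `prepare_data`, the flag `use_covariance`, and the toy data are hypothetical, and the use of the sample standard deviation (`ddof=1`) is an assumption.

```python
import numpy as np

def prepare_data(data, use_covariance):
    """Center each column; unless the covariance matrix is used,
    also scale each column to unit standard deviation."""
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)      # subtract each column mean
    if use_covariance:
        return centered
    # Assumption: sample standard deviation (divisor N - 1)
    return centered / data.std(axis=0, ddof=1)

# Hypothetical toy data: 4 observations, 2 variables
toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 60.0]])
X_cov = prepare_data(toy, use_covariance=True)    # centered only
X_std = prepare_data(toy, use_covariance=False)   # centered and scaled
```

In both cases every column of the result has zero mean; only the standardized version also puts the variables on a common scale.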
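In either case, let $X$ denote the resulting matrix. The singular value decomposition (SVD) of $X$ is the factorization

$$X = U L V^{T}$$

where $U^{T}U = V^{T}V = I$ and $L$ is a diagonal matrix of singular values. For an exponent $\alpha$ with $0 \le \alpha \le 1$, the factorization can be rewritten as $X = (U L^{\alpha})(L^{1-\alpha} V^{T}) = G H^{T}$, where $G = U L^{\alpha}$ and $H = V L^{1-\alpha}$.

In a biplot, the rows of the matrix formed by the first two columns of $G$ are plotted as points, which correspond to observations. The rows of the matrix formed by the first two columns of $H$ are plotted as vectors, which correspond to variables.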
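As a rough sketch of this computation, the following NumPy code assumes the conventional biplot factorization in which the SVD $X = U L V^{T}$ is split as $G = U L^{\alpha}$ and $H = V L^{1-\alpha}$ for an exponent $\alpha$ between 0 and 1. The function name `biplot_factors` and the random toy data are hypothetical, and the code does not reproduce the additional scalings that SAS/IML Studio might apply.

```python
import numpy as np

def biplot_factors(X, alpha=0.0):
    """Split a centered (or standardized) matrix X as G @ H.T using the SVD."""
    X = np.asarray(X, dtype=float)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    G = U * s**alpha              # column k of U scaled by s[k]**alpha
    H = Vt.T * s**(1.0 - alpha)   # column k of V scaled by s[k]**(1 - alpha)
    return G, H

# Hypothetical toy data: 6 observations, 3 variables, centered by column
rng = np.random.default_rng(0)
raw = rng.normal(size=(6, 3))
X = raw - raw.mean(axis=0)

G, H = biplot_factors(X, alpha=0.5)
points = G[:, :2]    # one row per observation, plotted as points
vectors = H[:, :2]   # one row per variable, plotted as vectors
```

The full product `G @ H.T` recovers `X` exactly; the biplot keeps only the first two columns of each factor, so it displays a rank-two approximation of the data.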
The choice of the scaling exponent $\alpha$ determines the scaling of the observations and vectors in the biplot. In general, it is impossible to accurately represent the variables and observations in only two dimensions, but you can choose values of $\alpha$ that preserve certain properties of the high-dimensional data. Common choices are $\alpha = 0$, $1/2$, and $1$. SAS/IML Studio implements four different versions of the biplot:
The axes at the bottom and left of the biplot are the coordinate axes for the observations. The axes at the top and right of the biplot are the coordinate axes for the vectors.
If the data matrix is not well approximated by a rank-two matrix, then the visual information in the biplot is not a good approximation to the data. In this case, you should not try to interpret the biplot. However, if the data matrix $X$ is close to a rank-two matrix, then you can interpret a biplot in the following ways:
For example, in Figure 26.11 the two principal components account for approximately 74% of the variance in the data. This means that the biplot is a fair (but not good) approximation to the data. The footnote in the plot indicates that the biplot is based on the COV factorization and that the data matrix was standardized (STD).
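One way to quantify how close a matrix is to rank two is the share of the total sum of squares carried by its two largest singular values. For a centered or standardized data matrix, this equals the proportion of variance explained by the first two principal components. The sketch below uses a hypothetical helper `rank_two_fit` and made-up matrices; it does not reproduce the baseball data or the 74% figure.

```python
import numpy as np

def rank_two_fit(X):
    """Fraction of the total sum of squares of X captured by the
    best rank-two approximation (first two singular values)."""
    s = np.linalg.svd(np.asarray(X, dtype=float), compute_uv=False)
    return (s[:2] ** 2).sum() / (s ** 2).sum()

# A matrix that is exactly rank two: the fit is 1 (up to rounding)
a = np.array([1.0, -1.0, 2.0, 0.0, -2.0])
b = np.array([0.5, 1.5, -1.0, -2.0, 1.0])
exact_rank_two = np.outer(a, [1.0, 2.0, 3.0]) + np.outer(b, [3.0, -1.0, 0.0])
fit_exact = rank_two_fit(exact_rank_two)

# A matrix with four equally important directions: the fit is only 0.5
fit_poor = rank_two_fit(np.eye(4))
```

A value near 1 indicates that the two-dimensional display loses little information; a value near 0.5, as in the second case, indicates that half of the structure is invisible in the plot.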
Because the biplot is only a moderately good approximation to the data, the following statements are approximately true:
The variables are grouped: the hitting variables point primarily in the direction of the horizontal axis; no_assts and no_error point primarily in the direction of the vertical axis. The no_outs vector is much shorter than the other vectors, which often indicates that the vector does not lie near the span of the two biplot dimensions.
The hitting variables are strongly correlated with each other. The variables no_assts and no_error are correlated with each other, but they are not correlated with the hitting variables or with no_outs.
Figure 26.11: Biplot for Baseball Data
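Statements about correlation between variables rest on a standard property of biplots with $\alpha = 0$: when all dimensions are kept, the cosine of the angle between two variable vectors equals the correlation between the corresponding variables, and the two-dimensional display approximates that angle. The following NumPy sketch checks the property on made-up data; the helper `cosine_matrix` and the data are hypothetical and are unrelated to the baseball data set.

```python
import numpy as np

def cosine_matrix(H):
    """Cosines of the angles between all pairs of rows of H."""
    unit = H / np.linalg.norm(H, axis=1, keepdims=True)
    return unit @ unit.T

# Hypothetical data: 50 observations, 3 variables, two of them correlated
rng = np.random.default_rng(1)
raw = rng.normal(size=(50, 3))
raw[:, 1] += 0.8 * raw[:, 0]

X = raw - raw.mean(axis=0)                 # center the columns
U, s, Vt = np.linalg.svd(X, full_matrices=False)
H = Vt.T * s                               # alpha = 0, so H = V L

cos_full = cosine_matrix(H)                # all dimensions: exact
cos_2d = cosine_matrix(H[:, :2])           # biplot dimensions: approximate
corr = np.corrcoef(raw, rowvar=False)      # sample correlation matrix
```

Here `cos_full` matches `corr` to rounding error, whereas `cos_2d` is only as accurate as the rank-two approximation of the data.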
Copyright © 2009 by SAS Institute Inc., Cary, NC, USA. All rights reserved.