Problem Formulation

Consider the following scenario. A baseball manager wants to identify and group players on the team who are very similar with respect to several statistics of interest. Note that there is no response variable in this example. The manager simply wants to identify different groups of players. The manager also wants to learn what differentiates players in one group from players in a different group.
The data set for this example is located in SAMPSIO.DMABASE. The following table contains a description of the important variables.
Name
Model Role
Measurement Leve
Description
NAME
ID
Nominal
Player name
TEAM
Rejected
Nominal
Team at the end of 1986
POSITION
Rejected
Nominal
Positions played at the end of 1986
LEAGUE
Rejected
Binary
League at the end of 1986
DIVISION
Rejected
Binary
Division at the end of 1986
NO_ATBAT
Input
Interval
Times at bat in 1986
NO_HITS
Input
Interval
Hits in 1986
NO_HOME
Input
Interval
Home runs in 1986
NO_RUNS
Input
Interval
Runs in 1986
NO_RBI
Input
Interval
RBIs in 1986
NO_BB
Input
Interval
Walks in 1986
YR_MAJOR
Input
Interval
Years in the major leagues
CR_ATBAT
Input
Interval
Career times at bat
CR_HITS
Input
Interval
Career hits
CR_HOME
Input
Interval
Career home runs
CR_RUNS
Input
Interval
Career runs
CR_RBI
Input
Interval
Career RBIs
CR_BB
Input
Interval
Career walks
NO_OUTS
Input
Interval
Put outs in 1986
NO_ASSTS
Input
Interval
Assists in 1986
NO_ERROR
Input
Interval
Errors in 1986
SALARY
Rejected
Interval
1987 salary in thousands
LOGSALAR
Input
Interval
Log of 1987 salary
For this example, you set the model role for TEAM, POSITION, LEAGUE, DIVISION, and SALARY to Rejected. SALARY is rejected because this information is contained in the LOGSALAR variable. No target variables are used in a cluster analysis or self-organizing map (SOM). If you want to identify groups based on a target variable, consider a predictive modeling technique and specify a categorical target variable.