Sample Data Sets
The following sample data sets are included with SAS/INSIGHT software.
The AIR data set contains measurements of pollutant concentrations from a city in Germany during a week in November 1989. Variables are
- DATETIME: date and hour in SAS DATETIME format
- DAY: day of the week
- HOUR: hour of the day
- CO: carbon monoxide concentration
- O3: ozone concentration
- SO2: sulfur dioxide concentration
- NO: nitrogen oxide concentration
- DUST: dust concentration
- WIND: wind speed
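The DATETIME variable holds SAS datetime values, which count seconds from midnight on January 1, 1960. If the AIR data set is exported for use outside SAS, a conversion along the following lines could recover calendar timestamps. This is a minimal Python sketch: the function name and the sample value are illustrative assumptions and are not part of SAS/INSIGHT software.

```python
from datetime import datetime, timedelta

# SAS datetime values count seconds from this epoch.
SAS_EPOCH = datetime(1960, 1, 1)

def sas_datetime_to_python(seconds):
    """Convert a SAS datetime value (seconds since 1960-01-01) to a datetime."""
    return SAS_EPOCH + timedelta(seconds=seconds)

# Hypothetical value, built from a known timestamp so the round trip is visible.
# It is not taken from the AIR data set.
sample = (datetime(1989, 11, 13, 8, 0) - SAS_EPOCH).total_seconds()
print(sas_datetime_to_python(sample))
```

The same arithmetic in reverse, subtracting the epoch and taking the total seconds, yields a value that SAS would read back as a datetime.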
The BASEBALL data set contains performance measures and salary levels for regular hitters and leading substitute hitters in major league baseball for the year 1986 (Collier 1987). There is one observation per hitter. Variables are
- NAME: the player's name
- NO_ATBAT: number of times at bat in 1986
- NO_HITS: number of hits in 1986
- NO_HOME: number of home runs in 1986
- NO_RUNS: number of runs in 1986
- NO_RBI: number of runs batted in in 1986
- NO_BB: number of bases on balls in 1986
- YR_MAJOR: years in the major leagues
- CR_ATBAT: career at bats
- CR_HITS: career hits
- CR_HOME: career home runs
- CR_RUNS: career runs
- CR_RBI: career runs batted in
- CR_BB: career bases on balls
- LEAGUE: player's league at the end of 1986
- DIVISION: player's division at the end of 1986
- TEAM: player's team at the end of 1986
- POSITION: positions played in 1986
- NO_OUTS: number of put outs in 1986
- NO_ASSTS: number of assists in 1986
- NO_ERROR: number of errors in 1986
- SALARY: salary in thousands of dollars
The POSITION variable in the BASEBALL data set is encoded as follows:
| Code | Positions |
|------|-----------|
| 13 | first base, third base |
| 1B | first base |
| 1O | first base, outfield |
| 23 | second base, third base |
| 2B | second base |
| 2S | second base, shortstop |
| 32 | third base, second base |
| 3B | third base |
| 3O | third base, outfield |
| 3S | third base, shortstop |
| C | catcher |
| CD | center field, designated hitter |
| CF | center field |
| CS | center field, shortstop |
| DH | designated hitter |
| DO | designated hitter, outfield |
| LF | left field |
| O1 | outfield, first base |
| OD | outfield, designated hitter |
| OF | outfield |
| OS | outfield, shortstop |
| RF | right field |
| S3 | shortstop, third base |
| SS | shortstop |
| UT | utility |
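
The POSITION encoding listed above can be applied as a simple lookup when the BASEBALL data are processed outside SAS. The following Python sketch copies the codes and meanings from that listing. The dictionary name, the function name, and the choice to return `None` for an unknown code are illustrative assumptions.

```python
# Codes and meanings copied from the POSITION encoding listed above.
POSITION_CODES = {
    "13": "first base, third base",
    "1B": "first base",
    "1O": "first base, outfield",
    "23": "second base, third base",
    "2B": "second base",
    "2S": "second base, shortstop",
    "32": "third base, second base",
    "3B": "third base",
    "3O": "third base, outfield",
    "3S": "third base, shortstop",
    "C": "catcher",
    "CD": "center field, designated hitter",
    "CF": "center field",
    "CS": "center field, shortstop",
    "DH": "designated hitter",
    "DO": "designated hitter, outfield",
    "LF": "left field",
    "O1": "outfield, first base",
    "OD": "outfield, designated hitter",
    "OF": "outfield",
    "OS": "outfield, shortstop",
    "RF": "right field",
    "S3": "shortstop, third base",
    "SS": "shortstop",
    "UT": "utility",
}

def decode_position(code):
    """Return the positions for a POSITION code, or None if the code is unknown."""
    # Tolerate surrounding blanks and lowercase input.
    return POSITION_CODES.get(code.strip().upper())

print(decode_position("2S"))
```

Note that several codes use the letter O for outfield (for example, 1O and 3O), which is easy to confuse with the digit zero when keying values by hand.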
The BUSINESS data set contains information on publicly held German, Japanese, and U.S. companies in the automotive, chemical, electronics, and oil refining industries. There is one observation for each company. Variables are
- NATION: the nationality of the company
- INDUSTRY: the company's principal business
- EMPLOYS: the number of employees
- SALES: sales for 1991 in millions of dollars
- PROFITS: profits for 1991 in millions of dollars
The DRUG data set contains results of an experiment to evaluate drug effectiveness (Afifi and Azen 1972). Four drugs were tested against three diseases on six subjects; there is one observation for each test. Variables are
- DRUG: the drug used in treatment
- DISEASE: the disease present
- CHANG_BP: the change in systolic blood pressure due to treatment
The GPA data set contains data collected to determine which applicants at a large midwestern university were likely to succeed in its computer science program (Campbell and McCabe 1984). There is one observation per student. Variables are
- GPA: the grade point average of students in the computer science program
- HSM: the average high school grade in mathematics
- HSE: the average high school grade in English
- HSS: the average high school grade in science
- SATM: the score on the mathematics portion of the SAT exam
- SATV: the score on the verbal portion of the SAT exam
- SEX: the student's gender
The IRIS data set is Fisher's Iris data (Fisher 1936). Sepal and petal size were measured for fifty specimens from each of three species of iris. There is one observation per specimen. Variables are
- SEPALLEN: sepal length in millimeters
- SEPALWID: sepal width in millimeters
- PETALLEN: petal length in millimeters
- PETALWID: petal width in millimeters
- SPECIES: the species
The MINING data set contains results of an experiment to determine whether drilling time was faster for wet drilling or dry drilling (Penner and Watts 1991). Tests were replicated three times for each method at different test holes. There is one observation per five-foot interval for each replication. Variables are
- DRILTIME: the time in minutes to drill the last five feet of the current depth
- METHOD: the drilling method, wet or dry
- REP: the replicate number
- DEPTH: the depth of the hole in feet
The MININGX data set is a subset of the MINING data set. It contains data from only one of the test holes.
The PATIENT data set contains data collected on cancer patients (Lee 1974). There is one observation per patient. Variables are
- REMISS: 1 if remission occurred and 0 otherwise
- CELL, SMEAR, INFIL, LI, TEMP, BLAST: measures of patient characteristics
The SHIP data set contains data from an investigation of wave damage to cargo ships (McCullagh and Nelder 1989). The purpose of the investigation was to set standards for future hull construction. There is one observation per ship. Variables are
- Y: the number of damage incidents
- YEAR: year of construction
- TYPE: the type of ship
- PERIOD: the period of operation
- MONTHS: the aggregate months of service
Choose Help:Create Samples to create the sample data sets in your sasuser directory. When you have created the sample data sets, turn to the Techniques part of this manual to learn how to enter your data and begin exploring it with SAS/INSIGHT software.
Note: If you have an existing data set in your sasuser library with the same name as a sample data set, it will be overwritten if you create the sample.
Copyright © 2007 by SAS Institute Inc., Cary, NC, USA. All rights reserved.