Getting Started

Sample Data Sets

The following sample data sets are included with SAS/INSIGHT software.

The AIR data set contains measurements of pollutant concentrations from a city in Germany during a week in November 1989. Variables are

DATETIME
date and hour in SAS DATETIME format

DAY
day of the week

HOUR
hour of the day

CO
carbon monoxide concentration

O3
ozone concentration

SO2
sulfur dioxide concentration

NO
nitric oxide concentration

DUST
dust concentration

WIND
wind speed
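
If the AIR data are exported as raw numbers, DATETIME values have to be converted before use outside SAS. A SAS datetime value counts seconds from midnight, January 1, 1960. The following Python sketch shows one way to do the conversion; the function names are hypothetical, and it assumes the values arrive as plain numeric seconds with no time zone.

```python
from datetime import datetime, timedelta

# SAS datetime values count seconds from midnight, January 1, 1960.
SAS_EPOCH = datetime(1960, 1, 1)


def sas_datetime_to_python(seconds):
    """Convert a SAS datetime value (seconds since 1960) to a Python datetime."""
    return SAS_EPOCH + timedelta(seconds=seconds)


def python_to_sas_datetime(moment):
    """Convert a naive Python datetime back to a SAS datetime value."""
    return (moment - SAS_EPOCH).total_seconds()


if __name__ == "__main__":
    # Hypothetical reading taken at 08:00 on November 13, 1989.
    value = python_to_sas_datetime(datetime(1989, 11, 13, 8))
    moment = sas_datetime_to_python(value)
    # The hour component corresponds to the HOUR variable.
    print(moment.isoformat(), moment.hour)
```

The DAY and HOUR variables already carry the day of the week and the hour, so the conversion is needed only when the full timestamp is used.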

The BASEBALL data set contains performance measures and salary levels for regular hitters and leading substitute hitters in major league baseball for the year 1986 (Collier 1987). There is one observation per hitter. Variables are

NAME
the player's name

NO_ATBAT
number of times at bat in 1986

NO_HITS
number of hits in 1986

NO_HOME
number of home runs in 1986

NO_RUNS
number of runs in 1986

NO_RBI
number of runs batted in (RBI) in 1986

NO_BB
number of bases on balls in 1986

YR_MAJOR
years in the major leagues

CR_ATBAT
career at bats

CR_HITS
career hits

CR_HOME
career home runs

CR_RUNS
career runs

CR_RBI
career runs batted in

CR_BB
career bases on balls

LEAGUE
player's league at the end of 1986

DIVISION
player's division at the end of 1986

TEAM
player's team at the end of 1986

POSITION
positions played in 1986

NO_OUTS
number of putouts in 1986

NO_ASSTS
number of assists in 1986

NO_ERROR
number of errors in 1986

SALARY
salary in thousands of dollars

The POSITION variable in the BASEBALL data set is encoded as follows:

Code  Positions
13    first base, third base
1B    first base
1O    first base, outfield
23    second base, third base
2B    second base
2S    second base, shortstop
32    third base, second base
3B    third base
3O    third base, outfield
3S    third base, shortstop
C     catcher
CD    center field, designated hitter
CF    center field
CS    center field, shortstop
DH    designated hitter
DO    designated hitter, outfield
LF    left field
O1    outfield, first base
OD    outfield, designated hitter
OF    outfield
OS    outfield, shortstop
RF    right field
S3    shortstop, third base
SS    shortstop
UT    utility
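
The POSITION encoding above can be represented as a lookup table when the data are processed outside SAS/INSIGHT software. The following Python sketch is illustrative only: the names POSITION_CODES and decode_position are hypothetical, and it assumes POSITION values are character strings that may carry leading or trailing blanks.

```python
# Lookup table transcribed from the POSITION encoding listed above.
POSITION_CODES = {
    "13": "first base, third base",
    "1B": "first base",
    "1O": "first base, outfield",
    "23": "second base, third base",
    "2B": "second base",
    "2S": "second base, shortstop",
    "32": "third base, second base",
    "3B": "third base",
    "3O": "third base, outfield",
    "3S": "third base, shortstop",
    "C": "catcher",
    "CD": "center field, designated hitter",
    "CF": "center field",
    "CS": "center field, shortstop",
    "DH": "designated hitter",
    "DO": "designated hitter, outfield",
    "LF": "left field",
    "O1": "outfield, first base",
    "OD": "outfield, designated hitter",
    "OF": "outfield",
    "OS": "outfield, shortstop",
    "RF": "right field",
    "S3": "shortstop, third base",
    "SS": "shortstop",
    "UT": "utility",
}


def decode_position(code):
    """Return the list of positions for a POSITION code.

    Raises KeyError for a code that is not in the table.
    """
    return POSITION_CODES[code.strip().upper()].split(", ")


if __name__ == "__main__":
    for code in ("1B", "OD", "UT"):
        print(code, decode_position(code))
```

Returning a list keeps multi-position codes such as OD easy to test against a single position.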


The BUSINESS data set contains information on publicly held German, Japanese, and U.S. companies in the automotive, chemical, electronics, and oil refining industries. There is one observation for each company. Variables are

NATION
the nationality of the company

INDUSTRY
the company's principal business

EMPLOYS
the number of employees

SALES
sales for 1991 in millions of dollars

PROFITS
profits for 1991 in millions of dollars

The DRUG data set contains results of an experiment to evaluate drug effectiveness (Afifi and Azen 1972). Four drugs were tested against three diseases on six subjects; there is one observation for each test. Variables are

DRUG
the drug used in treatment

DISEASE
the disease present

CHANG_BP
the change in systolic blood pressure due to treatment

The GPA data set contains data collected to determine which applicants at a large midwestern university were likely to succeed in its computer science program (Campbell and McCabe 1984). There is one observation per student. Variables are

GPA
the grade point average of students in the computer science program

HSM
the average high school grade in mathematics

HSE
the average high school grade in English

HSS
the average high school grade in science

SATM
the score on the mathematics portion of the SAT exam

SATV
the score on the verbal portion of the SAT exam

SEX
the student's gender

The IRIS data set is Fisher's Iris data (Fisher 1936). Sepal and petal size were measured for fifty specimens from each of three species of iris. There is one observation per specimen. Variables are

SEPALLEN
sepal length in millimeters

SEPALWID
sepal width in millimeters

PETALLEN
petal length in millimeters

PETALWID
petal width in millimeters

SPECIES
the species

The MINING data set contains results of an experiment to determine whether wet drilling or dry drilling was faster (Penner and Watts 1991). Tests were replicated three times for each method at different test holes. There is one observation per five-foot interval for each replication. Variables are

DRILTIME
the time in minutes to drill the last five feet of the current depth

METHOD
the drilling method, wet or dry

REP
the replicate number

DEPTH
the depth of the hole in feet

The MININGX data set is a subset of the MINING data set. It contains data from only one of the test holes.

The PATIENT data set contains data collected on cancer patients (Lee 1974). There is one observation per patient. Variables are

REMISS
1 if remission occurred and 0 otherwise

CELL, SMEAR, INFIL, LI, TEMP, BLAST
measures of patient characteristics

The SHIP data set contains data from an investigation of wave damage to cargo ships (McCullagh and Nelder 1989). The purpose of the investigation was to set standards for future hull construction. There is one observation per ship. Variables are

Y
the number of damage incidents

YEAR
year of construction

TYPE
the type of ship

PERIOD
the period of operation

MONTHS
the aggregate months of service

Choose Help:Create Samples to create the sample data sets in your sasuser directory. When you have created the sample data sets, turn to the Techniques part of this manual to learn how to enter your data and begin exploring it with SAS/INSIGHT software.


Note
Creating a sample data set overwrites any existing data set of the same name in your sasuser library.


Copyright © 2007 by SAS Institute Inc., Cary, NC, USA. All rights reserved.