Data Cleaning Techniques
Duration: 2.0 days CEU: 1.2
Presented by Ron Cody, author of Cody's Data Cleaning Techniques Using SAS Software
In this course, designed for SAS programmers, students learn techniques for finding errors in raw data or SAS data sets. These techniques involve using DATA step programming and other SAS procedures.
Learn how to
- check character and numeric variables for invalid values
- work with and manipulate dates
- compare two data sets with and without an ID variable.
Who should attend
SAS programmers, analysts, and researchers
Prerequisites
Before attending this course, you should have knowledge and experience at the level of the
SAS Programming I: Essentials course. However, some of the programs discussed in the course use more advanced techniques such as FIRST. and LAST. temporary variables and macro variables. Any SAS programming techniques that are beyond the
SAS Programming I: Essentials level are explained thoroughly in the course.
Course Contents
Introduction
- defining data cleaning
- sample data sets used in course
Checking Values of Character Variables
- using the FREQ procedure to list values
- using a DATA step to check for invalid values
- using the PRINT procedure with a WHERE statement to list invalid values
- using SAS formats to check for invalid values
- using SAS informats to check for invalid values
Checking Values of Numeric Variables
- using the MEANS, TABULATE, and UNIVARIATE procedures to look for outliers
- using ODS SELECT with PROC UNIVARIATE
- using the PRINT procedure with a WHERE statement to list invalid data values
- using a DATA step to check for invalid values
- creating a macro for range checking
- using SAS formats to check for invalid values
- using SAS informats to check for invalid values
- using PROC UNIVARIATE to look for highest and lowest values by percentage
- using the RANK procedure to look for highest and lowest values by percentage
- using PROC RANK to look for the n highest and lowest values
- producing trimmed statistics
- checking a range using an algorithm based on standard deviation
- checking a range based on the interquartile range
- checking ranges for several variables
Checking for Missing Values
- inspecting the SAS log
- using the MEANS and FREQ procedures to count missing values
- using DATA step approaches to identify and count missing values
- using the TABULATE procedure to count missing and nonmissing values for numeric variables
- using PROC TABULATE to count missing and nonmissing values for character variables
- creating a general-purpose program to count missing and nonmissing values for both numeric and character variables
- searching for a specific numeric value (such as 999)
Working with Dates
- checking ranges for dates (using a DATA step)
- checking ranges for dates (using the PRINT procedure)
- checking for invalid dates
- working with dates in nonstandard form
- creating a SAS date when the day of the month is missing
- suspending error checking for known invalid dates
Looking for Duplicates and
n Observations per Subject
- eliminating duplicates using the SORT procedure
- detecting duplicates using DATA step approaches
- selecting patients with duplicate observations using a macro list and SQL
- using PROC FREQ to detect duplicate IDs
- identifying subjects with n observations each (using a DATA step)
- identifying subjects with n observations each (using the FREQ procedure)
Working with Multiple Files
- checking for an ID in each of two files
- checking for an ID in each of n files
- using a simple macro to check IDs in multiple files
- using a more complicated multi-file macro for ID checking
- using more complicated multi-file rules
- checking that the dates are in the proper order
Double Entry and Verification (The COMPARE Procedure)
- conducting a simple comparison of two data sets without an ID variable
- using the COMPARE procedure with an ID variable
- using PROC COMPARE with two data sets of unequal number of observations
- comparing two data sets when some variables are not in both data sets
Some SQL Solutions to Data Cleaning
- a quick review of the SQL procedure
- checking for invalid character values
- checking for outliers
- checking a range using an algorithm based on the standard deviation
- checking for missing values
- range-checking for dates
- detecting duplicates
- identifying subjects with n observations each
- checking for an ID in each of two files
- using more complicated multi-file rules
Correcting Errors
- making simple corrections
- keeping track of the changes
- introducing integrity constraints and audit trails
Software Addressed
This course addresses the following software product(s): Base SAS, SAS/STAT.
Course Materials
Students receive a hardcopy of the course notes and, in some courses, can choose to take home a copy of the course data.
U.S. Schedule
19FEB2009 Irvine, CA
| 02APR2009 Chicago, IL
| 15JUN2009 Rockville, MD
|
Check for additional and updated schedule information online at
support.sas.com/courses/bdct.html.
Registration
To register for this course in the US, call 800-333-7660 or visit
support.sas.com/training.
This course is also available for on-site training, or you can create a custom course by combining material from several courses. For more details, contact SAS Education in Cary, NC at 919-531-7321 or send e-mail to
training@sas.com.