Data Cleaning Techniques
Business Knowledge Series course
Duration: 2.0 days
Course fee: $1,150
EPTO units: 2.2
CEUs: 1.2
| |
|
 |
|
Presented by Ron Cody, author of
Cody's Data Cleaning Techniques Using SAS Software
In this course, designed for SAS programmers, students learn techniques for finding errors in raw data or SAS data sets. These techniques involve using DATA step programming and other SAS procedures.
Learn how to
- check character and numeric variables for invalid values
- work with and manipulate dates
- compare two data sets with and without an ID variable.
Who should attend
SAS programmers, analysts, and researchers
Expand All
Collapse All
Print version
Prerequisites
Before attending this course, you should have knowledge and experience at the level of the
SAS Programming I: Essentials course. However, some of the programs discussed in the course use more advanced techniques such as FIRST. and LAST. temporary variables and macro variables. Any SAS programming techniques that are beyond the
SAS Programming I: Essentials level are explained thoroughly in the course.
Course Contents
Introduction
- defining data cleaning
- sample data sets used in course
Checking Values of Character Variables
- using the FREQ procedure to list values
- using a DATA step to check for invalid values
- using the PRINT procedure with a WHERE statement to list invalid values
- using SAS formats to check for invalid values
- using SAS informats to check for invalid values
Checking Values of Numeric Variables
- using the MEANS, TABULATE, and UNIVARIATE procedures to look for outliers
- using ODS SELECT with PROC UNIVARIATE
- using the PRINT procedure with a WHERE statement to list invalid data values
- using a DATA step to check for invalid values
- creating a macro for range checking
- using SAS formats to check for invalid values
- using SAS informats to check for invalid values
- using PROC UNIVARIATE to look for highest and lowest values by percentage
- using the RANK procedure to look for highest and lowest values by percentage
- using PROC RANK to look for the n highest and lowest values
- producing trimmed statistics
- checking a range using an algorithm based on standard deviation
- checking a range based on the interquartile range
- checking ranges for several variables
Checking for Missing Values
- inspecting the SAS log
- using the MEANS and FREQ procedures to count missing values
- using DATA step approaches to identify and count missing values
- using the TABULATE procedure to count missing and nonmissing values for numeric variables
- using PROC TABULATE to count missing and nonmissing values for character variables
- creating a general-purpose program to count missing and nonmissing values for both numeric and character variables
- searching for a specific numeric value (such as 999)
Working with Dates
- checking ranges for dates (using a DATA step)
- checking ranges for dates (using the PRINT procedure)
- checking for invalid dates
- working with dates in nonstandard form
- creating a SAS date when the day of the month is missing
- suspending error checking for known invalid dates
Looking for Duplicates and
n Observations per Subject
- eliminating duplicates using the SORT procedure
- detecting duplicates using DATA step approaches
- selecting patients with duplicate observations using a macro list and SQL
- using PROC FREQ to detect duplicate IDs
- identifying subjects with n observations each (using a DATA step)
- identifying subjects with n observations each (using the FREQ procedure)
Working with Multiple Files
- checking for an ID in each of two files
- checking for an ID in each of n files
- using a simple macro to check IDs in multiple files
- using a more complicated multi-file macro for ID checking
- using more complicated multi-file rules
- checking that the dates are in the proper order
Double Entry and Verification (The COMPARE Procedure)
- conducting a simple comparison of two data sets without an ID variable
- using the COMPARE procedure with an ID variable
- using PROC COMPARE with two data sets of unequal number of observations
- comparing two data sets when some variables are not in both data sets
Some SQL Solutions to Data Cleaning
- a quick review of the SQL procedure
- checking for invalid character values
- checking for outliers
- checking a range using an algorithm based on the standard deviation
- checking for missing values
- range-checking for dates
- detecting duplicates
- identifying subjects with n observations each
- checking for an ID in each of two files
- using more complicated multi-file rules
Correcting Errors
- making simple corrections
- keeping track of the changes
- introducing integrity constraints and audit trails
Software
This course addresses Base SAS, SAS/STAT.
Course Materials
Students receive a hardcopy of the course notes and, in some courses, can choose to take home a copy of the course data.
You also receive a copy of Cody's Data Cleaning Techniques Using SAS(R) Software.
Share Your Thoughts
Are there additional topics you'd like for this course to address?
Would you like for this course to be offered at another training facility?
Let us know by adding to our
Interest List.
This page was created using SAS software.
|