SAS Institute CanadaSAS Institute Canada
We deliver superior software and services that give people the power to make the right decisions. News & Press Contact Information sas.com Profile Jobs at SAS Canada
Home Products & Solutions Customers SAS Canada Events Services & Support www.sas.com

Data Cleaning Techniques

Business Knowledge Series course

Duration: 2.0 days
Course fee: $1,940
EPTO units: 3
CEUs: 1.2
Register now

Presented by Ron Cody, author of Cody's Data Cleaning Techniques Using SAS Software

In this course, designed for SAS programmers, students learn techniques for finding errors in raw data or SAS data sets. These techniques involve using DATA step programming and other SAS procedures.

Learn how to

Who should attend

SAS programmers, analysts, and researchers

Expand/Collapse AllExpand All      Expand/Collapse AllCollapse All      PrintPrint version

Prerequisites
Before attending this course, you should have knowledge and experience at the level of the SAS Programming I: Essentials course. However, some of the programs discussed in the course use more advanced techniques such as FIRST. and LAST. temporary variables and macro variables. Any SAS programming techniques that are beyond the SAS Programming I: Essentials level are explained thoroughly in the course.
Course Contents
Introduction
  • defining data cleaning
  • sample data sets used in course
Checking Values of Character Variables
  • using the FREQ procedure to list values
  • using a DATA step to check for invalid values
  • using the PRINT procedure with a WHERE statement to list invalid values
  • using SAS formats to check for invalid values
  • using SAS informats to check for invalid values
Checking Values of Numeric Variables
  • using the MEANS, TABULATE, and UNIVARIATE procedures to look for outliers
  • using PROC PRINT with a WHERE statement to list invalid data values
  • using a DATA step to check for invalid values
  • creating a macro for range checking
  • using SAS formats to check for invalid values
  • using SAS informats to check for invalid values
  • using PROC UNIVARIATE to look for highest and lowest values by percentage
  • using the RANK procedure to look for highest and lowest values by percentage
  • using PROC RANK to look for the n highest and lowest values
  • finding another way to determine highest and lowest values
  • checking a range using an algorithm based on standard deviation
  • checking a range based on the interquartile range
  • checking ranges for several variables
Checking for Missing Values
  • inspecting the SAS log
  • using PROC MEANS and PROC FREQ to count missing values
  • using DATA step approaches to identify and count missing values
  • using PROC TABULATE to count missing and nonmissing values for numeric variables
  • using PROC TABULATE to count missing and nonmissing values for character variables
  • creating a general-purpose program to count missing and nonmissing values for both numeric and character variables
  • searching for a specific numeric value (such as 999)
Working with Dates
  • checking ranges for dates (using a DATA step)
  • checking ranges for dates (using PROC PRINT)
  • checking for invalid dates
  • working with dates in non-standard form
  • creating a SAS date when the day of the month is missing
  • suspending error checking for known invalid dates
Looking for Duplicates and n Observations per Subject
  • eliminating duplicates using the SORT procedure
  • detecting duplicates using DATA step approaches
  • selecting patients with duplicate observations using a macro list and SQL
  • using PROC FREQ to detect duplicate IDs
  • identifying subjects with n observations each (using a DATA step)
  • identifying subjects with n observations each (using PROC FREQ)
Working with Multiple Files
  • checking for an ID in each of two files
  • checking for an ID in each of n files
  • using a simple macro to check IDs in multiple files
  • using a more complicated multi-file macro for ID checking
  • using more complicated multi-file rules
  • checking that the dates are in the proper order
Double Entry and Verification (The COMPARE Procedure)
  • conducting a simple comparison of two data sets without an ID variable
  • using PROC COMPARE with an ID variable
  • using PROC COMPARE with two data sets of unequal number of observations
  • comparing two data sets when some variables are not in both data sets
Some SQL Solutions to Data Cleaning
  • a quick review of PROC SQL
  • checking for invalid character values
  • checking for outliers
  • checking a range using an algorithm based on the standard deviation
  • checking for missing values
  • range-checking for dates
  • detecting duplicates
  • identifying subjects with n observations each
  • checking for an ID in each of two files
  • using more complicated multi-file rules
Correcting Errors
  • making simple corrections
  • introducing integrity constraints and audit trails
Software
This course addresses Base SAS, SAS/STAT.
Course Materials
Students receive a hardcopy of the course notes and, in some courses, can choose to take home a copy of the course data. You also receive a copy of Cody's Data Cleaning Techniques Using SAS Software.
Register now

Share Your Thoughts

Are there additional topics you'd like for this course to address? Would you like for this course to be offered at another training facility? Let us know by adding to our Interest List.

Course fee and EPTO units will differ for on-site training.

[SAS Institute Inc.]
This page was created using SAS software.
The Power to Know
   Contact Us     Search     Terms of Use & Legal Information     Privacy Statement   Copyright © 2007 SAS Institute Inc. All Rights Reserved