The Data Detective’s Toolkit: Cutting-Edge Techniques and SAS® Macros to Clean, Prepare, and Manage Data

In 1959, British scientist and novelist C.P. Snow published his influential Rede lecture in book form with the title, The Two Cultures and the Scientific Revolution. Its basic idea was that science and the humanities had become separated. “Renaissance men,” like Leonardo da Vinci, belonged to the past. Da Vinci, renowned as an artistic genius, was also a scientist and inventor. He was an expert in the fields of anatomy, zoology, botany, geology, optics, aerodynamics, and hydrodynamics.

Similar divisions can also be found in other things we humans do, like data processing. Taking SAS as an example, if Rip Van Winkle had gone to sleep after attending an early 1990s SAS Global Forum and awakened today, he would find that “big data,” AI, ML, and “analytics” are everywhere and code seems to have left the stage.

He would be happy to discover that SAS Press has published Kim Chantala’s The Data Detective’s Toolkit, a book that is laden with traditional SAS code.

Why has this book been published? Most probably because it recognizes this fact of life: 80% of data analysis is data management.

Rip Van Winkle would enjoy seeing how traditional Base SAS procedures, such as PROC FREQ, are still bread-and-butter techniques in the era of AI and the cloud. The Data Detective’s Toolkit makes good use of them.

Chantala walks us through the stages of identifying and compiling a record of data quality errors. She shows how this high-level documentation can be automated through the use of macros. The benefit of this technique is that well-constructed macros can accommodate differences in input data, so dissimilarities are not showstoppers. Cleaning, preparing, and managing data from Stage 3 (Acquire Data) all the way to Stage 9 (Archive Project) are under control.

What kinds of errors do the macros handle? Inconsistent values, duplicate values, and skipped values are the main focus of the detective work. These are the “crimes” discovered in the data, and the book shows how to detect them. If this book goes into a second edition, it might be a good idea to start at the very beginning with the crime scenes and then relate the forensic techniques to them. Start with a real problem and work backward from it to select skills and tools needed to fix it.

This approach only comes into play a bit later in the book. Similarly, a definition of “crosswalk” ought to appear in the introductory portion of the book, not in Chapter 5. Terminology should be explained up front.

Another suggestion: A code technique that deserves to be mentioned is the use of SQL Dictionary Tables that can easily surface anomalies in the descriptive portions of data. This would be another good addition to a future edition.

Overall, The Data Detective’s Toolkit is very well researched and well written. It’s obviously a labor of love and pays attention to many details. A lot of hard work went into it.

Good job, Kim Chantala and SAS Press!

Jim Sattler
Satmari Software Systems
Manila, Philippines


How comfortable are you when receiving new data? Do you have the right tools and enough time to get to know the data, including any data holes? What steps are required to clean, prepare, and manage your data?

To answer these key questions, you need a collection of macros dedicated to automatically creating dataset codebooks containing essential metadata and the full scope of all variables. In addition, you need data crosswalks to confirm consistency across dataset metadata and codelists. Why reinvent the wheel when the SAS macros in The Data Detective’s Toolkit are easy to apply and customize for a professional approach to better understanding and processing new data?

Sunil Gupta
Founder of SASSavvy.com


The Data Detective’s Toolkit can best be described as a self-help relationship guide for data lovers. This description might seem a bit strange, but please stick with me as I explain why this book is worth your time. 

The most meaningful relationships with either people or data require a lot of work. I have found the more I invest in the relationship, the more fruit it bears. Conversely, if I am lazy, impatient, or self-focused on personal desires, my relationships falter. 

Yet I am sometimes tempted to be lazy, impatient, and self-focused with my data. What gives? I have a simple explanation: putting in the work required for a successful relationship with data can be difficult and take a lot of time. If instead of nurturing your relationship with data, you focus on what you want out of it, your relationship with it will be difficult.

I don't think I am the only one who struggles with this problem. Let's be real, whether we call ourselves Data Scientists, Statisticians, or Analysts, when we say that we "love data", what we really mean is we love analyzing it.

The Data Detective's Toolkit helps take some of the sting out of the challenging parts of cultivating our relationship with data so it becomes ready for analysis. The macros provided in this book make this easier and can be adapted to a variety of projects with little effort. From codebook creation, to skip pattern analysis, to data comparison tools, these macros will assist you in intimately understanding your data, warts and all.

Perhaps more important than the macros themselves, this book provides a solid framework for managing data throughout its lifecycle. Moreover, it encourages habit-forming behaviors that go a long way toward ensuring a healthy relationship with your data.

I have worked as a Statistician on a variety of projects. Some I would describe as successful, others not. And while I hesitate to apply a p-value to this statement, in my experience, the fruit a project produces is highly correlated with the care taken when managing the data.

So even if you are not a SAS programmer, I encourage you to read The Data Detective’s Toolkit and use the SAS macros and techniques in preparing your data, as it teaches you how to love data, in the most holistic sense of that phrase.

Kevin Adams
Statistician


This book caught me by surprise. As an avid SAS user for over 10 years, I was fairly certain I knew most of the tricks that were relevant to my day-to-day job; I was wrong, and this book is proof of that. At just shy of 200 pages, this book is absolutely packed full of useful information for any user of SAS, whether you’re a novice or an expert. Chantala has created a series of macros that allow SAS to provide the user with a huge variety of extraordinarily useful information, from creating codebooks to analyzing data that has skip patterns (also known as conditional logic). I tried taking notes while I was reading the book but ended up just copying out the majority of the page; same went for highlighting. Finally, I gave up and just read the book, and I have come away with so many ideas that I’ll be spending weeks implementing and testing all of them. I highly recommend this book, but come at it as a journey – read it cover-to-cover once to get a sense of it all, and then go back and pick and choose what you want to explore. You’ll be going back to this book over and over again, I can guarantee it.

Chris Battiston
Research Data Analyst
Women’s College Hospital