The Waze application, acquired by Google in 2013, alerts millions of users to traffic congestion, collisions, construction, and other complexities of the road that can stymie motorists' attempts to get from A to B. From jackknifed rigs to jackalope carcasses, roads can be gnarled by gridlock or littered with obstacles that impede traffic flow and efficiency. Waze algorithms automatically reroute users to more efficient routes based on user-reported events as well as on historical norms that demonstrate typical road conditions. Extract, transform, load (ETL) infrastructures often represent serialized process flows that resemble highways and that can become similarly snarled by locked data sets, slow processes, and other factors that introduce inefficiency. The LOCKITDOWN SAS® macro, introduced at the Western Users of SAS® Software Conference 2014, detects and prevents data access collisions that occur when two or more SAS processes or users simultaneously attempt to access the same SAS data set. Moreover, the LOCKANDTRACK macro, introduced at the conference in 2015, provides real-time tracking of, and historical performance metrics for, locked data sets through a unified control table, enabling developers to hone processes to optimize efficiency and data throughput. This paper demonstrates the use of LOCKANDTRACK and its lock performance metrics to create data-driven, fuzzy logic algorithms that preemptively reroute program flow around inaccessible data sets. Thus, rather than waiting needlessly for a data set to become available or for a process to complete, the software anticipates the wait time based on historical norms, performs other (independent) functions, and returns to the original process when the data set becomes available.
Troy Hughes, Datmesis Analytics
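The rerouting idea described above can be sketched in a few lines of Base SAS. This is only an illustration, not the LOCKANDTRACK implementation: the macro and data set names are hypothetical, and the sketch simply branches on &SYSLCKRC after a LOCK statement rather than consulting a control table of historical wait times.

```sas
/* Hypothetical sketch: test exclusive access with the LOCK statement,
   then branch on &SYSLCKRC (0 = lock obtained). Macro names are
   illustrative placeholders, not the paper's actual code. */
%macro reroute_if_locked(dsn=, alt_task=);
   lock &dsn;
   %if &syslckrc = 0 %then %do;
      /* Data set is available: run the dependent process, then release */
      %dependent_process(&dsn)    /* hypothetical downstream macro */
      lock &dsn clear;
   %end;
   %else %do;
      /* Data set is locked: do independent work instead of waiting */
      %&alt_task
   %end;
%mend reroute_if_locked;
```

A fuller, fuzzy logic version would compare the expected wait (from historical metrics) against the cost of rerouting before choosing a branch.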
Working with big data is often time consuming and challenging. The primary goal in programming is to maximize throughput while minimizing the use of computer processing time, real time, and programmers' time. By using the multiprocessing (MP) CONNECT method on a symmetric multiprocessing (SMP) computer, a programmer can divide a job into independent tasks and execute the tasks as threads in parallel on several processors. This paper demonstrates the development and application of a parallel processing program on a large amount of health-care data.
Shuhua Liang, Kaiser Permanente
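The MP CONNECT pattern described above can be sketched as follows. This is a generic illustration under assumed defaults, not the paper's program: the task names and the work performed in each task are placeholders.

```sas
/* Illustrative MP CONNECT pattern: spawn two independent tasks on the
   same SMP machine and wait for both to finish. */
options sascmd="!sascmd";            /* lets SIGNON spawn local sessions */
signon task1;
signon task2;

rsubmit task1 wait=no;               /* runs asynchronously */
   proc sort data=sashelp.class out=work.c1; by age; run;  /* placeholder work */
endrsubmit;

rsubmit task2 wait=no;               /* runs in parallel with task1 */
   proc means data=sashelp.cars; run;                      /* placeholder work */
endrsubmit;

waitfor _all_ task1 task2;           /* block until both tasks complete */
signoff _all_;
```

Note that each spawned session has its own WORK library, so results must be written to a shared library (or retrieved with PROC DOWNLOAD or the INHERITLIB= option) to be combined afterward.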
This paper introduces a SAS® macro that helps users find and access their folders and files with ease. Given a path and the names of the folders and files of interest under that path, the macro creates an HTML report that lists the matching folders and files. The report also contains a hyperlink for each folder and file, so that clicking a hyperlink directly opens the corresponding folder or file. Users can also ask the macro to find certain folders or files by providing part of a folder or file name as the search criterion. The results in the report can be sorted in different ways to further help users quickly find and access their folders and files.
Ting Sa, Cincinnati Children's Hospital Medical Center
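The core idea of such a report can be sketched with the directory I/O functions and a DATA _NULL_ step that writes the HTML directly. This is a minimal sketch, not the paper's macro: the path is a placeholder, there is no search filter or sorting, and a real implementation would handle subfolders and escape special characters.

```sas
/* Minimal sketch: list one directory's contents as an HTML page in
   which each entry is a clickable file:// hyperlink. */
%let path=C:\myfolder;               /* hypothetical path */

data _null_;
   length fref $8 name $256 line $1024;
   file "filelist.html";
   put '<html><body><ul>';
   fref = "d";
   rc  = filename(fref, "&path");    /* assign fileref to the directory */
   did = dopen(fref);                /* open the directory */
   do i = 1 to dnum(did);            /* loop over its members */
      name = dread(did, i);
      line = cats('<li><a href="file:///', "&path", '\', name, '">',
                  name, '</a></li>');
      put line;                      /* list output trims trailing blanks */
   end;
   rc = dclose(did);
   put '</ul></body></html>';
run;
```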
When I help users design or debug their SAS® programs, they are sometimes unable to provide relevant SAS data sets because they contain confidential information. Sometimes, confidential data values are intrinsic to their problem, but often the problem could still be identified or resolved with innocuous data values that preserve some of the structure of the confidential data. Or the confidential values are in variables that are unrelated to the problem. While techniques for masking or disguising data exist, they are often complex or proprietary. In this paper, I describe a very simple macro, REVALUE, that can change the values in a SAS data set. REVALUE preserves some of the structure of the original data by ensuring that for a given variable, observations with the same real value have the same replacement value, and if possible, observations with a different real value have a different replacement value. REVALUE enables the user to specify which variables to change and whether to order the replacement values for each variable by the sort order of the real values or by observation order. I discuss the REVALUE macro in detail and provide a copy of the macro.
Bruce Gilsen, Federal Reserve Board
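The consistency property described above (same real value, same replacement; different real values, different replacements where possible) can be sketched for a single character variable with a hash object. This is a simplified illustration of the idea, not the REVALUE macro itself; the data set and replacement scheme are placeholders.

```sas
/* Sketch: replace each distinct value of NAME with a generated token,
   reusing the same token whenever the same real value recurs. */
data masked;
   if _n_ = 1 then do;
      declare hash map();
      map.definekey('name');         /* real value is the lookup key  */
      map.definedata('newval');      /* its assigned replacement      */
      map.definedone();
      length newval $8;
   end;
   set sashelp.class;
   if map.find() ne 0 then do;       /* first time this value is seen */
      n + 1;                         /* sum statement: N is retained  */
      newval = cats('V', n);         /* distinct values stay distinct */
      map.add();
   end;
   name = newval;                    /* overwrite the confidential value */
   drop newval n;
run;
```

Ordering replacements by the sort order of the real values, as REVALUE optionally does, would require a preliminary pass over the distinct values before assigning tokens.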
Big data is often characterized by high volume, high velocity, and high variety. While big data can signal big business intelligence and big business value, it can also wreak havoc on systems and software ill-prepared for its magnitude. Scalability describes the ability of a system or software to adequately meet the needs of additional users or its ability to use additional processors or resources to fulfill those added requirements. Scalability also describes the adequate and efficient response of a system to increased data throughput. Because sorting data is one of the most common and most resource-intensive operations in any software language, inefficiencies or failures caused by big data are often first observed during sorting routines. Much SAS® literature has been dedicated to optimizing big data sorts for efficiency, including minimizing execution time and, to a lesser extent, minimizing resource usage (that is, memory and storage consumption). However, less attention has been paid to implementing big data sorting that is reliable and robust even when confronted with resource limitations. To that end, this paper introduces the SAFESORT macro, which facilitates a priori exception-handling routines (which detect environmental and data set attributes that could cause process failure) and post hoc exception-handling routines (which detect actual failed sorting routines). If exception handling is triggered, SAFESORT automatically reroutes program flow from the default sort routine to a less resource-intensive routine, thus sacrificing execution speed for reliability. Moreover, macro modularity enables developers to select their favorite sort procedure and, for data-driven disciples, to build fuzzy logic routines that dynamically select a sort algorithm based on environmental and data set attributes.
Troy Hughes, Datmesis Analytics
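The post hoc branch of this pattern can be sketched as a macro that checks &SYSERR after a default sort and falls back to a less resource-hungry routine. This is an illustration, not the SAFESORT macro: the fallback here is simply the TAGSORT option, which trades execution speed for a much smaller utility-file footprint, and a real implementation would also perform the a priori checks described above.

```sas
/* Sketch: attempt a default sort; on step failure, retry with TAGSORT. */
%macro safesort(data=, by=);
   proc sort data=&data;
      by &by;
   run;
   %if &syserr > 4 %then %do;        /* step failed (not a mere warning) */
      %put NOTE: Default sort failed. Retrying with TAGSORT.;
      proc sort data=&data tagsort;  /* slower, but far less workspace */
         by &by;
      run;
   %end;
%mend safesort;

%safesort(data=work.bigtable, by=id)   /* hypothetical invocation */
```

One caveat: after a hard failure in batch mode, SAS may enter syntax-check mode, so a production version must also manage the OBS= and SYNTAXCHECK settings before retrying.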
By default, the SAS® hash object permits only entries whose keys, defined in its key portion, are unique. While in certain programming applications this is a useful feature, there are others for which the ability to insert and manipulate entries with duplicate keys is imperative. Such an ability, available since SAS® 9.2, was a welcome development: it vastly expanded the functionality of the hash object and eliminated the need to work around the distinct-key limitation with custom code. However, nothing comes without a price, and the ability of the hash object to store duplicate-key entries is no exception. In particular, additional hash object methods had to be developed, and were, to handle specific entries sharing the same key. The extra price is that using these methods is not quite as straightforward as the corresponding operations on distinct-key tables, and the documentation alone offers rather little help in making them work in practice. Fairly extensive experimentation and investigative coding are necessary to make that happen. This paper is the result of such an endeavor, and hopefully it will save those who delve into the topic a good deal of time and frustration.
Paul Dorfman, Dorfman Consulting
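The basic duplicate-key machinery the abstract refers to can be sketched as follows: the MULTIDATA:'Y' argument tag allows same-key entries, and the FIND/FIND_NEXT method pair walks through them. The values here are illustrative only.

```sas
/* Sketch: store duplicate-key entries and retrieve the whole key group. */
data _null_;
   declare hash h(multidata:'y');   /* permit duplicate keys */
   h.definekey('k');
   h.definedata('k', 'v');
   h.definedone();

   k = 1; v = 10; h.add();
   k = 1; v = 20; h.add();          /* same key: second entry accepted */
   k = 2; v = 30; h.add();

   /* iterate over all entries sharing key k=1 */
   k  = 1;
   rc = h.find();                   /* first entry for the current key */
   do while (rc = 0);
      put v=;                       /* process this same-key entry */
      rc = h.find_next();           /* next entry with the same key, if any */
   end;
run;
```

Related methods such as HAS_NEXT, REMOVEDUP, and REPLACEDUP follow the same current-entry logic, which is where most of the experimentation the paper describes comes in.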