In the clinical research world, data accuracy plays a significant role in delivering quality results. Various validation methods are available to confirm data accuracy. Of these, double programming is the most highly recommended and commonly used method to demonstrate a perfect match between production and validation output. PROC COMPARE is one of the SAS® procedures used to compare two data sets and confirm the accuracy of the match. In current practice, whenever a program rerun happens, the programmer must manually review the output file to ensure an exact match. This is tedious, time-consuming, and error prone because many LST files must be reviewed and manual intervention is required. The proposed approach programmatically validates the output of PROC COMPARE in all the programs and generates an HTML output file with a Pass/Fail status flag for each output file, along with the following: 1. Data set name, label, and number of variables and observations 2. Number of observations not in both compared and base data sets 3. Number of variables not in both compared and base data sets The status is flagged as Pass whenever the output file meets the following 3N criteria: 1. NOTE: No unequal values were found. All values compared are exactly equal 2. The numbers of observations in the base and compared data sets are equal 3. The numbers of variables in the base and compared data sets are equal The SAS® macro %threeN efficiently validates all the output files generated by PROC COMPARE in a short time and expedites validation without sacrificing accuracy. It reduces the time spent on manual review of the PROC COMPARE output files by up to 90%.
Amarnath Vijayarangan, Emmes Services Pvt Ltd, India
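The 3N decision itself reduces to three boolean checks once the NOTE text and the observation and variable counts have been extracted from each PROC COMPARE listing. A minimal Python sketch of that final check (the function name and parsed inputs are illustrative; the %threeN macro itself is SAS):

```python
# Hypothetical sketch of the 3N Pass/Fail decision described above.
# Inputs are assumed to have been parsed out of a PROC COMPARE listing:
# whether the "No unequal values were found" NOTE appeared, plus the
# observation and variable counts of the base and compared data sets.
def three_n_status(note_found, base_obs, comp_obs, base_vars, comp_vars):
    criteria = [
        note_found,              # 1: all compared values exactly equal
        base_obs == comp_obs,    # 2: observation counts equal
        base_vars == comp_vars,  # 3: variable counts equal
    ]
    return "Pass" if all(criteria) else "Fail"
```

An HTML report row would then carry this flag alongside the data set name, label, and counts.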
There are currently thousands of Jamaican citizens who lack access to basic health care. In order to improve the health-care system, I collect and analyze data from two clinics in remote locations of the island. This report analyzes data collected from Clarendon Parish, Jamaica. In order to create a descriptive analysis, I use SAS® Studio 9.4. The procedures I use include PROC IMPORT, PROC MEANS, PROC FREQ, and PROC GCHART. After conducting the aforementioned procedures, I am able to produce a descriptive analysis of the health issues plaguing the island.
Verlin Joseph, Florida A&M University
With increasing data needs, it becomes more and more unwieldy to ensure that all scheduled jobs are running successfully and on time. Worse, maintaining your reputation as an information provider becomes a precarious prospect as the likelihood increases that your customers alert you of reporting issues before you are even aware yourself. By combining various SAS® capabilities and tying them together with concise visualizations, it is possible to track jobs actively and alert customers of issues before they become a problem. This paper introduces a report tracking framework that helps achieve this goal and improve customer satisfaction. The report tracking starts by obtaining table and job statuses and then color-codes them by severity levels. Based on the job status, it then goes deeper into the log to search for potential errors or other relevant information to help assess the processes. The last step is to send proactive alerts to users informing them of possible delays or potential data issues.
Jia Heng, Wyndham Destination Network
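The severity color-coding and log-scanning steps described above can be sketched in a few lines; the status names, colors, and keywords below are illustrative assumptions, not the paper's actual configuration:

```python
# Hypothetical severity map: each job status gets a color for the tracking
# report; unknown statuses fall back to gray.
SEVERITY = {"completed": "green", "delayed": "yellow", "failed": "red"}

def classify(jobs):
    """Map {job name: status} to {job name: severity color}."""
    return {name: SEVERITY.get(status, "gray") for name, status in jobs.items()}

def scan_log(log_text, keywords=("ERROR", "WARNING")):
    """Pull out the log lines worth escalating in a proactive alert."""
    return [line for line in log_text.splitlines()
            if any(k in line for k in keywords)]
```

A failed job would first show as red in the visualization, and the matching log lines would go into the alert sent to users.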
This paper presents a SAS® macro for generating random numbers from skew-normal and skew-t distributions, as well as the quantiles of these distributions. The results are similar to those generated by the sn package of R software.
Alan Silva, University of Brasilia
Paulo Henrique Dourado da Silva, Banco do Brasil
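The macro itself is SAS, but the underlying generator can be illustrated with Azzalini's representation of the skew-normal distribution: with X0 and X1 independent standard normals and delta = alpha / sqrt(1 + alpha^2), the variate delta*|X0| + sqrt(1 - delta^2)*X1 is skew-normal with shape parameter alpha. A Python sketch of that construction (not the macro's actual code):

```python
import math
import random

def rand_skew_normal(alpha, rng=random):
    """One draw from a standard skew-normal with shape parameter alpha,
    via Azzalini's |X0|-based representation."""
    delta = alpha / math.sqrt(1.0 + alpha * alpha)
    x0, x1 = rng.gauss(0, 1), rng.gauss(0, 1)
    return delta * abs(x0) + math.sqrt(1.0 - delta * delta) * x1
```

A skew-t draw follows by dividing a skew-normal draw by sqrt(V/nu) with V chi-square distributed on nu degrees of freedom.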
It is common for hundreds or even thousands of clinical endpoints to be collected from individual subjects, but events from the majority of clinical endpoints are rare. The challenge of analyzing high dimensional sparse data is in balancing analytical consideration for statistical inference and clinical interpretation with precise meaningful outcomes of interest at intra-categorical and inter-categorical levels. Lumping or grouping similar rare events into a composite category has the statistical advantage of increasing testing power, reducing multiplicity size, and avoiding competing risk problems. However, too much or inappropriate lumping would jeopardize the clinical meaning of interest and external validity, whereas splitting or keeping each individual event at its basic clinically meaningful category can overcome the drawbacks of lumping. This practice might create analytical issues of increasing type II errors, multiplicity size, competing risks, and having a large proportion of endpoints with rare events. It seems that lumping and splitting are diametrically opposite approaches, but in fact, they are complementary. Both are essential for high dimensional data analysis. This paper describes the steps required for the lumping and splitting analysis and presents SAS® code that can be used to implement each step.
Shirley Lu, VAMHCS, Cooperative Studies Program
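A minimal sketch of the lumping step, under the assumption that endpoints below some event-count threshold are collapsed into a single composite category (the threshold, category names, and counts are illustrative, not from the paper):

```python
from collections import Counter

def lump_rare_events(event_counts, min_events=5, composite="COMPOSITE"):
    """Keep frequent endpoints split at their own category; group endpoints
    with fewer than min_events events into one composite category."""
    lumped = Counter()
    for endpoint, n in event_counts.items():
        key = endpoint if n >= min_events else composite
        lumped[key] += n
    return dict(lumped)
```

The splitting direction is the identity map on the original categories; the analytical trade-off between the two is then a power-versus-interpretability decision, as the abstract describes.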
Government funding is a highly sought-after financing medium by entrepreneurs and researchers aspiring to make a breakthrough in the market and compete with larger organizations. The funding is channeled via federal agencies that seek out promising research and products that improve on extant work. This study uses a text analytic approach to analyze the project abstracts that earned government funding over the past three and a half decades in the fields of Defense and Health and Human Services. This helps us understand the factors that might be responsible for awards made to a project. We collected 100,279 records from the Small Business Innovation and Research (SBIR) website (https://www.sbir.gov). Initially, we analyze the trends in government funding of research by small businesses. Then we perform text mining on the 55,791 abstracts of the projects funded by the Department of Defense and the Department of Health and Human Services. From the text mining, we portray the key topics and patterns related to past research and determine the changes in their trends over time. Through our study, we highlight the research trends to enable organizations to shape their business strategy and make better investments in future research.
Sairam Tanguturi, Oklahoma State University
Sagar Rudrake, Oklahoma State University
Nithish Reddy Yeduguri, Oklahoma State University
In health sciences and biomedical research, we often search for scientific journal abstracts and publications within MEDLINE, which is a suite of indexed databases developed and maintained by the National Center for Biotechnology Information (NCBI) at the United States National Library of Medicine (NLM). PubMed is a free search engine within MEDLINE and has become one of the standard databases to search for scientific abstracts. Entrez is the information retrieval system that gives you direct access to the 40 databases with over 1.3 billion records within the NCBI. You can access these records by using the eight e-utilities (einfo, esearch, esummary, efetch, elink, epost, espell, and egquery), which are the NCBI application programming interfaces (APIs). In this paper I will focus on using three of the e-utilities to retrieve data from PubMed as opposed to the standard method of manually searching and exporting the information from the PubMed website. I will demonstrate how to use the SAS HTTP procedure along with the XML mapper to develop a SAS® macro that generates all the required data from PubMed based on search term parameters. Using SAS to extract information from PubMed searches and save it directly into a data set allows the process to be automated (which is good for running routine updates) and eliminates manual searching and exporting from PubMed.
Craig Hansen, South Australian Health and Medical Research Institute
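As a sketch of the kind of request involved, here is the esearch call built with the Python standard library rather than PROC HTTP; the endpoint and the db, term, and retmax parameters are the documented NCBI E-utilities ones, while the helper name is mine:

```python
from urllib.parse import urlencode

# Documented base URL of the NCBI Entrez E-utilities.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def esearch_url(term, db="pubmed", retmax=100):
    """Build the esearch request that returns PubMed IDs matching a query;
    the IDs would then feed esummary or efetch calls."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS}/esearch.fcgi?{query}"
```

In the paper's workflow the XML returned by such calls is mapped into SAS data sets with the XML mapper, so the whole search-and-export cycle runs without touching the PubMed website.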
The presentation illustrates techniques and technology to manage complex and large-scale ETL workloads using SAS® and IBM Platform LSF to provide greater control of flow triggering and more complex triggering logic based on a rules-engine approach that parses pending workloads and current activity. Key techniques, configuration steps, and design and development patterns are demonstrated. Pros and cons of the techniques that are used are discussed. Benefits include increased throughput, reduced support effort, and the ability to support more sophisticated inter-flow dependencies and conflicts.
Angus Looney, Capgemini
Breast cancer is the second leading cause of cancer deaths among women in the United States. Although mortality rates have been decreasing over the past decade, it is important to continue to make advances in diagnostic procedures as early detection vastly improves chances for survival. The goal of this study is to accurately predict the presence of a malignant tumor using data from fine needle aspiration (FNA) with visual interpretation. Compared with other methods of diagnosis, FNA displays the highest likelihood for improvement in sensitivity. Furthermore, this study aims to identify the variables most closely associated with accurate outcome prediction. The study utilizes the Wisconsin Breast Cancer data available within the UCI Machine Learning Repository. The data set contains 699 clinical case samples (65.52% benign and 34.48% malignant) assessing the nuclear features of fine needle aspirates taken from patients' breasts. The study analyzes a variety of traditional and modern models, including logistic regression, decision tree, neural network, support vector machine, gradient boosting, and random forest. Prior to model building, the weights of evidence (WOE) approach was used to account for the high dimensionality of the categorical variables, after which variable selection methods were employed. Ultimately, the gradient boosting model utilizing a principal component variable reduction method was selected as the best prediction model with a 2.4% misclassification rate, 100% sensitivity, 0.963 Kolmogorov-Smirnov statistic, 0.985 Gini coefficient, and 0.992 ROC index for the validation data. Additionally, the uniformity of cell shape and size, bare nuclei, and bland chromatin were consistently identified as the most important FNA characteristics across variable selection methods. These results suggest that future research should attempt to refine the techniques used to determine these specific model inputs. Greater accuracy in characterizing the FNA attributes will allow researchers to develop more promising models for early detection.
Josephine Akosa, Oklahoma State University
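Reported figures such as the 2.4% misclassification rate and 100% sensitivity are confusion-matrix arithmetic; this small helper shows the calculations on an illustrative matrix (not the paper's actual counts):

```python
def classification_metrics(tp, fn, fp, tn):
    """Standard binary-classification metrics from confusion-matrix cells:
    tp/fn/fp/tn = true positives, false negatives, false positives,
    true negatives."""
    total = tp + fn + fp + tn
    return {
        "misclassification": (fp + fn) / total,  # overall error rate
        "sensitivity": tp / (tp + fn),           # true-positive rate
        "specificity": tn / (tn + fp),           # true-negative rate
    }
```

For a cancer screen, 100% sensitivity (fn = 0) is the clinically critical property: no malignant case is missed, even if a few benign cases are flagged.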
The use of smart clothing with integrated wearable medical sensors to track human health is now attracting many researchers. However, body movement caused by daily human activities inserts artificial noise into physiological data signals, which affects the output of a health monitoring/alert system. To overcome this problem, recognizing human activities, determining the relationship between activities and physiological signals, and removing noise from the collected signals are essential steps. This paper focuses on the first step, which is human activity recognition. Our research shows that no other study has used SAS® for classifying human activities. For this study, two data sets were collected from an open repository. Both data sets have 561 input variables and one nominal target variable with four levels. Principal component analysis along with other variable reduction and selection techniques was applied to reduce dimensionality in the input space. Several modeling techniques with different optimization parameters were used to classify human activity. The gradient boosting model was selected as the best model based on a test misclassification rate of 0.1233. That is, 87.67% of total events were classified correctly.
Minh Pham, Oklahoma State University
Mary Ruppert-Stroescu, Oklahoma State University
Mostakim Tanjil, Oklahoma State University
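Principal component analysis condenses the 561 correlated inputs into a few components ranked by explained variance. The idea can be sketched in the two-variable case, where the 2x2 covariance matrix has a closed-form leading eigenvalue (this is a teaching sketch, not the study's code):

```python
import math

def first_pc_variance_ratio(xs, ys):
    """Fraction of total variance captured by the first principal component
    of two variables, via the closed-form 2x2 eigendecomposition."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) ** 2 for x in xs) / (n - 1)                     # var(x)
    c = sum((y - my) ** 2 for y in ys) / (n - 1)                     # var(y)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)   # cov(x,y)
    root = math.sqrt((a - c) ** 2 + 4 * b * b)
    lam1 = (a + c + root) / 2        # leading eigenvalue of [[a, b], [b, c]]
    return lam1 / (a + c)            # share of total variance explained
```

With 561 inputs the same computation runs on a 561x561 covariance matrix; components whose cumulative ratio passes a chosen threshold replace the raw variables as model inputs.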
Massive Open Online Courses (MOOC) have attracted increasing attention in educational data mining research areas. MOOC platforms provide free higher education courses to Internet users worldwide. However, MOOCs have high enrollment but notoriously low completion rates. The goal of this study is to apply frequentist and Bayesian logistic regression to investigate whether and how students' engagement, intentions, education levels, and other demographics are conducive to course completion in MOOC platforms. The original data used in this study came from an online eight-week course titled Big Data in Education, taught within the Coursera platform (MOOC) by Teachers College, Columbia University. The data sets for analysis were created from three different sources--clickstream data, a pre-course survey, and homework assignment files. The SAS system provides multiple procedures to perform logistic regression, with each procedure having different features and functions. In this study, we apply two approaches--frequentist and Bayesian logistic regression--to MOOC data. PROC LOGISTIC is used for the frequentist approach, and PROC GENMOD is used for Bayesian analysis. The results obtained from the two approaches are compared. All the statistical analyses are conducted in SAS® 9.3. Our preliminary results show that MOOC students with higher course engagement and higher motivation are more likely to complete the MOOC course.
Yan Zhang, Educational Testing Service
Ryan Baker, Columbia University
Yoav Bergner, Educational Testing Service
Thanks to advances in technologies that make data more readily available, sports analytics is an increasingly popular topic. A majority of sports analyses use advanced statistics and metrics to achieve their goal, whether it be prediction or explanation. Few studies include public opinion data. Last year's highly anticipated NCAA College Football Championship game between Ohio State and Oregon broke ESPN and cable television records with an astounding 33.4 million viewers. Given the popularity of college football, especially now with the inclusion of the new playoff system, people seem to be paying more attention than ever to the game. ESPN provides fans with College Pick'em, which gives them a way to compete with their friends and colleagues on a weekly basis, for free, to see who can correctly pick the winners of college football games. Each week, 10 close matchups are selected, and users must select which team they think will win the game and rank those picks on a scale of 1 (lowest) to 10 (highest), according to their confidence level. For each team, the percentage of users who picked that team and the national average confidence are shown. Ideally, one could use these variables in conjunction with other information to enhance one's own predictions. The analysis described in this session explores the relationship between public opinion data from College Pick'em and the corresponding game outcomes by using visualizations and statistical models implemented by various SAS® products.
Taylor Larkin, The University of Alabama
Matt Collins, University of Alabama
SAS® macros offer a very flexible way of developing, organizing, and running SAS code. In many systems, programming components are stored as macros in files and called as macros via a variety of means so that they can perform the task at hand. Consideration must be given to the organization of these files in a manner consistent with proper use and retention. An example might be the development of some macros that are company-wide in potential application, others that are departmental, and still others that might be for personal or individual use. Superimposed on this structure in a factorial fashion is the need to have areas that are considered: (1) production (validated), (2) archived (retired), and (3) developmental. Several proposals for accomplishing this are discussed, as well as how this structure might interrelate with stored processes available in some SAS systems. The use of these macros in systems ranging from simple background batch processing to highly interactive SAS and BI processing is discussed. Specifically, the drivers or driver files are addressed. Pros and cons of all approaches are considered.
Roger Muller, Data-To-Events, Inc.
The HIPAA Privacy Rule can restrict geographic and demographic data used in health-care analytics. After reviewing the HIPAA requirements for de-identification of health-care data used in research, this poster guides the beginning SAS® Visual Analytics user through different options to create a better user experience. This poster presents a variety of data visualizations the analyst will encounter when describing a health-care population. We explore the different options SAS Visual Analytics offers and also offer tips on data preparation prior to using SAS® Visual Analytics Designer. Among the topics we cover are SAS Visual Analytics Designer object options (including geo bubble map, geo region map, crosstab, and treemap), tips for preparing your data for use in SAS Visual Analytics, and tips on filtering data after it has been loaded into SAS Visual Analytics.
Jessica Wegner, Optum
Margaret Burgess, Optum
Catherine Olson, Optum
Customer lifetime value (LTV) estimation involves two parts: the survival probabilities and the profit margins. This article describes the estimation of those probabilities using discrete-time logistic hazard models. The estimation of profit margins is based on linear regression. When outliers are present among the margins, we suggest applying robust regression with PROC ROBUSTREG.
Vadim Pliner, Verizon Wireless
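The two parts combine in a standard way: the discrete-time hazards give period-by-period survival probabilities, which weight the discounted margins. A sketch with illustrative numbers (the paper's models estimate the hazards and margins; this shows only the final arithmetic):

```python
def lifetime_value(hazards, margins, monthly_rate=0.01):
    """LTV as the sum of discounted margins weighted by survival:
    hazards[t] is the discrete-time probability of churn in period t+1,
    margins[t] the profit margin earned in that period if still active."""
    survival, ltv = 1.0, 0.0
    for t, (h, m) in enumerate(zip(hazards, margins), start=1):
        survival *= (1.0 - h)                        # P(still a customer)
        ltv += survival * m / (1.0 + monthly_rate) ** t
    return ltv
```

With PROC ROBUSTREG supplying outlier-resistant margin estimates, the same summation applies unchanged.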
Debt collection! The two words can trigger multiple images in one's mind--mostly harsh. However, let's try to think positively for a moment. In 2013, over $55 billion of debt was past due in the United States. What if all of these debts were left as is and the fate of credit issuers rested in the hands of goodwill payments made by defaulters? Well, this is not the most sustainable model, to say the least. In this situation, debt collection comes in as a tool that is employed at multiple levels of recovery to keep the credit flowing. Ranging from in-house to third-party to individual collection efforts, this industry is huge and plays an important role in keeping the engine of commerce running. In the recent past, with financial markets recovering and banks selling fewer charged-off accounts and at higher prices, debt collection has increasingly become a game of efficient operations backed by solid analytics. This paper takes you into the back alleys of all the data that is in there and gives an overview of some ways modeling can be used to impact the collection strategy and outcome. SAS® tools such as SAS® Enterprise Miner™ and SAS® Enterprise Guide® are extensively used for both data manipulation and modeling. Decision trees are given more focus to understand what factors make the most impact. Along the way, this paper also gives an idea of how analytics teams today are slowly trying to get the buy-in from other stakeholders in any company, which surprisingly is one of the most challenging aspects of our jobs.
Karush Jaggi, AFS Acceptance
Harold Dickerson, SquareTwo Financial
Thomas Waldschmidt, SquareTwo Financial
Phishing is the attempt of a malicious entity to acquire personal, financial, or otherwise sensitive information such as user names and passwords from recipients through the transmission of seemingly legitimate emails. By quickly alerting recipients of known phishing attacks, an organization can reduce the likelihood that a user will succumb to the request and unknowingly provide sensitive information to attackers. Methods to detect phishing attacks typically require the body of each email to be analyzed. However, most academic institutions do not have the resources to scan individual emails as they are received, nor do they wish to retain and analyze message body data. Many institutions simply rely on the education and participation of recipients within their network. Recipients are encouraged to alert information security (IS) personnel of potential attacks as they are delivered to their mailboxes. This paper explores a novel and more automated approach that uses SAS® to examine email header and transmission data to determine likely phishing attempts that can be further analyzed by IS personnel. Previously, a collection of 2,703 emails from an external filtering appliance was examined with moderate success. This paper focuses on the gains from analyzing an additional 50,000 emails, with the inclusion of an additional 30 known attacks. Real-time email traffic is exported from Splunk Enterprise into SAS for analysis. The resulting model aids in determining the effectiveness of alerting IS personnel to potential phishing attempts faster than a user simply forwarding a suspicious email to IS personnel.
Taylor Anderson, University of Alabama
Denise McManus, University of Alabama
The new SAS® High-Performance Statistics procedures were developed to respond to the growth of big data and computing capabilities. Although there is some documentation regarding how to use these new high-performance (HP) procedures, relatively little has been disseminated regarding under what specific conditions users can expect performance improvements. This paper serves as a practical guide to getting started with HP procedures in SAS®. The paper describes the differences between key HP procedures (HPGENSELECT, HPLMIXED, HPLOGISTIC, HPNLMOD, HPREG, HPCORR, HPIMPUTE, and HPSUMMARY) and their legacy counterparts both in terms of capability and performance, with a particular focus on discrepancies in real time required to execute. Simulations were conducted to generate data sets that varied on the number of observations (10,000, 50,000, 100,000, 500,000, 1,000,000, and 10,000,000) and the number of variables (50, 100, 500, and 1,000) to create these comparisons.
Diep Nguyen, University of South Florida
Sean Joo, University of South Florida
Anh Kellermann, University of South Florida
Jeff Kromrey, University of South Florida
Jessica Montgomery, University of South Florida
Patricia Rodríguez de Gil, University of South Florida
Yan Wang, University of South Florida
Data-driven decision making is critical for any organization to thrive in this fiercely competitive world. The decision-making process has to be accurate and fast in order to stay a step ahead of the competition. One major problem organizations face is huge data load times in loading or processing the data. Reducing the data loading time can help organizations perform faster analysis and thereby respond quickly. In this paper, we compared the methods that can import data of a particular file type in the shortest possible time and thereby increase the efficiency of decision making. SAS® takes input from various file types (such as XLS, CSV, XLSX, ACCESS, and TXT) and converts that input into SAS data sets. To perform this task, SAS provides multiple solutions (such as the IMPORT procedure, the INFILE statement, and the LIBNAME engine) to import the data. We observed the processing times taken by each method for different file types with a data set containing 65,535 observations and 11 variables. We executed the procedure multiple times to check for variation in processing time. From these tests, we recorded the minimum processing time for the combination of procedure and file type. From our analysis of processing times taken by each importing technique, we observed that the shortest processing times for CSV and TXT files, XLS and XLSX files, and ACCESS files are the INFILE statement, the LIBNAME engine, and PROC IMPORT, respectively.
Divya Dadi, Oklahoma State University
Rahul Jhaver, Oklahoma State University
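The measurement protocol above (run each method repeatedly and record the minimum) guards against caching and system-load effects. A generic timing harness of that form, sketched in Python rather than SAS:

```python
import time

def best_time(func, *args, repeats=5):
    """Run func(*args) several times and return the best wall-clock time;
    the minimum is less noisy than a single run or a mean."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best
```

best_time(reader, path) can then be tabulated per file type and import method, mirroring the paper's comparison of PROC IMPORT, the INFILE statement, and the LIBNAME engine across CSV, TXT, XLS, XLSX, and ACCESS inputs.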
The experimental item response theory procedure (PROC IRT), included in the recently released SAS/STAT® 13.1 and 13.2, enables item response modeling and latent trait estimation in SAS® for a wide spectrum of educational and psychological research. This paper evaluates the performance of PROC IRT in item parameter recovery under various testing conditions. The pros and cons of PROC IRT versus BILOG-MG 3.0 are presented. For practitioners of IRT models, the development of IRT-related analysis in SAS is inspiring and offers a great choice to the growing population of IRT users. A shift to SAS can be beneficial based on several features of SAS: its flexibility in data management, its power in data analysis, its convenient output delivery, and its increasing richness in graphical presentation. It is critical to ensure the quality of item parameter calibration and trait estimation before you can continue with other components, such as test scoring, test form construction, IRT equating, and so on.
Yi-Fang Wu, ACT, Inc.
The importance of understanding terrorism in the United States assumed heightened prominence in the wake of the coordinated attacks of September 11, 2001. Yet, surprisingly little is known about the frequency of attacks that might happen in the future and the factors that lead to a successful terrorist attack. This research is aimed at forecasting the frequency of attacks per annum in the United States and determining the factors that contribute to an attack's success. Using the data acquired from the Global Terrorism Database (GTD), we tested our hypothesis by examining the frequency of attacks per annum in the United States from 1972 to 2014 using SAS® Enterprise Miner™ and SAS® Forecast Studio. The data set has 2,683 observations. Our forecasting model predicts that there could be as many as 16 attacks every year for the next 4 years. From our study of factors that contribute to the success of a terrorist attack, we discovered that attack type, weapons used in the attack, place of attack, and type of target play pivotal roles in determining the success of a terrorist attack. Results reveal that the government might be successful in averting assassination attempts but might fail to prevent armed assaults and facilities or infrastructure attacks. Therefore, additional security might be required for important facilities in order to prevent further attacks from being successful. Results further reveal that it is possible to reduce the forecasted number of attacks by raising defense spending and by putting an end to the raging war in the Middle East. We are currently working on gathering data to do in-depth analysis at the monthly and quarterly level.
SriHarsha Ramineni, Oklahoma State University
Koteswara Rao Sriram, Oklahoma State University
In the fast-changing world of data management, new technologies to handle increasing volumes of (big) data are emerging every day. Many companies are struggling to deal with these technologies and, more importantly, to integrate them into their existing data management processes. Moreover, they also want to rely on the knowledge built by their teams in existing products. Implementing change to learn new technologies can therefore be a costly procedure. SAS® is a perfect fit in this situation, offering a suite of software that can be set up to work with any third-party database through the corresponding SAS/ACCESS® interface. Indeed, for new database technologies, SAS releases a specific SAS/ACCESS interface, allowing users to develop and migrate SAS solutions almost transparently. Your users need to know only a few techniques in order to combine the power of SAS with a third-party (big) database. This paper helps companies with the integration of emerging technologies into their current SAS® Data Management platform using SAS/ACCESS. More specifically, the paper focuses on best practices for success in such an implementation. Integrating SAS Data Management with the IBM PureData for Analytics Appliance is provided as an example.
Magali Thesias, Deloitte
Color is an important aspect of data visualization and provides an analyst with another tool for identifying data trends. But is the default option the best for every case? Default color scales may be familiar to us; however, they can have inherent flaws that skew our perception of the data. The impact of data visualizations can be considerably improved with just a little thought toward choosing the correct colors. Selecting an appropriate color map can be difficult, and you might find it easier to generate your own custom color scale. After a brief introduction to the red, green, and blue (RGB) color space, we discuss the strengths and weaknesses of some widely used color scales, as well as what to keep in mind when designing your own. A simple technique is presented, detailing how you can use SAS® to generate a number of color scales and apply them to your data. Using just a few graphics procedures, you can transform almost any complex data into an easily digestible set of visuals. The techniques used in this discussion were developed and tested using SAS® Enterprise Guide® 5.1.
Jeff Grant, Bank of Montreal
Mahmoud Mamlouk, BMO Harris Bank
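Generating a custom scale often comes down to interpolating each RGB channel between two endpoint colors. A sketch of that technique (the hex strings translate directly to SAS CXrrggbb color codes):

```python
def color_scale(start_rgb, end_rgb, steps):
    """Linearly interpolate each RGB channel between two endpoint colors,
    returning a list of hex color codes."""
    scale = []
    for i in range(steps):
        t = i / (steps - 1) if steps > 1 else 0.0   # position along the ramp
        rgb = [round(s + t * (e - s)) for s, e in zip(start_rgb, end_rgb)]
        scale.append("#{:02X}{:02X}{:02X}".format(*rgb))
    return scale
```

Note that a naive linear ramp like this is exactly the kind of scale whose perceptual flaws the paper discusses: equal RGB steps are not perceived as equal lightness steps, which is one reason to design scales deliberately rather than accept defaults.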
There are many methods to randomize participants in randomized control trials. If it is important to have approximately balanced groups throughout the course of the trial, simple randomization is not a suitable method. Perhaps the most common alternative method that provides balance is the blocked randomization method. A less well-known method called the treatment adaptive randomized design also achieves balance. This paper shows you how to generate an entire randomization sequence to randomize participants in a two-group clinical trial using the adaptive biased coin randomization design (ABCD), prior to recruiting any patients. Such a sequence could be used in a central randomization server. A unique feature of this method allows the user to determine the extent to which imbalance is permitted to occur throughout the sequence while retaining the probabilistic nature that is essential to randomization. Properties of sequences generated by the ABCD approach are compared to those generated by simple randomization, a variant of simple randomization that ensures balance at the end of the sequence, and by blocked randomization.
Gary Foster, St Joseph's Healthcare
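As an illustration of the balance/randomness trade-off, here is an Efron-style biased coin sketch: balanced groups get a fair coin, otherwise the coin favors the lagging group with probability p. The paper's ABCD design adapts the bias more finely to the degree of imbalance, so treat this only as the simplest member of the family:

```python
import random

def abcd_sequence(n, p=2/3, rng=random):
    """Generate n two-group assignments with an Efron-style biased coin:
    fair coin when balanced, probability p toward the lagging arm otherwise."""
    counts, seq = {"A": 0, "B": 0}, []
    for _ in range(n):
        if counts["A"] == counts["B"]:
            arm = "A" if rng.random() < 0.5 else "B"
        else:
            lagging = min(counts, key=counts.get)   # arm with fewer subjects
            leading = "B" if lagging == "A" else "A"
            arm = lagging if rng.random() < p else leading
        counts[arm] += 1
        seq.append(arm)
    return seq
```

Because every assignment stays probabilistic, the sequence retains the unpredictability that blocked randomization can lose near block boundaries, while the bias keeps imbalance tightly controlled throughout.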
SAS® Output Delivery System (ODS) Graphics started appearing in SAS® 9.2. When first starting to use these tools, the traditional SAS/GRAPH® software user might come upon some very significant challenges in learning the new way to do things. This is further complicated by the lack of simple demonstrations of capabilities. Most graphs in training materials and publications are rather complicated graphs that, while useful, are not good teaching examples. This paper contains many examples of very simple ways to get very simple things accomplished. Over 20 different graphs are developed using only a few lines of code each, using data from the SASHELP data sets. The use of the SGPLOT, SGPANEL, and SGSCATTER procedures is shown. In addition, the paper addresses those situations in which the user must alternatively use a combination of the TEMPLATE and SGRENDER procedures to accomplish the task at hand. Most importantly, the use of ODS Graphics Designer as a teaching tool and a generator of sample graphs and code is covered. The emphasis in this paper is the simplicity of the learning process. Users will be able to take the included code and run it immediately on their personal machines to achieve an instant sense of gratification.
Roger Muller, Data-To-Events, Inc.
Perl is a good language for working with text, which in the case of SAS® means character variables. Perl was widely used in website programming in the early days of the World Wide Web. SAS provides some Perl functions in DATA step statements. Perl is useful for finding text patterns or simply strings of text and returning the strings or their positions in a character variable. I often use Perl to locate a string of text in a character variable. Knowing this location, I know where to start the SUBSTR function. The poster covers a number of Perl functions available in SAS® 9.2 and later. Each function is explained in written text, and then a brief code demonstration shows how I use the function. The examples are explained based on both SAS documentation and my best practices.
Peter Timusk, Independent
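The locate-then-SUBSTR pattern described above can be sketched with PRXMATCH, one of the Perl regular expression functions available in the DATA step (a generic illustration, not code from the poster):

```sas
data _null_;
  text = 'Visit 12 on 2015-06-01';
  /* PRXMATCH returns the position of the first match, or 0 if none */
  pos = prxmatch('/\d{4}-\d{2}-\d{2}/', text);
  /* Use that position as the starting point for SUBSTR */
  if pos > 0 then date_str = substr(text, pos, 10);
  put pos= date_str=;
run;
```

Related functions such as PRXCHANGE (substitution) and PRXPOSN (capture groups) follow the same pattern-first style.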
Have you ever wondered how to get the most from Web 2.0 technologies in order to visualize SAS® data? How to make those graphs dynamic, so that users can explore the data in a controlled way, without needing prior knowledge of SAS products or data science? Wonder no more! In this session, you learn how to turn basic sashelp.stocks data into a snazzy HighCharts stock chart in which a user can review any time period, zoom in and out, and export the graph as an image. All of these features take only two DATA steps and one SORT procedure, 57 lines of SAS code in total.
Vasilij Nevlev, Analytium Ltd
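The general idea of streaming SAS data into a HighCharts page can be sketched with a SORT and a DATA _NULL_ step that writes the HTML directly. This is a rough sketch in the spirit of the session, not the author's actual 57 lines; the output file name is hypothetical:

```sas
/* Sort the stock series by date before streaming it out */
proc sort data=sashelp.stocks out=ibm;
  by date;
  where stock = 'IBM';
run;

/* Write an HTML page that feeds the series to Highcharts */
data _null_;
  set ibm end=last;
  file 'ibm_stock.html';
  if _n_ = 1 then put
    '<html><head><script src="https://code.highcharts.com/stock/highstock.js"></script></head>'
    '<body><div id="chart"></div><script>'
    "Highcharts.stockChart('chart', {series: [{name: 'IBM', data: [";
  /* Highcharts expects [epoch-milliseconds, value] pairs */
  ms = (date - '01jan1970'd) * 86400000;
  put '[' ms : best16. ',' close ']' @;
  if not last then put ',';
  else put ']}]});</script></body></html>';
run;
```

The resulting page gives zooming, range selection, and image export for free via the Highcharts Stock library.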
Working with big data is often time consuming and challenging. The primary goal in programming is to maximize throughputs while minimizing the use of computer processing time, real time, and programmers' time. By using the Multiprocessing (MP) CONNECT method on a symmetric multiprocessing (SMP) computer, a programmer can divide a job into independent tasks and execute the tasks as threads in parallel on several processors. This paper demonstrates the development and application of a parallel processing program on a large amount of health-care data.
Shuhua Liang, Kaiser Permanente
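The MP CONNECT approach described above divides a job into independent tasks that run as parallel SAS sessions. A minimal sketch (library path and data set names are hypothetical) looks like this:

```sas
/* Allow RSUBMIT to spawn local parallel sessions automatically */
options autosignon sascmd='!sascmd';

rsubmit task1 wait=no;
  libname hc '/data/health';          /* hypothetical claims library */
  proc means data=hc.claims_2014; run;
endrsubmit;

rsubmit task2 wait=no;
  libname hc '/data/health';
  proc means data=hc.claims_2015; run;
endrsubmit;

/* Block until both tasks complete, then end the remote sessions */
waitfor _all_ task1 task2;
signoff _all_;
```

WAIT=NO makes each RSUBMIT asynchronous, so the two PROC MEANS steps execute concurrently on separate processors.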
As SAS® Grid Manager becomes more and more popular, it is important to meet users' needs for a successful migration to a SAS® Grid environment. This paper focuses on key requirements and common issues for new SAS Grid users, especially those coming from a traditional environment. This paper describes a few common requirements, such as the need for a current working directory, the change of file system navigation in SAS® Enterprise Guide® with a user-given location, getting a job execution summary email, and so on. The GRIDWORK directory has been introduced in SAS Grid Manager, which is a bit different from the traditional SAS WORK location. This paper explains how you can use the GRIDWORK location in a more user-friendly way. Sometimes users experience data set size differences during grid migration. A few important reasons for data set size differences are explained. We also demonstrate how to create new custom scripts as per business needs and how to incorporate them with the SAS Grid Manager engine.
Piyush Singh, TATA Consultancy Services Ltd
Tanuj Gupta, TATA Consultancy Services
Prasoon Sangwan, Tata Consultancy Services Ltd
In this paper, a SAS® macro is introduced that can help users find and access their folders and files very easily. By providing a path to the macro and letting the macro know which folders and files you are looking for under this path, the macro creates an HTML report that lists the matched folders and files. The best part of this HTML report is that it also creates a hyperlink for each folder and file so that when a user clicks the hyperlink, it directly opens the folder or file. Users can also ask the macro to find certain folders or files by providing part of the folder or file name as the search criterion. The results shown in the report can be sorted in different ways so that it can further help users quickly find and access their folders and files.
Ting Sa, Cincinnati Children's Hospital Medical Center
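The core of such a macro is a DATA step that walks a directory with the DOPEN/DNUM/DREAD functions and builds raw HTML anchor tags as data values. A stripped-down sketch, assuming a hypothetical macro name, path, and output file (not the author's actual macro):

```sas
%macro listfiles(path);
  filename dir "&path";
  data files;
    length name $256 href $512;
    did = dopen('dir');                 /* open the directory */
    do i = 1 to dnum(did);              /* loop over its members */
      name = dread(did, i);
      /* Build a clickable hyperlink for the HTML report */
      href = cats('<a href="file:///', "&path/", name, '">', name, '</a>');
      output;
    end;
    rc = dclose(did);
    keep name href;
  run;

  ods html file='filelist.html';
  proc print data=files noobs; var href; run;
  ods html close;
%mend listfiles;
```

Search criteria can be added as a simple `if index(name, "&pattern")` filter inside the loop, and a BY or ORDER clause gives the different sort options.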
Often SAS® users within business units find themselves running wholly within a single production environment, where non-production environments are reserved for the technology support teams to roll out patches, maintenance releases, and so on. So what is available for business units to set up and manage their intra-platform business systems development life cycle (SDLC)? Another scenario is when a new platform is provisioned, but the business units are responsible for promoting their own content from the old to the new platform. How can this be done without platform administrator roles? This paper examines some typical issues facing business users trying to manage their SAS content, and it explores the capabilities of various SAS tools available to promote and manage metadata, mid-tier content, and SAS® Enterprise Guide® projects.
Andrew Howell, ANJ Solutions P/L
Today, there are 28 million small businesses, which account for 54% of all sales in the United States. The challenge is that small businesses struggle every day to accurately forecast future sales. These forecasts not only drive investment decisions in the business, but also are used in setting daily par, determining labor hours, and scheduling operating hours. In general, owners rely on gut instinct. Using SAS® provides the opportunity to develop accurate and robust models that can unlock cost savings for small business owners in a short amount of time. This research examines over 5,000 records from the first year of daily sales data for a start-up small business, while comparing the four basic forecasting models within SAS® Enterprise Guide®. The objective of this model comparison is to demonstrate how quick and easy it is to forecast small business sales using SAS Enterprise Guide. What does that mean for small businesses? More profit. SAS provides cost-effective models for small businesses to better forecast sales, resulting in better business decisions.
Cameron Jagoe, The University of Alabama
Taylor Larkin, The University of Alabama
Denise McManus, University of Alabama
Optimization models require continuous constraints to converge. However, some real-life problems are better described by models that incorporate discontinuous constraints. A common type of such discontinuous constraint becomes apparent when a regulation-mandated diversification requirement is implemented in an investment portfolio model. Generally stated, the requirement postulates that the aggregate of investments whose individual weights exceed a certain threshold should not exceed some predefined total within the portfolio. This form of the diversification requirement can be defined by the rules of any specific portfolio construction methodology and is commonly imposed by regulators. The paper discusses the impact of this type of discontinuous portfolio diversification constraint on the portfolio optimization model solution process and develops a convergent approach, consisting of a defined sequence of convergent non-linear optimization problems, presented in the framework of the OPTMODEL procedure modeling environment. The approach discussed has been used in constructing investable equity indexes.
Taras Zlupko, University of Chicago
Robert Spatz, University of Chicago
Due to advances in medical care and rising living standards, average life expectancy in the US has increased to 79 years. The result is an aging population and increased demand for technologies that help elderly people live independently and safely. This challenge can be met through ambient-assisted living (AAL) technologies. Much research has been done on human activity recognition (HAR) in the last decade; this work can be used in the development of assistive technologies, and HAR is expected to be the future technology for e-health systems. In this research, I discuss the need to predict human activity accurately by building various models in SAS® Enterprise Miner™ 14.1 on a free public data set that contains 165,633 observations and 19 attributes. Variables used in this research represent the metrics of accelerometers mounted on the waist, left thigh, right arm, and right ankle of four individuals performing five different activities, recorded over a period of eight hours. The target variable predicts human activity such as sitting, sitting down, standing, standing up, and walking. Upon comparing different models, the winner is an auto neural model whose input is taken from stepwise logistic regression, which is used for variable selection. The model has an accuracy of 98.73% and sensitivity of 98.42%.
Venkata Vennelakanti, Oklahoma State University
Many inquisitive minds are filled with excitement and anticipation of a response every time one posts a question on a forum. This paper explores the factors that impact the time to first response for questions posted in the SAS® Community forum. The factors are contributors' availability, the nature of the topic, and the number of contributors knowledgeable about that particular topic. The results from this project help SAS® users receive an estimated response time, and the SAS Community forum can use this information to answer several business questions such as the following: What time of the year is likely to have an overflow of questions? Do specific topics receive delayed responses? On which days of the week is the community most active? To answer such questions, we built a web crawler using Python and Selenium to fetch data from the SAS Community forum, one of the largest analytics groups. We scraped over 13,443 queries and solutions from January 2014 to the present. We also captured several query-related attributes such as the number of replies, likes, views, bookmarks, and the number of people conversing on the query. Using different tools, we analyzed this data set after clustering the queries into 22 subtopics and found interesting patterns that can help the SAS Community forum in several ways, as presented in this paper.
Praveen Kumar Kotekal, Oklahoma State University
In early 2006, the United States experienced a housing bubble that affected over half of the American states. It was one of the leading causes of the 2007-2008 financial recession. Primarily, the overvaluation of housing units resulted in foreclosures and prolonged unemployment during and after the recession period. The main objective of this study is to predict the current market value of a housing unit with respect to fair market rent, census region, metropolitan statistical area, area median income, household income, poverty income, number of units in the building, number of bedrooms in the unit, utility costs, other costs of the unit, and so on, to determine which factors affect the market value of the housing unit. For the purpose of this study, data was collected from the Housing Affordability Data System of the US Department of Housing and Urban Development. The data set contains 20 variables and 36,675 observations. To select the best possible input variables, several variable selection techniques were used. For example, LARS (least angle regression), LASSO (least absolute shrinkage and selection operator), adaptive LASSO, variable selection, variable clustering, stepwise regression, principal component analysis (PCA) with only numeric variables, and PCA with all variables were all tested. After selecting input variables, numerous modeling techniques were applied to predict the current market value of a housing unit. An in-depth analysis of the findings revealed that the current market value of a housing unit is significantly affected by the fair market value, insurance and other costs, structure type, household income, and more. Furthermore, a higher household income and median income of an area are associated with a higher market value of a housing unit.
Mostakim Tanjil, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
The Oklahoma State Department of Health (OSDH) conducts home visiting programs with families that need parental support. Domestic violence is one of the many screenings performed on these visits. The home visiting personnel are trained to do initial screenings; however, they do not have the extensive information required to treat or serve the participants in this arena. Understanding how demographics such as age, level of education, and household income, among others, are related to domestic violence might help home visiting personnel better serve their clients by modifying their questions based on these demographics. The objective of this study is to better understand the demographic characteristics of those in the home visiting programs who are identified with domestic violence. We also developed predictive models, such as logistic regression and decision trees, based on understanding the influence of demographics on domestic violence. The study population consists of all the women who participated in the Children First Program of the OSDH from 2012 to 2014. The data set contains 1,750 observations collected during screening by the home visiting personnel over that period. Participants must also have completed the Demographic form as well as the Relationship Assessment form at the time of intake. Univariate and multivariate analysis has been performed to discover the influence that age, education, and household income have on domestic violence. From the initial analysis, we can see that women who are younger than 25 years old, who haven't completed high school, and who are somewhat dependent on their husbands or partners for money are most vulnerable. We have even segmented the clients based on the likelihood of domestic violence.
Soumil Mukherjee, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Miriam McGaugh, Oklahoma State Department of Health
Horizontal data sorting is a very useful SAS® technique in advanced data analysis when you are using SAS programming. Two years ago (SAS® Global Forum Paper 376-2013), we presented and illustrated various methods and approaches to perform horizontal data sorting, and we demonstrated its valuable application in strategic data reporting. However, this technique can also be used as a creative analytic method in advanced business analytics. This paper presents and discusses its innovative and insightful applications in product purchase sequence analyses such as product opening sequence analysis, product affinity analysis, next best offer analysis, time-span analysis, and so on. Compared to other analytic approaches, the horizontal data sorting technique has the distinct advantages of being straightforward, simple, and convenient to use. This technique also produces easy-to-interpret analytic results. Therefore, the technique can have a wide variety of applications in customer data analysis and business analytics fields.
Justin Jia, Trans Union Canada
Shan Shan Lin, CIBC
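The horizontal sorting the paper above builds on can be done in one line with CALL SORTC, which orders character values across variables within a row rather than ordering rows. A minimal sketch, assuming a hypothetical data set with one row per customer and five product columns:

```sas
/* Sort product codes left-to-right within each customer row */
data purchases_sorted;
  set purchases;                 /* hypothetical: prod1-prod5 per customer */
  array p{5} $ prod1-prod5;
  call sortc(of p{*});           /* ascending across the row */
run;
```

CALL SORTN does the same for numeric variables, for example to order purchase dates within a row before deriving a purchase sequence.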
High-quality documentation of SAS® code is standard practice in multi-user environments for smoother group collaborations. One of the documentation items that facilitate program sharing and retrospective review is a header section at the beginning of a SAS program highlighting the main features of the program, such as the program's name, its creation date, the program's aims, the programmer's identification, and the project title. In this header section, it is helpful to keep a list of the inputs and outputs of the SAS program (for example, SAS data sets and files that the program used and created). This paper introduces SAS-IO, a browser-based HTML/JavaScript tool that can automate production of such an input/output list. This can save the programmers' time, especially when working with long SAS programs.
Mohammad Reza Rezai, Institute for Clinical Evaluative Sciences
Revolution Analytics reports more than two million R users worldwide. SAS® has the capability to use R code, but users have discovered a slight learning curve to performing certain basic functions, such as getting data from the web. R is a functional programming language while SAS is a procedural programming language. These differences create difficulties when first making the switch from programming in R to programming in SAS. However, SAS/IML® software enables integration between the two languages by enabling users to write R code directly inside SAS/IML. This paper details the process of using the SAS/IML SUBMIT statement with the R option, together with the R package XML, to get data from the web into SAS/IML. The project uses public basketball data for each of the 30 NBA teams over the past 35 years, taken directly from Basketball-Reference.com. The data was retrieved from 66 individual web pages, cleaned using R functions, and compiled into a final data set composed of 48 variables and 895 records. The seamless compatibility between SAS and R provides an opportunity to use R code in SAS for robust modeling. The resulting analysis provides a clear and concise approach for those interested in pursuing sports analytics.
Matt Collins, University of Alabama
Taylor Larkin, The University of Alabama
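The SUBMIT / R round trip described above can be sketched as follows. This assumes R is installed and the RLANG system option is enabled; the file name is hypothetical, and it is not the authors' actual scraping code:

```sas
proc iml;
  submit / R;
    library(XML)
    # Parse a locally saved Basketball-Reference page (hypothetical file)
    tbl <- readHTMLTable('teams.html', which = 1, stringsAsFactors = FALSE)
  endsubmit;
  /* Bring the R data frame back into a SAS data set */
  call ImportDataSetFromR('work.teams', 'tbl');
quit;
```

Everything between SUBMIT / R and ENDSUBMIT runs in the R session, and ImportDataSetFromR transfers the result back for SAS modeling.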
Coronal mass ejections (CMEs) are massive explosions of magnetic field and plasma from the Sun. While responsible for the northern lights, these eruptions can cause geomagnetic storms and cataclysmic damage to Earth's telecommunications systems and power grid infrastructures. Hence, it is imperative to construct highly accurate predictive processes to determine whether an incoming CME will produce devastating effects on Earth. One such process, called stacked generalization, trains a variety of models, or base-learners, on a data set. Then, using the predictions from the base-learners, another model is trained to learn from the metadata. The goal of this meta-learner is to deduce information about the biases from the base-learners to make more accurate predictions. Studies have shown success in using linear methods, especially within regularization frameworks, at the meta-level to combine the base-level predictions. Here, SAS® Enterprise Miner™ 13.1 is used to reinforce the advantages of regularization via the Least Absolute Shrinkage and Selection Operator (LASSO) on this type of metadata. This work compares the LASSO model selection method to other regression approaches when predicting the occurrence of strong geomagnetic storms caused by CMEs.
Taylor Larkin, The University of Alabama
Denise McManus, University of Alabama
This paper examines the various sampling options that are available in SAS® through PROC SURVEYSELECT. We do not cover all of the possible sampling methods or options that PROC SURVEYSELECT features. Instead, we look at Simple Random Sampling, Stratified Random Sampling, Cluster Sampling, Systematic Sampling, and Sequential Random Sampling.
Rachael Becker, University of Central Florida
Drew Doyle, University of Central Florida
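Two of the designs the paper above covers can be sketched in a few lines against a SASHELP data set (generic illustrations, not the authors' code; note that the STRATA statement expects the input sorted by the strata variable):

```sas
/* Simple random sample of 50 cars */
proc surveyselect data=sashelp.cars out=srs
                  method=srs sampsize=50 seed=1234;
run;

/* Stratified random sample: 10 cars per origin */
proc sort data=sashelp.cars out=cars; by origin; run;

proc surveyselect data=cars out=strat
                  method=srs sampsize=10 seed=1234;
  strata origin;
run;
```

Swapping METHOD=SRS for METHOD=SYS or adding a CLUSTER statement yields the systematic and cluster designs discussed in the paper.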
With the ever-growing global tourism sector, the hotel industry is flourishing by leaps and bounds, and tourists' expectations of service quality are rising as well. In today's cyber age, where so many people take part in online communities, the influence of word of mouth has steadily increased over time. According to a recent survey, approximately 46% of travelers look for online reviews before traveling. A one-star rating by users influences peer consumers more than the five-star brand that the respective hotel markets itself to be. In this paper, we segment customers into clusters based on their reviews. The next step is the creation of an online survey targeting the interests of the customers in those different clusters, which helps hoteliers identify customers' interests and their hospitality expectations of a hotel. For this paper, we use a data set covering 4,096 different hotels, with a total of 60,239 user reviews.
Saurabh Nandy, Oklahoma State University
Neha Singh, Oklahoma State University
Sampling, whether with or without replacement, is an important component of the hypothesis testing process. In this paper, we demonstrate the mechanics and outcome of sampling without replacement, where sample values are not independent. In other words, what we get in the first sample affects what we can get for the second sample, and so on. We use the popular variant of poker known as No Limit Texas Hold'em to illustrate.
Dan Bretheim, Towers Watson
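The Texas Hold'em illustration above amounts to drawing cards from a deck without putting them back, which maps directly onto METHOD=SRS in PROC SURVEYSELECT (SRS samples without replacement, so no card can be dealt twice). A minimal sketch:

```sas
/* Build a 52-card deck: 4 suits x 13 ranks */
data deck;
  do suit = 'C', 'D', 'H', 'S';
    do rank = 1 to 13;
      output;
    end;
  end;
run;

/* Deal a 5-card hand without replacement */
proc surveyselect data=deck out=hand
                  method=srs sampsize=5 seed=20160418;
run;

proc print data=hand noobs; run;
```

Replacing METHOD=SRS with METHOD=URS (unrestricted, with replacement) shows the contrast: the same card could then appear in the hand more than once.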
Missing data is something that we cannot always prevent. Data can be missing because subjects refuse to answer a sensitive question or fear embarrassment. Researchers often assume that their data are missing completely at random or missing at random. Unfortunately, we cannot test whether the mechanism assumption is satisfied, because the missing values themselves cannot be observed. Alternatively, we can run simulations in SAS® to observe the behavior of missing data under different assumptions: missing completely at random, missing at random, and ignorable missingness. We compare the effects of different imputation methods after setting a chosen set of variables of interest to missing. The idea is to see how the substituted values in a data set affect further studies. This lets the audience decide which method(s) would be the best approach to a data set when it comes to missing data.
Danny Rithy, California Polytechnic State University
Soma Roy, California Polytechnic State University
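The simulate-then-impute workflow described above can be sketched in two steps: impose a known missingness mechanism on complete data, then impute. A minimal MCAR sketch (the input data set and variable names are hypothetical, and PROC MI is named here as one common choice, not necessarily the authors' method):

```sas
/* Impose 20% missingness completely at random on x */
data mcar;
  set complete;                     /* hypothetical complete data set */
  call streaminit(2016);
  if rand('uniform') < 0.2 then x = .;
run;

/* Multiply impute the now-missing values */
proc mi data=mcar out=imputed nimpute=5 seed=2016;
  var x y z;
run;
```

Making the deletion probability depend on an observed variable (for example, on y) turns the mechanism into missing at random, allowing the same comparison under a different assumption.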
This paper describes a Kaiser Permanente Northwest business problem regarding tracking recent inpatient hospital utilization at external hospitals, and how it was solved with the flexibility of SAS® Enterprise Guide®. The Inpatient Indicator is an estimate of our regional inpatient hospital utilization as of yesterday. It tells us which of our members are in which hospitals. It measures inpatient admissions, which are health care interactions where a patient is admitted to a hospital for bed occupancy to receive hospital services. The Inpatient Indicator is used to produce data and create metrics and analysis essential to the decision making of Kaiser Permanente executives, care coordinators, patient navigators, utilization management physicians, and operations managers. Accurate, recent hospital inpatient information is vital for decisions regarding patient care, staffing, and member utilization. Due to a business policy change, Kaiser Permanente Northwest lost the ability to track urgent and emergent inpatient admits at external, non-plan hospitals through our referral system, which was our data source for all recent external inpatient admits. Without this information, we did not have complete knowledge of whether a member had an inpatient stay at an external hospital until a claim was received, which could be several weeks after the member was admitted. Other sources were needed to understand our inpatient utilization at external hospitals. A tool was needed with the flexibility to easily combine and compare multiple data sets with different field names, formats, and values representing the same metric. The tool needed to be able to import data from different sources and export data to different destinations. We also needed a tool that would allow this project to be scheduled. We chose to build the model with SAS Enterprise Guide.
Thomas Gant, Kaiser Permanente
Big data is often distinguished as encompassing high volume, high velocity, or high variability of data. While big data can signal big business intelligence and big business value, it can also wreak havoc on systems and software ill-prepared for its profundity. Scalability describes the ability of a system or software to adequately meet the needs of additional users or its ability to use additional processors or resources to fulfill those added requirements. Scalability also describes the adequate and efficient response of a system to increased data throughput. Because sorting data is one of the most common and most resource-intensive operations in any software language, inefficiencies or failures caused by big data often are first observed during sorting routines. Much SAS® literature has been dedicated to optimizing big data sorts for efficiency, including minimizing execution time and, to a lesser extent, minimizing resource usage (that is, memory and storage consumption). However, less attention has been paid to implementing big data sorting that is reliable and robust even when confronted with resource limitations. To that end, this text introduces the SAFESORT macro, which facilitates a priori exception-handling routines (which detect environmental and data set attributes that could cause process failure) and post hoc exception-handling routines (which detect actual failed sorting routines). If exception handling is triggered, SAFESORT automatically reroutes program flow from the default sort routine to a less resource-intensive routine, thus sacrificing execution speed for reliability. Moreover, macro modularity enables developers to select their favorite sort procedure and, for data-driven disciples, to build fuzzy logic routines that dynamically select a sort algorithm based on environmental and data set attributes.
Troy Hughes, Datmesis Analytics
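The post hoc half of the exception-handling pattern described above can be sketched as a macro that checks the automatic status macro variable after the default sort and, on failure, reroutes to a less memory-hungry routine. This is a stripped-down illustration of the idea, not the actual SAFESORT macro, and it assumes an interactive session where a failed step does not abort the program:

```sas
%macro safesort(data=, by=);
  /* Default attempt: ordinary PROC SORT */
  proc sort data=&data; by &by; run;

  /* Post hoc exception handling: on failure, fall back to TAGSORT,
     which trades execution speed for a much smaller memory footprint */
  %if &syserr ne 0 %then %do;
    %put NOTE: Default sort failed (SYSERR=&syserr). Retrying with TAGSORT.;
    proc sort data=&data tagsort; by &by; run;
  %end;
%mend safesort;

%safesort(data=work.bigdata, by=id date);    /* hypothetical data set */
```

The a priori routines in the paper go further, inspecting environmental and data set attributes before the first attempt rather than reacting to a failure.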
So, you've encrypted your data. But is it safe? What would happen if that anonymous data you've shared isn't as anonymous as you think? Senior SAS® Consultant Andy Smith from Amadeus Software discusses the approaches hackers take to break encryption, and he shares simple, practical techniques to foil their efforts.
Andy Smith, Amadeus Software Limited
A survey is often the best way to obtain information and feedback to help in program improvement. At a bare minimum, survey research draws on economics, statistics, and psychology in order to develop a theory of how surveys measure and predict important aspects of the human condition. From healthcare-related research papers written by experts in the industry, I hypothesize several factors that are important and necessary for patients during hospital visits. Through text mining using SAS® Enterprise Miner™ 12.1, I measure tangible aspects of hospital quality and patient care by using online survey responses and comments found on the website of the National Health Service in England. I use these patient comments to determine the majority opinion and whether it correlates with expert research on hospital quality. The implications of this research are vital in comprehending our health-care system and the factors that are important in satisfying patient needs. Analyzing survey responses can help us to understand and mitigate disparities in health-care services provided to population subgroups in the United States. Starting with online survey responses can help us to understand the overall methodology and motivation needed to analyze the surveys that have been developed for departments like the Centers for Medicare and Medicaid Services (CMS).
Divya Sridhar, Deloitte
The HPSUMMARY procedure provides data summarization tools to compute basic descriptive statistics for variables in a SAS® data set. It is a high-performance version of the SUMMARY procedure in Base SAS®. Though PROC SUMMARY is popular with data analysts, PROC HPSUMMARY is still the new kid on the block. The purpose of this paper is to provide an introduction to PROC HPSUMMARY by comparing it with its well-known counterpart, PROC SUMMARY. The comparison focuses on differences in syntax, options, and performance in terms of execution time and memory usage. Sample code, outputs, and SAS log snippets are provided to illustrate the discussion. Simulated large data is used to observe the performance of the two procedures. Using SAS® 9.4 installed on a single-user machine with four cores available, preliminary experiments show that HPSUMMARY is more memory-efficient than SUMMARY when the data set is large (for example, SUMMARY failed due to insufficient memory where HPSUMMARY finished successfully). However, there is no evidence of an execution-time advantage of HPSUMMARY over SUMMARY on this single-user machine.
Anh Kellermann, University of South Florida
Jeff Kromrey, University of South Florida
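As the paper above notes, the two procedures are nearly interchangeable in syntax, which makes the comparison straightforward. A side-by-side sketch (the data set and variable names are hypothetical):

```sas
/* Base SAS procedure */
proc summary data=big mean std min max;
  var x1-x10;
  class group;
  output out=sum_stats;
run;

/* High-performance counterpart: same statements, different engine */
proc hpsummary data=big mean std min max;
  var x1-x10;
  class group;
  output out=hpsum_stats;
run;
```

Because only the procedure name changes, existing PROC SUMMARY jobs can be trialed under HPSUMMARY with minimal edits when memory becomes the bottleneck.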
Time flies. Thirteen years have passed since the first introduction of the hash method by SAS®. Dozens of papers have been written offering new (sometimes weird) applications for hash. Even beyond look-ups, we can use the hash object for summation, splitting files, sorting files, fuzzy matching, and other data processing problems. In SAS, almost any task can be solved by more than one method. For a class of problems where hash can be applied, is it the best tool to use? We present a variety of problems and provide several solutions, using both hash tables and more traditional methods. Comparing the solutions, you must consider computing time as well as your time to code and validate your responses. Which solution is better? You decide!
David Izrael, Abt Associates
Elizabeth Axelrod, Abt Associates
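The classic hash use case mentioned above, a table look-up, can be sketched as follows (the data set and variable names are hypothetical, not drawn from the paper):

```sas
/* Hash look-up: attach a customer name to each transaction */
data matched;
  length cust_name $40;
  if _n_ = 1 then do;
    declare hash h(dataset: 'work.customers');   /* hypothetical lookup table */
    h.defineKey('cust_id');
    h.defineData('cust_name');
    h.defineDone();
    call missing(cust_name);
  end;
  set work.transactions;                         /* hypothetical detail data */
  if h.find() = 0 then output;                   /* 0 means the key was found */
run;
```

The traditional alternative is a sort of both data sets followed by a MERGE with BY; the hash version avoids the sorts at the cost of holding the lookup table in memory, which is exactly the kind of trade-off the paper weighs.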
This SAS® test begins where the SAS® Advanced Certification test leaves off. It has 25 questions to challenge even the most experienced SAS programmers. Most questions are about the DATA step, but there are a few macro and SAS/ACCESS® software questions thrown in. The session presents each question on a slide, then the answer on the next. You WILL be challenged.
Glen Becker, USAA
This paper shows how SAS® can be used to obtain a Time Series Analysis of data regarding World War II. This analysis tests whether Truman's justification for the use of atomic weapons was valid. Truman believed that by using the atomic weapons, he would prevent unacceptable levels of U.S. casualties that would be incurred in the course of a conventional invasion of the Japanese islands.
Rachael Becker, University of Central Florida
Administrative health databases, including hospital and physician records, are frequently used to estimate the prevalence of chronic diseases. Disease-surveillance information is used by policy makers and researchers to compare the health of populations and develop projections about disease burden. However, not all cases are captured by administrative health databases, which can result in biased estimates. Capture-recapture (CR) models, originally developed to estimate the sizes of animal populations, have been adapted for use by epidemiologists to estimate the total sizes of disease populations for such conditions as cancer, diabetes, and arthritis. Estimates of the number of cases are produced by assessing the degree of overlap among incomplete lists of disease cases captured in different sources. Two- and three-source CR models are most commonly used, often with covariates. Two important assumptions that underlie conventional CR models, independence of capture in each data source and homogeneity of capture probabilities, are unlikely to hold in epidemiological studies. Failure to satisfy these assumptions biases the model results. Log-linear, multinomial logistic regression, and conditional logistic regression models, if used properly, can incorporate dependency among sources and covariates to model the effect of heterogeneity in capture probabilities. However, none of these models is optimal, and researchers might be unfamiliar with how to use them in practice. This paper demonstrates how to use SAS® to implement the log-linear, multinomial logistic regression, and conditional logistic regression CR models. Methods to address the assumptions of independence between sources and homogeneity of capture probabilities for a three-source CR model are provided. The paper uses a real numeric data set about Parkinson's disease involving physician claims, hospital abstract, and prescription drug records from one Canadian province.
Advantages and disadvantages of each model are discussed.
Lisa Lix, University of Manitoba
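The log-linear CR model described above fits a Poisson regression to the counts of the observed capture profiles; with three sources there are 2^3 - 1 = 7 observed profiles, and the all-zeros cell (cases missed by every source) is what the model estimates. A minimal sketch with fabricated counts for illustration (not the paper's Parkinson's data):

```sas
/* Capture profiles: 1 = case appears in that source, 0 = it does not */
data profiles;
  input physician hospital drug count;
  datalines;
1 1 1  42
1 1 0 115
1 0 1  61
1 0 0 208
0 1 1  27
0 1 0  98
0 0 1  85
;
run;

proc genmod data=profiles;
  /* Pairwise interaction terms relax the source-independence assumption */
  model count = physician hospital drug
                physician*hospital physician*drug hospital*drug
        / dist=poisson link=log;
run;
```

The model's prediction for the unobserved (0,0,0) profile estimates the number of cases missed by all three sources, which, added to the observed total, gives the population size estimate.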
Someone has aptly said, "Las Vegas looks the way one would imagine heaven must look at night." What if you knew the secret to running a plethora of various businesses in the entertainment capital of the world? Nothing better, right? Well, we have what you want: all the necessary ingredients for you to know precisely which business to target in a particular locality of Las Vegas. Yelp, a community portal, wants to help people find great local businesses. They cover almost everything from dentists and hair stylists to mechanics and restaurants. Yelp's users, Yelpers, write reviews and give ratings for all types of businesses. Yelp then uses this data to make recommendations to Yelpers about which establishments best fit their individual needs. We use the Yelp academic data set, comprising 1.6 million reviews and 500K tips by 366K users for 61K businesses across several cities. We combine current Yelp data from all the various data sets for Las Vegas to create an interactive map that provides an overview of how businesses perform in each locality and how ratings and reviews affect a business. We answer the following questions: Where is the most appropriate neighborhood to start a new business (such as cafes, bars, and so on)? Which category of business has the greatest total count of reviews, that is, the most talked-about (trending) business in Las Vegas? How do a business's working hours affect customer reviews and the corresponding rating of the business? Our findings present research for further understanding of the perceptions of various users when giving reviews and ratings, and of the growth of a business, encompassing a variety of topics in data mining and data visualization.
Anirban Chakraborty, Oklahoma State University
Many SAS® procedures can be used to analyze large amounts of correlated data. This study was a secondary analysis of data obtained from the South Carolina Revenue and Fiscal Affairs Office (RFA). The data includes medical claims from all health care systems in South Carolina (SC). This study used the SAS procedure GENMOD to analyze a large amount of correlated data about Military Health System (MHS) beneficiaries who received behavioral health care in South Carolina health care systems from 2005 to 2014. Behavioral health (BH) was defined by Major Diagnostic Code (MDC) 19 (mental disorders and diseases) and 20 (alcohol/drug use). MDCs are formed by dividing all possible principal diagnoses from the International Classification of Diseases, Ninth Revision (ICD-9) codes into 25 mutually exclusive diagnostic categories. The sample included a total of 6,783 BH visits and 4,827 unique adult and child patients, including military service members, veterans, and their adult and child dependents with MHS insurance coverage. PROC GENMOD was used to fit a multivariate generalized estimating equations (GEE) model with type of BH visit (mental health or substance abuse) as the dependent variable, and with gender, race group, age group, and discharge year as predictors. Hospital ID was used in the REPEATED statement with different correlation structures. Gender was significant with both the independent correlation structure (p = .0001) and the exchangeable structure (p = .0003). However, age group was significant using the independent correlation structure (p = .0160), but non-significant using the exchangeable correlation structure (p = .0584). SAS is a powerful statistical program for analyzing large amounts of correlated data with categorical outcomes.
Abbas Tavakoli, University of South Carolina
Jordan Brittingham, USC/ Arnold School of Public Health
Nikki R. Wooten, USC/College of Social Work
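The GEE model described above might be coded along the following lines; the data set and variable names are assumptions, not the study's actual code.

```sas
/* Sketch of a GEE model for a binary outcome (mental health vs.
   substance abuse visit), clustered by hospital. Swapping type=exch
   for type=ind reproduces the two correlation structures compared
   in the abstract. */
proc genmod data=bh_visits descending;
   class hospital_id gender race_grp age_grp dc_year;
   model visit_type = gender race_grp age_grp dc_year
         / dist=binomial link=logit;
   repeated subject=hospital_id / type=exch;
run;
```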
Stephen Curry, James Harden, and LeBron James are considered to be three of the most gifted professional basketball players in the National Basketball Association (NBA). Each year the Kia Most Valuable Player (MVP) award is given to the best player in the league. Stephen Curry currently holds this title, followed by James Harden and LeBron James, the two runners-up. The decision for MVP was made by a panel of judges composed of 129 sportswriters and broadcasters, along with fans who were able to cast their votes through NBA.com. Did the judges make the correct decision? Is there statistical evidence that indicates that Stephen Curry is indeed deserving of this prestigious title over James Harden and LeBron James? Is there a significant difference between the two runners-up? These are some of the questions that are addressed through this project. Using data collected from NBA.com for the 2014-2015 season, a variety of parametric and nonparametric k-sample methods were used to test 20 quantitative variables. In an effort to determine which of the three players is the most deserving of the MVP title, post hoc comparisons were also conducted on the variables that were shown to be significant. The time-dependent variables were standardized, because there was a significant difference in the number of minutes each athlete played. These variables were then tested and compared with those that had not been standardized. This led to significantly different outcomes, indicating that the results of the tests could be misleading if the time variable is not taken into consideration. Using the standardized variables, the results of the analyses indicate that there is a statistically significant difference in the overall performances of the three athletes, with Stephen Curry outplaying the other two players. However, the difference between James Harden and LeBron James is not so clear.
Sherrie Rodriguez, Kennesaw State University
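The per-minute standardization and k-sample testing described above might be sketched as follows; the data set and variable names are assumptions, and PROC GLM and PROC NPAR1WAY stand in for the unnamed parametric and nonparametric methods.

```sas
/* Standardize a time-dependent statistic to a per-minute rate. */
data per_min;
   set game_stats;          /* hypothetical game-level data */
   pts_per_min = points / minutes;
run;

/* Parametric k-sample test (one-way ANOVA) with post hoc
   pairwise comparisons among the three players. */
proc glm data=per_min;
   class player;
   model pts_per_min = player;
   means player / tukey;
run;

/* Nonparametric counterpart: Kruskal-Wallis test. */
proc npar1way data=per_min wilcoxon;
   class player;
   var pts_per_min;
run;
```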
Marriage laws have changed in 38 states, including the District of Columbia, permitting same-sex couples to marry since 2004. However, since 1996, 13 states have banned same-sex marriages, some of which are under legal challenge. A U.S. animated map, made using the SAS® GMAP procedure in SAS® 9.4, shows the changes in the acceptance, rejection, or undecided legal status of this issue, year-by-year. In this study, we also present other SAS data visualization techniques to show how concentrations (percentages) of married same-sex couples and their relationships have evolved over time, thereby instigating changes in marriage laws. Using a SAS GRAPH display, we attempt to show how both the total number of same-sex-couple households and the percent that are married have changed on a state-by-state basis since 2005. These changes are also compared to the year in which marriage laws permitting same-sex marriages were enacted, and are followed over time since that year. It is possible to examine trends on an annual basis from 2005 to 2013 using the American Community Survey one-year estimates. The SAS procedures and SAS code used to create the data visualization and data manipulation techniques are provided.
Bula Ghose, HHS/ASPE
Susan Queen, National Center for Health Statistics
Joan Turek, Health and Human Services
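A year-by-year animated choropleth of the kind described can be produced by pairing the GIFANIM device driver with a BY statement in PROC GMAP; the file, data set, variable names, and timing options below are illustrative, not the authors' code.

```sas
/* Write one GIF frame per year and assemble them into an animation. */
goptions device=gifanim gsfname=out gsfmode=replace
         delay=100 iteration=0;   /* 1-second frames, loop forever */
filename out 'samesex_laws.gif';

/* One row per state per year; status = accepted / banned / undecided */
proc gmap map=maps.us data=law_status all;
   id state;
   choro status / discrete coutline=black;
   by year;                       /* each BY group becomes a frame */
run;
quit;
```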
The LACE readmission risk score is a methodology used by Kaiser Permanente Northwest (KPNW) to target and customize readmission-prevention strategies for patients admitted to the hospital. This presentation shares how KPNW used SAS® in combination with Epic's Datalink to integrate the LACE score into its electronic health record (EHR) for use in real time. The LACE score is an objective measure composed of four components: L) length of stay; A) acuity of admission; C) pre-existing comorbidities; and E) emergency department (ED) visits in the prior six months. SAS was used to perform complex calculations and combine data from multiple sources (which the EHR alone could not do), and then to calculate a score that was integrated back into the EHR. The technical approach includes a trigger macro to kick off the process once the database ETL completes, several explicit and implicit PROC SQL statements, a volatile temp table for filtering, and a series of SORT, MEANS, TRANSPOSE, and EXPORT procedures. We walk through the technical approach taken to generate and integrate the LACE score into Epic, describe the challenges we faced and how we overcame them, and share the beneficial results we have gained throughout the process.
Delilah Moore, Kaiser Permanente
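For orientation, a DATA step sketch of LACE scoring is shown below, using the point values of the published LACE index; the input data set and variable names are assumptions, and this is not KPNW's actual implementation.

```sas
data lace;
   set admissions;   /* hypothetical admission-level data */

   /* L: length of stay in days */
   if los < 1 then l_pts = 0;
   else if los = 1 then l_pts = 1;
   else if los = 2 then l_pts = 2;
   else if los = 3 then l_pts = 3;
   else if 4 <= los <= 6 then l_pts = 4;
   else if 7 <= los <= 13 then l_pts = 5;
   else l_pts = 7;

   /* A: 3 points for an acute/emergent admission */
   a_pts = 3 * (acute_adm = 1);

   /* C: Charlson comorbidity score (0-3 as is; 4 or more = 5 points) */
   if charlson >= 4 then c_pts = 5;
   else c_pts = charlson;

   /* E: ED visits in the prior six months, capped at 4 points */
   e_pts = min(ed_visits, 4);

   lace_score = l_pts + a_pts + c_pts + e_pts;
run;
```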
A number of SAS® tools can be used to report data, such as the PRINT, MEANS, TABULATE, and REPORT procedures. The REPORT procedure is a single tool that can produce many of the same results as the other SAS tools. Not only can it create detailed reports as PROC PRINT can, but it can also summarize and calculate data as the MEANS and TABULATE procedures do. Unfortunately, despite its power, PROC REPORT seems to be used less often than the other tools, possibly due to its seemingly complex coding. This paper uses PROC REPORT and the Output Delivery System (ODS) to export a big data set into a customized XML file that a user who is not familiar with SAS can easily read. Several options for the COLUMN, DEFINE, and COMPUTE statements are shown that enable you to present your data in a more colorful way. We show how to control the format of selected columns and rows, how to make column headings more meaningful, and how to color selected cells differently to bring attention to the most important data.
Guihong Chen, TCF Bank
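A minimal sketch of the technique: PROC REPORT with a COMPUTE block for cell-level coloring, routed through ODS TAGSETS.EXCELXP to produce an XML file readable outside SAS. The data set, cutoff, and styling are illustrative, not the paper's example.

```sas
ods tagsets.excelxp file='summary.xml' options(sheet_name='Summary');

proc report data=sashelp.cars nowd;
   column make type mpg_city;
   define make     / group 'Manufacturer';
   define type     / group 'Type';
   define mpg_city / analysis mean format=5.1 'Avg City MPG';

   /* Highlight low-mileage cells (cutoff chosen for illustration). */
   compute mpg_city;
      if mpg_city.mean < 18 then
         call define(_col_, 'style', 'style={background=lightred}');
   endcomp;
run;

ods tagsets.excelxp close;
```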
The SURVEYSELECT procedure is useful for sample selection for a wide variety of applications. This paper presents the application of PROC SURVEYSELECT to a complex scenario involving a state-wide educational testing program, namely, drawing interdependent stratified cluster samples of schools for the field-testing of test questions, which we call items. These stand-alone field tests are given to only small portions of the testing population and as such, a stratified procedure was used to ensure representativeness of the field-test samples. As the field test statistics for these items are evaluated for use in future operational tests, an efficient procedure is needed to sample schools, while satisfying pre-defined sampling criteria and targets. This paper provides an adaptive sampling application and then generalizes the methodology, as much as possible, for potential use in more industries.
Chuanjie Liao, Pearson
Brian Patterson, Pearson
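A simplified sketch of a stratified cluster selection with PROC SURVEYSELECT follows; the strata, cluster variable, and sample sizes are hypothetical, and the paper's adaptive, interdependent procedure is considerably more involved.

```sas
/* Select 25 schools per stratum, keeping each school intact as a
   cluster, so every sampled school administers the same field-test
   items. */
proc surveyselect data=schools out=ft_sample
      method=srs n=25 seed=20160405;
   strata region gradeband;   /* pre-defined sampling strata */
   cluster school_id;         /* schools selected as whole clusters */
run;
```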
One of the truths about SAS® is that there are, at a minimum, three approaches to achieve any intended task, and each approach has its own pros and cons. Identifying and using efficient SAS programming techniques is recommended, and efficient programming becomes mandatory when larger data sets are accessed. This paper describes the efficiency of virtual access to data sets, and various situations in which to use it, through the OPEN, FETCH, and CLOSE functions with %SYSFUNC and %DO loops. There are several ways to get data set attributes such as the number of observations, the number of variables, the types of variables, and so on. For larger data sets, it is more efficient to access these attributes with the OPEN function and %SYSFUNC. The FETCH function, used with OPEN, provides access to the values of the variables in a SAS data set for conditional and iterative execution with %DO loops. Virtual access of a SAS data set becomes more efficient on larger data sets in the following situations. Situation 1: It is often necessary to split a master data set into several pieces depending on a certain criterion for batch submission, where the resulting data sets must be independent. Situation 2: Many reports, dashboards, and array processes require keeping the relevant variables together instead of maintaining the original data set order. Situation 3: Most statistical analyses, particularly correlation and regression, rely on widely and frequently used SAS procedures that require a dynamic variable list to be passed. Creating a single macro variable list might run into the macro variable length limit on larger transactional data sets. Instead of creating several macro variables, it is recommended that you keep the list of variables as a data set and access it using the proposed approach.
Amarnath Vijayarangan, Emmes Services Pvt Ltd, India
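The OPEN/FETCH/CLOSE pattern described above can be sketched as a small macro; the macro name and the choice of demonstration data set are illustrative, not the %threeN implementation.

```sas
/* Read data set attributes and row values without a DATA step. */
%macro attrs(ds);
   %local dsid nobs nvars val rc;
   %let dsid  = %sysfunc(open(&ds));            /* open the data set  */
   %let nobs  = %sysfunc(attrn(&dsid, nlobs));  /* observation count  */
   %let nvars = %sysfunc(attrn(&dsid, nvars));  /* variable count     */
   %put NOTE: &ds has &nobs observations and &nvars variables.;

   /* Fetch each row in turn; read the first (character) variable. */
   %do %while (%sysfunc(fetch(&dsid)) = 0);
      %let val = %sysfunc(getvarc(&dsid, 1));
      %put &val;
   %end;

   %let rc = %sysfunc(close(&dsid));            /* always close       */
%mend attrs;

%attrs(sashelp.class)
```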
Few data visualizations are as striking or as useful as heat maps, and fortunately there are many applications for them. A heat map displaying eye-tracking data is a particularly potent example: the intensity of a viewer's gaze is quantified and superimposed over the image being viewed, and the resulting data display is often stunning and informative. This paper shows how to use Base SAS® to prepare, smooth, and transform eye-tracking data and ultimately render it on top of a corresponding image. By customizing a graphical template and using specific Graph Template Language (GTL) options, a heat map can be drawn precisely so that the user maintains pixel-level control. In this talk, and in the related paper, eye-tracking data is used primarily, but the techniques provided are easily adapted to other fields such as sports, weather, and marketing.
Matthew Duchnowski, Educational Testing Service
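A skeletal Graph Template Language template along the lines described is shown below, using HEATMAPPARM for the smoothed gaze densities over a background image placed with DRAWIMAGE; the data set, image file, and column names are assumptions.

```sas
proc template;
   define statgraph eyeheat;
      begingraph;
         /* Stimulus image behind the plot area (hypothetical file). */
         drawimage 'stimulus.png' / drawspace=wallpercent
            x=50 y=50 width=100 anchor=center layer=back;
         layout overlay / xaxisopts=(display=none)
                          yaxisopts=(display=none);
            /* One cell per (x, y) pixel bin of smoothed gaze data. */
            heatmapparm x=xpix y=ypix colorresponse=density /
               colormodel=(blue green yellow red);
         endlayout;
      endgraph;
   end;
run;

proc sgrender data=gaze template=eyeheat;
run;
```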
Making a data delivery to a client is a complicated endeavor. There are many aspects that must be carefully considered and planned for: de-identification, public use versus restricted access, documentation, ancillary files such as programs, formats, and so on, and methods of data transfer, among others. This paper provides a blueprint for planning and executing your data delivery.
Louise Hadden, Abt Associates
In the world of big data, real-time processing and event stream processing are becoming the norm. However, there are not many tools available today that can do this type of processing. SAS® Event Stream Processing aims to process this data. In this paper, we look at using SAS Event Stream Processing to read multiple data sets stored in big data platforms such as Hadoop and Cassandra in real time and to perform transformations on the data such as joining data sets, filtering data based on preset business rules, and creating new variables as required. We look at how we can score the data based on a machine learning algorithm. This paper shows you how to use the provided Hadoop Distributed File System (HDFS) publisher and subscriber to read and push data to Hadoop. The HDFS adapter is discussed in detail. We look at the Streamviewer to see how data flows through SAS Event Stream Processing.
Krishna Sai Kishore Konudula, Kavi Associates