SAS Global Forum 2017 Proceedings

With the increasing amount of educational data, educational data mining has become increasingly important for uncovering hidden patterns within institutional data to support institutional decision making (Luan 2012). However, few studies have applied educational data mining to institutional decision support. At the University of Connecticut (UCONN), organic chemistry is a required course for undergraduate students in a STEM discipline. It has a very high DFW rate (D=Drop, F=Failure, W=Withdraw). Take Fall 2014 as an example: the average DFW% for the Organic Chemistry lectures was 24% at UCONN, and over 1200 students were enrolled in this class. In this study, data on undergraduate students enrolled during School Year 2010-2011 were used to build the model. The purpose of this study was to predict future student success in order to improve the quality of education at our institution. The Sample, Explore, Modify, Model, and Assess (SEMMA) method introduced by SAS was applied to develop the predictive model. Freshman SAT scores, campus, semester GPA, financial aid, and other factors were used to predict students' performance in this course. In the predictive modeling process, several modeling techniques (decision tree, neural network, ensemble models, and logistic regression) were compared in order to find the best one for our institution.

Read the paper (PDF)

The Consumer Financial Protection Bureau (CFPB) collects tens of thousands of complaints against companies each year, many of which result in the companies in question taking action, including making payouts to the individuals who filed the complaints. Given the volume of the complaints, how can an overseeing organization quantitatively assess the data for various trends, including the areas of greatest concern for consumers? In this presentation, we propose a repeatable model for applying text analytics techniques to the publicly available CFPB data. Specifically, we use SAS^® Contextual Analysis to explore sentiment, and machine learning techniques to model the natural language available in each free-form complaint against a disposition code for the complaint, primarily focusing on whether a company paid out money. This process generates a taxonomy in an automated manner. We also explore methods to structure and visualize the results, showcasing how areas of concern are made available to analysts using SAS^® Visual Analytics and SAS^® Visual Statistics. Finally, we discuss the applications of this methodology for overseeing government agencies and financial institutions alike.

Read the paper (PDF)

Randomized control trials have long been considered the gold standard for establishing causal treatment effects. Can causal effects be reasonably estimated from observational data too? In observational studies, you observe treatment T and outcome Y without controlling confounding variables that might explain the observed associations between T and Y. Estimating the causal effect of treatment T therefore requires adjustments that remove the effects of the confounding variables. The new CAUSALTRT (causal-treat) procedure in SAS/STAT^® 14.2 enables you to estimate the causal effect of a treatment decision by modeling either the treatment assignment T or the outcome Y, or both. Specifically, modeling the treatment leads to the inverse probability weighting methods, and modeling the outcome leads to the regression methods. Combined modeling of the treatment and outcome leads to doubly robust methods that can provide unbiased estimates for the treatment effect even if one of the models is misspecified. This paper reviews the statistical methods that are implemented in the CAUSALTRT procedure and includes examples of how you can use this procedure to estimate causal effects from observational data. This paper also illustrates some other important features of the CAUSALTRT procedure, including bootstrap resampling, covariate balance diagnostics, and statistical graphics.
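
The two estimator families named above can be sketched in a few lines. This is a minimal illustration in Python, not the CAUSALTRT procedure itself: it assumes the propensity scores `ps` and the outcome-model predictions `mu1` and `mu0` have already been estimated elsewhere, and all function and variable names are hypothetical.

```python
def ipw_ate(t, y, ps):
    """Inverse probability weighting estimate of the average treatment effect.

    t  : treatment indicators (1 = treated, 0 = control)
    y  : observed outcomes
    ps : assumed-known propensity scores P(T = 1 | covariates), each in (0, 1)
    """
    n = len(y)
    # Weight each treated outcome by 1/ps and each control outcome by 1/(1 - ps).
    treated = sum(ti * yi / pi for ti, yi, pi in zip(t, y, ps)) / n
    control = sum((1 - ti) * yi / (1 - pi) for ti, yi, pi in zip(t, y, ps)) / n
    return treated - control


def aipw_ate(t, y, ps, mu1, mu0):
    """Augmented IPW (doubly robust) estimate of the average treatment effect.

    mu1, mu0 : assumed outcome-model predictions for each subject under
               treatment and under control, respectively.
    """
    n = len(y)
    total = 0.0
    for ti, yi, pi, m1, m0 in zip(t, y, ps, mu1, mu0):
        # Outcome-model prediction plus a weighted residual correction.
        potential_treated = m1 + ti * (yi - m1) / pi
        potential_control = m0 + (1 - ti) * (yi - m0) / (1 - pi)
        total += potential_treated - potential_control
    return total / n


if __name__ == "__main__":
    t = [1, 0, 1, 0]
    y = [3.0, 1.0, 5.0, 2.0]
    ps = [0.5, 0.5, 0.5, 0.5]
    print(ipw_ate(t, y, ps))
    print(aipw_ate(t, y, ps, mu1=[4.0] * 4, mu0=[1.5] * 4))
```

The augmented estimator combines both models: the weighted residual term corrects the outcome model when it is wrong, and the outcome model stabilizes the weights when the treatment model is wrong, which is the intuition behind the doubly robust property.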

Read the paper (PDF)

Given the proposed budget cuts to higher education in the state of Kentucky, public universities will likely be awarded financial appropriations based on several performance metrics. The purpose of this project was to conceptualize, design, and implement predictive models that addressed two of the state's metrics: six-year graduation rate and fall-to-fall persistence for freshmen. The Western Kentucky University (WKU) Office of Institutional Research analyzed five years' worth of data on first-time, full-time bachelor's degree seeking students. Two predictive models evaluated and scored current students on their likelihood to stay enrolled and their chances of graduating on time. Following an ensemble of machine-learning assessments, the scored data were imported into SAS^® Visual Analytics, where interactive reports allowed users to easily identify which students were at a high risk for attrition or at risk of not graduating on time.

Read the paper (PDF)

Longitudinal count data arise when a subject's outcomes are measured repeatedly over time. Repeated measures count data have an inherent within-subject correlation that is commonly modeled with random effects in the standard Poisson regression. A Poisson regression model with random effects is easily fit in SAS^® using existing options in the NLMIXED procedure. This model allows for overdispersion via the nature of the repeated measures; however, departures from equidispersion can also exist due to the underlying count process mechanism. We present an extension of the cross-sectional COM-Poisson (CMP) regression model established by Sellers and Shmueli (2010), a generalized regression model for count data in light of inherent data dispersion, to incorporate random effects for analysis of longitudinal count data. We detail how to fit the CMP longitudinal model via a user-defined log-likelihood function in PROC NLMIXED. We demonstrate the flexibility of the CMP longitudinal model via simulated and real data examples.
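
To make the idea of a user-defined log-likelihood concrete, here is a minimal Python sketch of the cross-sectional CMP log-likelihood only. It does not include the random effects that the paper adds, and it is not the paper's NLMIXED code. It assumes the standard CMP form P(Y = y) = lam^y / ((y!)^nu * Z(lam, nu)) and approximates the infinite normalizing series Z by truncation; the function names and the `terms` cutoff are hypothetical choices.

```python
import math


def cmp_log_z(lam, nu, terms=200):
    """Log of the truncated normalizing series Z = sum_j lam**j / (j!)**nu.

    The truncation point `terms` is an assumption; it may need to be larger
    when lam is large and nu is small, because the series then decays slowly.
    """
    logs = [j * math.log(lam) - nu * math.lgamma(j + 1) for j in range(terms)]
    m = max(logs)
    # Log-sum-exp keeps the sum numerically stable.
    return m + math.log(sum(math.exp(v - m) for v in logs))


def cmp_log_pmf(y, lam, nu, terms=200):
    """Log probability of count y under CMP(lam, nu)."""
    return y * math.log(lam) - nu * math.lgamma(y + 1) - cmp_log_z(lam, nu, terms)


def cmp_loglik(ys, lam, nu, terms=200):
    """Log-likelihood of independent counts ys under a common CMP(lam, nu)."""
    return sum(cmp_log_pmf(y, lam, nu, terms) for y in ys)


if __name__ == "__main__":
    # With nu = 1 the CMP distribution reduces to the Poisson distribution.
    poisson = 3 * math.log(2.5) - math.lgamma(4) - 2.5
    print(cmp_log_pmf(3, 2.5, 1.0), poisson)
```

The dispersion parameter nu controls the departure from equidispersion: nu = 1 gives the Poisson case, nu < 1 gives overdispersion, and nu > 1 gives underdispersion.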

Read the paper (PDF)

Higher education institutions have a plethora of analytical needs. However, the irregular and inconsistent practices in connecting those needs with appropriate analytical delivery systems have resulted in a patchwork. This patchwork sometimes overlaps unnecessarily and sometimes exposes unaddressed gaps. The purpose of this paper is to examine a framework of components for addressing institutional analytical needs, while leveraging existing institutional strengths to maximize analytical goal attainment most effectively and efficiently. The core of this paper is a focused review of components for attaining greater analytical strength and goal attainment in the institution.

Read the paper (PDF)

Having crossed the spectrum from an epidemiologist and researcher (where ad hoc is a way of life and where research is the main focus) to a SAS^® programmer (writing reusable code for automation and batch jobs, which require no manual interventions), I have learned a few things that I wish I had known as a researcher. These things would not only have helped me to be a better SAS programmer, but they also would have saved me time and effort as a researcher by enabling me to have well-organized, accurate code (that I didn't accidentally remove) and code that would work when I ran it again on another date. This poster presents five SAS tips that are common practice among SAS programmers. I provide researchers who use SAS with tips that are handy and useful, and I provide code (where applicable) that they can try out at home. Using the tips provided will make any SAS programmer smile when they are presented with your code (not guaranteed, but your results should not vary by using these tips).

View the e-poster or slides (PDF)

Because many SAS^® users either work for or own companies that house big data, the threat that malicious software poses becomes even more extreme. Malicious software, often abbreviated as malware, includes many different classifications, ways of infection, and methods of attack. This E-Poster highlights the types of malware, detection strategies, and removal methods. It provides guidelines to secure essential assets and prevent future malware breaches.

Read the paper (PDF) | View the e-poster or slides (PDF)

Secondary use of administrative claims data, EHRs and EMRs, registry data, and other data sources within the health data ecosystem provides rich opportunity and potential to study topics ranging from public health surveillance to comparative effectiveness research. Data sourced from individual sites can be limited in their scope, coverage, and statistical power. Sharing and pooling data from multiple sites and sources, however, present administrative, governance, analytic, and patient-privacy challenges. Distributed data networks represent a paradigm shift in health-care data sharing. They have evolved at a critical time when big data and patient privacy are often competing priorities. A distributed data network is one that has no central repository of data. Data reside behind the firewall of each data-contributing partner in a network. Each partner transforms its source data in accordance with a common data model and allows indirect access to data through a standard query approach using flexibly designed informatics tools. This presentation discusses how distributed data networks have matured to make important contributions to the health-care data ecosystem and the evolving Learning Healthcare System. The presentation focuses on 1) the distributed data network and its purpose, concept, guiding principles, and benefits; 2) common data models and their concepts, designs, and benefits; 3) analytic tool development and its design and implementation considerations; 4) Analytic chal

Read the paper (PDF)

This panel discusses a wide range of topics related to analytics in higher education. Panelists are from diverse institutions and represent academic research, information technology, and institutional research. Challenges related to data acquisition and quality, system support, and meeting customer needs are covered. Topics such as effective dashboards and reporting, big data, predictive analytics, and more are on the agenda.

Investors usually trade stocks or exchange-traded funds (ETFs) based on a methodology, such as a theory, a model, or a specific chart pattern. There are more than 10,000 securities listed on the US stock market. Picking the right one from so many candidates based on a methodology is usually a big challenge. This paper presents a methodology based on the CANSLIM theorem and the momentum trading (MT) theorem. We often hear of the cup and handle shape (C&H), double bottoms and multiple bottoms (MB), support and resistance lines (SRL), market direction (MD), fundamental analyses (FA), and technical analyses (TA). Those are all covered in the CANSLIM theorem. MT is a trading theorem based on stock moving direction or momentum. Both theorems are easy to learn but difficult to apply without an appropriate tool. The brokers' application system usually cannot provide such filtering due to its complexity. For example, for C&H, where is the handle located? For the MB, where is the last bottom you should trade at? Now, the challenging task can be fulfilled through SAS^®. This paper presents methods for applying the logic and presenting the results graphically through SAS. All SAS users, especially those who work directly in the capital markets business, can benefit from reading this document to achieve their investment goals. Much of the programming logic can also be adopted in SAS finance packages for clients.

Read the paper (PDF)

Walt Disney once said, 'Of all of our inventions for mass communication, pictures still speak the most universally understood language.' Using data visualization to tell our stories makes analytics accessible to a wider audience than we can reach through words and numbers alone. Through SAS^® Visual Analytics, we can provide insight to a wide variety of audiences, each of whom sees the data through a unique lens. Determining the best data and visualizations to provide for users takes concentrated effort and thoughtful planning. This session discusses how Western Kentucky University uses SAS Visual Analytics to provide a wide variety of users across campus with the information they need to visually identify trends, even some they never expected to see, and to answer questions they might not have thought to ask.

Read the paper (PDF)

A new ODS destination for creating Microsoft Excel workbooks is available starting in the third maintenance release for SAS^® 9.4. This destination creates native Microsoft Excel XLSX files, supports graphic images, and offers other advantages over the older ExcelXP tagset. In this presentation, you learn step-by-step techniques for quickly and easily creating attractive multi-sheet Excel workbooks that contain your SAS^® output. The techniques can be used regardless of the platform on which SAS software is installed. You can even use them on a mainframe! Creating and delivering your workbooks on demand and in real time using SAS server technology is discussed. Using earlier versions of SAS to create multi-sheet workbooks is also discussed. Although the title is similar to previous presentations by this author, this presentation contains new and revised material not previously presented.

Read the paper (PDF) | Download the data file (ZIP)

As a programmer specializing in tracking systems for research projects, I was recently given the task of implementing a newly developed, web-based tracking system for a complex field study. This tracking system uses an SQL database on the back end to hold a large set of related tables. As I learned about the new system, I found that there were deficiencies to overcome to make the system work on the project. Fortunately, I was able to develop a set of utilities in SAS^® to bridge the gaps in the system and to integrate the system with other systems used for field survey administration on the project. The utilities helped to do the following: 1) connect schemas and compare cases across subsystems; 2) compare the statuses of cases across multiple tracked processes; 3) generate merge input files to be used to initiate follow-up activities; 4) prepare and launch SQL stored procedures from a running SAS job; and 5) develop complex queries in Microsoft SQL Server Management Studio and make them run in SAS. This paper puts each of these needs into a larger context by describing the need that is not addressed by the tracking system. Then, each program is explained and documented, with comments.

Read the paper (PDF)

Do you ever feel like you email the same reports to the same people over and over and over again? If your customers are anything like mine, you create reports, and lots of them. Our office is using macros, SAS^® email capabilities, and other programming techniques, in conjunction with our trusty contact list, to automate report distribution. Customers now receive the data they need, and only the data they need, on the schedule they have requested. In addition, not having to send these emails out manually saves our office valuable time and resources that can be used for other initiatives. In this session, we walk through a few of the SAS techniques we are using to provide better service to our internal and external partners and, hopefully, make us look a little more like rock stars.
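
The contact-list-driven approach described above can be sketched as follows. This is an illustration in Python, not the SAS macro code used by the authors: the contact fields (`name`, `email`, `report`, `schedule`), the schedule values, and the addresses are all hypothetical. The sketch only builds the messages; in practice they could be handed to `smtplib.SMTP.send_message`.

```python
from email.message import EmailMessage


def is_due(contact, weekday):
    """Return True if this contact's report is due on the given weekday.

    Assumes a hypothetical 'schedule' field holding either 'daily'
    or a lowercase weekday name such as 'monday'.
    """
    return contact["schedule"] in ("daily", weekday)


def build_report_emails(contacts, sender, weekday):
    """Build one message per contact whose requested schedule matches today."""
    messages = []
    for c in contacts:
        if not is_due(c, weekday):
            continue
        msg = EmailMessage()
        msg["From"] = sender
        msg["To"] = c["email"]
        msg["Subject"] = f"Scheduled report: {c['report']}"
        msg.set_content(f"Hello {c['name']},\n\nYour {c['report']} report is ready.\n")
        messages.append(msg)
    return messages


if __name__ == "__main__":
    contacts = [
        {"name": "Ana", "email": "ana@example.edu", "report": "Enrollment", "schedule": "monday"},
        {"name": "Ben", "email": "ben@example.edu", "report": "Retention", "schedule": "daily"},
    ]
    for m in build_report_emails(contacts, "ir-office@example.edu", "tuesday"):
        print(m["To"], "|", m["Subject"])
```

Keeping the schedule in the contact list rather than in the code means a change in who receives what, and when, requires editing data only.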

Read the paper (PDF)

At the University of Central Florida (UCF), Student Development and Enrollment Services (SDES) partnered with Institutional Knowledge Management (IKM), which is the official source of data at UCF, to bring to life an electronic version of the SDES Dashboard. Previously, SDES invested over two months in a manual process to create a booklet with graphs and data that was not vetted by IKM; upon review, IKM detected many data errors and inconsistencies in the figures, which had been manually collected by multiple staff members over the years. The objective was to redesign this booklet using SAS^® Web Report Studio. The result was a collection of five major reports. IKM reports use SAS^® Business Intelligence (BI) tools to surface the official UCF data, which is provided to the State of Florida. It now takes less than an hour to refresh these reports for the next academic year cycle. Challenges in the design, implementation, usage, and performance are presented.

Read the paper (PDF)

Panel data, which are collected on a set (panel) of individuals over several time points, are ubiquitous in economics and other analytic fields because their structure allows for individuals to act as their own control groups. The PANEL procedure in SAS/ETS^® software models panel data that have a continuous response, and it provides many options for estimating regression coefficients and their standard errors. Some of the available estimation methods enable you to estimate a dynamic model by using a lagged dependent variable as a regressor, thus capturing the autoregressive nature of the underlying process. Including lagged dependent variables introduces correlation between the regressors and the residual error, which necessitates using instrumental variables. This paper guides you through the process of using the typical estimation method for this situation, the generalized method of moments (GMM), and the process of selecting the optimal set of instrumental variables for your model. Your goal is to achieve unbiased, consistent, and efficient parameter estimates that best represent the dynamic nature of the model.
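
The role of instrumental variables in a dynamic panel model can be illustrated with a much simpler estimator than GMM. The Python sketch below is not the PANEL procedure's GMM estimator: it uses the Anderson-Hsiao approach, which first-differences the assumed model y[i][t] = rho * y[i][t-1] + alpha[i] + e[i][t] to remove the individual effect and then uses the single instrument y[i][t-2] for the differenced lag. GMM estimators build on the same idea with many more instruments. The function name and data layout are hypothetical.

```python
def anderson_hsiao_rho(panel):
    """Instrumental-variable estimate of rho in a dynamic panel model.

    panel : list of per-individual series [y_0, y_1, ..., y_T], each with
            at least three time points.

    First-differencing removes the individual effect alpha[i], but the
    differenced lag (y[t-1] - y[t-2]) is correlated with the differenced
    error, so the level y[t-2] is used as its instrument.
    """
    num = 0.0
    den = 0.0
    for y in panel:
        for t in range(2, len(y)):
            dy = y[t] - y[t - 1]          # differenced response
            dy_lag = y[t - 1] - y[t - 2]  # differenced lagged response
            z = y[t - 2]                  # instrument
            num += z * dy
            den += z * dy_lag
    return num / den


if __name__ == "__main__":
    # Noise-free series generated with rho = 0.5 and different individual effects.
    panel = [
        [0.0, 1.0, 1.5, 1.75, 1.875],  # alpha = 1
        [1.0, 2.5, 3.25, 3.625],       # alpha = 2
    ]
    print(anderson_hsiao_rho(panel))
```

With real data the errors are not zero, so the estimate is only consistent rather than exact, and its precision depends on how strongly the instrument is correlated with the differenced lag. That dependence is why the choice of instrument set matters.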

Read the paper (PDF)

The latest releases of SAS^® Data Management software provide a comprehensive and integrated set of capabilities for collecting, transforming, and managing your data. The latest features in the product suite include capabilities for working with data from a wide variety of environments and types including Hadoop, cloud data sources, RDBMS, files, unstructured data, streaming, and others, and the ability to perform ETL and ELT transformations in diverse run-time environments including SAS^®, database systems, Hadoop, Spark, SAS^® Analytics, cloud, and data virtualization environments. There are also new capabilities for lineage, impact analysis, clustering, and other data governance features for enhancements to master data and support metadata management. This paper provides an overview of the latest features of the SAS^® Data Management product suite and includes use cases and examples for leveraging product capabilities.