SAS Global Forum 2017 Proceedings


A

Session 1337-2017:

A Data Mining Approach to Predict Students at Risk

With the increasing amount of educational data, educational data mining has become more and more important for uncovering the hidden patterns within institutional data so as to support institutional decision making (Luan 2012). However, only very limited studies have been done on educational data mining for institutional decision support. At the University of Connecticut (UCONN), organic chemistry is a required course for undergraduate students in a STEM discipline. It has a very high DFW rate (D=Drop, F=Failure, W=Withdraw). Take Fall 2014 as an example: the average DFW% for the Organic Chemistry lectures was 24% at UCONN, and there were over 1200 students enrolled in this class. In this study, data on undergraduate students enrolled during School Year 2010-2011 were used to build the model. The purpose of this study was to predict future student success so as to improve the education quality in our institution. The Sample, Explore, Modify, Model, and Assess (SEMMA) method introduced by SAS was applied to develop the predictive model. The freshmen SAT scores, campus, semester GPA, financial aid, and other factors were used to predict students' performance in this course. In the predictive modeling process, several modeling techniques (decision tree, neural network, ensemble models, and logistic regression) were compared with each other in order to find an optimal one for our institution.

Read the paper (PDF)

Youyou Zheng, University of Connecticut

Thanuja Sakruti, University of Connecticut

Session SAS0727-2017:

Accessibility and SAS^® University Edition: Tips for Students and Professors

Accessibility has become a hot topic on campus due to a flurry of recent investigations of discrimination against students with disabilities by the U.S. Department of Justice and the U.S. Department of Education. This paper provides an update on the latest improvements in SAS^® University Edition that are specifically targeted to enable students with disabilities to excel in the classroom and beyond. This paper covers the entire SAS University Edition user experience including installation, documentation, training, support, using SAS^® Studio, and the new accessibility features in the fourth maintenance release of SAS^® 9.4.

Read the paper (PDF)

Ed Summers, SAS

Amy Peters, SAS

Session 0988-2017:

An Analysis of the Coding Movement: How Code.org and Educators Are Bringing Coding to Every Student

Technology plays an integral role in every aspect of daily life. As a result, educators should leverage technology-based learning to ensure that students are provided with authentic, engaging, and meaningful learning experiences (Pringle, Dawson, and Ritzhaupt, 2015). The significance and value of computer science understanding continue to increase. A major resource that can be credited with spreading support for computer science is the site Code.org. Its mission is to enable every student in every school to have the opportunity to learn computer science (https://code.org/about). Two years ago, our mentor partnered with Code.org to conduct workshops within the Charlotte, NC area to educate teachers on how to teach computer science activities and concepts in their classrooms. We had the opportunity to assist during the workshops to provide student perspectives and opinions. As we looked back on the workshops, we wondered, "How are the teachers who attended the workshops implementing the concepts they were taught?" After each workshop, a survey was distributed to the attendees to receive workshop feedback and to follow up. We collected the data from the surveys sent to participants and analyzed it using SAS^® University Edition. The survey results indicated that the workshops were beneficial and that the educators had implemented a concept that they learned. We believe that computer science activity implementations will assist students across the curriculum.

View the e-poster or slides (PDF)

Lauren Cook, University of North Carolina at Charlotte

Talazia Moore, North Carolina State University

Session 0853-2017:

An Information Technology Perspective of SAS^® at The University of Memphis

The University of Memphis has been a SAS^® customer since the mainframe computing era. Our deployments have included various SAS products involving web-based applications, client/server implementations, desktop installations, and virtualized services. This paper uses an information technology (IT) perspective to discuss how the University has leveraged SAS, as well as the latest benefits and challenges for our most recent deployment involving SAS^® Visual Analytics.

Read the paper (PDF)

Robert Jackson, University of Memphis

Session 1251-2017:

Analyzing Correlated Data in SAS^®

Correlated data is extensively used across disciplines when modeling data with any type of correlation that might exist among observations due to clustering or repeated measurements. When modeling clustered data, hierarchical linear modeling (HLM) is a popular multilevel modeling technique that is widely used in different fields such as education and health studies (Gibson and Olejnik, 2003). A typical example of multilevel data involves students nested within classrooms who behave similarly due to shared situational factors. Ignoring their correlation might result in underestimated standard errors and inflated type-I error (Raudenbush and Bryk, 2002). When modeling longitudinal data, many studies have been conducted on continuous outcomes. However, fewer studies on discrete responses over time have been completed. These studies require models from the conditional, transitional, and marginal model families (Fitzmaurice et al., 2009). Examples of such models that enable researchers to account for the autocorrelation among repeated observations include generalized linear mixed models (GLMM), generalized estimating equations (GEE), alternating logistic regression (ALR), and fixed effects with conditional logit analysis. This study explores the aforementioned methods as well as several other correlated modeling options for longitudinal and hierarchical data within SAS^® 9.4 using real data sets. These procedures include PROC GLIMMIX, PROC GENMOD, PROC NLMIXED, PROC GEE, PROC PHREG, and PROC MIXED.

Read the paper (PDF)

Niloofar Ramezani, University of Northern Colorado

Session 1477-2017:

Analyzing Residuals in a PROC SURVEYLOGISTIC Model

Data from an extensive survey conducted by the National Center for Education Statistics (NCES) is used for predicting qualified secondary school teachers across public schools in the U.S. The sample data includes socioeconomic data at the county level, which is used as a predictor for hiring a qualified teacher. The resultant model is used to score other regions and is presented on a heat map of the U.S. The survey family of procedures that SAS^® offers, such as PROC SURVEYFREQ and PROC SURVEYLOGISTIC, is used in the analyses since the data involves replicate weights. In looking at residuals from a logistic regression, since all the outcomes (observed values) are either 0 or 1, the residuals do not necessarily follow the normal distribution that is so often assumed in residual analysis. Furthermore, in dealing with survey data, the weights of the observations must be accounted for, as these affect the variance of the observations. To adjust for this, rather than looking at the difference between the observed and predicted values, the difference between the expected and actual counts is calculated by using the weight on each observation and the predicted probability from the logistic model for the observation. Three types of residuals are analyzed: Pearson, deviance, and studentized residuals. The purpose is to identify which type of residual best satisfies the assumption of normality when investigating residuals from a logistic regression.

Read the paper (PDF)

Bogdan Gadidov, Kennesaw State University
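The weighted residuals described in the abstract for Session 1477-2017 can be illustrated with a minimal Python sketch. It uses the textbook prior-weight forms of the Pearson and deviance residuals for a binary outcome; this is an assumption, and the paper's exact definitions (and its studentized residual) may differ. The function names and sample records are hypothetical.

```python
import math


def pearson_residual(y, p, w=1.0):
    """Weighted Pearson residual for a binary outcome.

    y: observed outcome (0 or 1)
    p: predicted probability from the logistic model, strictly between 0 and 1
    w: observation weight (assumed here to act as a prior weight)
    The numerator is the actual weighted count minus the expected weighted count.
    """
    return (w * y - w * p) / math.sqrt(w * p * (1.0 - p))


def deviance_residual(y, p, w=1.0):
    """Weighted deviance residual, signed by the direction of (y - p)."""
    log_lik = y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return math.copysign(math.sqrt(-2.0 * w * log_lik), y - p)


if __name__ == "__main__":
    # Hypothetical (outcome, predicted probability, weight) records.
    records = [(1, 0.8, 4.0), (0, 0.3, 2.5), (1, 0.4, 1.0)]
    for y, p, w in records:
        print(y, p, w,
              round(pearson_residual(y, p, w), 4),
              round(deviance_residual(y, p, w), 4))
```

Once computed, residuals of each type could be checked against normality, for example with a normal quantile plot, which is the comparison the abstract describes.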

C

Session 1409-2017:

Classification Decision Accuracy and Consistency Using SAS/IML^® Software

In this paper, we introduce a SAS/IML^® program of Classification Accuracy and Classification Consistency (CA/CC) that provides useful resources to test analysts or psychometricians. Our program optimizes functions of SAS^® by offering the CA/CC statistics not only with dichotomous items, but also with polytomous items. Classification Decision (CD) is a method to categorize examinees into achievement groups based on cut scores (Quinn and Cheng, 2013). CD has been predominantly used in educational and vocational situations such as admissions, selection, placement, or certification. This method needs to be accurate because its use is important to examinees' professional and academic futures. Classification Accuracy and Classification Consistency (CA/CC) statistics are indices representing the precision of CD, and they need to be reported in order to affirm the validity of the CD. Classification Accuracy refers to the degree to which the classification of observed scores matches the classification of true scores, and Classification Consistency is defined as the degree to which examinees are classified in the same category when taking two parallel test forms (Lee, 2010). Under item response theory (IRT), there are two methods to calculate CA/CC: the Rudner (2001) and Lee (2010) approaches. This research deals with these two approaches for CA/CC at the examinee level.

View the e-poster or slides (PDF)

Sung-Hyuck Lee, ACT

Kyung Yong Kim, University of Iowa
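To make the examinee-level CA/CC idea from Session 1409-2017 concrete, here is a small Python sketch in the spirit of the Rudner approach. It assumes the true score is normally distributed around the ability estimate with the reported standard error. The consistency index computed as a sum of squared category probabilities is an assumed extension, not necessarily what the authors' SAS/IML program does. All names and numbers are hypothetical.

```python
from statistics import NormalDist


def category_probs(theta_hat, se, cuts):
    """Probability that the true score falls in each category.

    Assumes true score ~ Normal(theta_hat, se**2); cuts must be ascending.
    """
    dist = NormalDist(mu=theta_hat, sigma=se)
    bounds = [0.0] + [dist.cdf(c) for c in cuts] + [1.0]
    return [hi - lo for lo, hi in zip(bounds, bounds[1:])]


def observed_category(theta_hat, cuts):
    """Index of the category in which the estimate itself is classified."""
    return sum(theta_hat >= c for c in cuts)


def examinee_ca_cc(theta_hat, se, cuts):
    """Return (accuracy, consistency) for one examinee."""
    probs = category_probs(theta_hat, se, cuts)
    accuracy = probs[observed_category(theta_hat, cuts)]
    # Chance that two independent parallel forms give the same category.
    consistency = sum(p * p for p in probs)
    return accuracy, consistency


if __name__ == "__main__":
    # Hypothetical examinee: estimate 1.0, standard error 1.0, one cut score at 0.
    print(examinee_ca_cc(1.0, 1.0, [0.0]))
```

Averaging these examinee-level values over a sample would give test-level indices under the same assumptions.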

Session 1098-2017:

Classroom Success with SAS^® Grid Manager and SAS^® Visual Analytics: Coping With Big Data

The Institute for Advanced Analytics struggled to provide student computing environments capable of analyzing increasingly large data sets for its Master of Science in Analytics program. For the fast-paced practicum, the centerpiece of the curriculum, waiting 24 hours for a FREQ procedure to complete was unacceptable. Practicum proposals from industry were pared down (or turned down) because the data sets were too large, depriving students of exciting and relevant learning experiences. By augmenting the practicum architecture with an 18-node computing cluster running SAS^® Grid Manager, SAS^® Visual Analytics, and the latest high-performance SAS^® procedures, we were able to dramatically increase performance and begin accepting terabyte-scale practicum proposals from industry. In this paper, we discuss the benefits and lessons learned through adding these SAS products to our analytics degree program, including capability versus complexity tradeoffs, and the state of our current capabilities and limitations with this architecture.

Read the paper (PDF)

John Jernigan, Institute for Advanced Analytics at NC State University

Ken Gahagan, SAS

Cheryl Doninger, SAS

D

Session 1172-2017:

Data Analytics and Visualization Tell Your Story with a Web Reporting Framework Based on SAS^®

For all business analytics projects big or small, the results are used to support business or managerial decision-making processes, and many of them eventually lead to business actions. However, executives or decision makers are often confused and feel uninformed about contents when presented with complicated analytics steps, especially when multiple processes or environments are involved. After many years of research and experiment, a web reporting framework based on SAS^® Stored Processes was developed to smooth the communication between data analysts, researchers, and business decision makers. This web reporting framework uses a storytelling style to present essential analytical steps to audiences, with dynamic HTML5 content and drill-down and drill-through functions in text, graph, table, and dashboard formats. No special skills other than SAS^® programming are needed for implementing a new report. The model-view-controller (MVC) structure in this framework significantly reduced the time needed for developing high-end web reports for audiences not familiar with SAS. Additionally, the report contents can be delivered to tablet or smartphone users. A business analytical example is demonstrated during this session. By using this web reporting framework based on SAS Stored Processes, many existing SAS results can be delivered more effectively and persuasively on a SAS^® Enterprise BI platform.

Read the paper (PDF)

Qiang Li, Locfit LLC

Session 0837-2017:

Data Science Rex: How Data Science Is Evolving (or Facing Extinction) across the Academic Landscape

The discipline of data science has seen an unprecedented evolution from primordial darkness to becoming the academic equivalent of an apex predator on university campuses across the country. But survival of the discipline is not guaranteed. This session explores the genetic makeup of programs that are likely to survive, the genetic makeup of those that are likely to become extinct, and the role that the business community plays in that evolutionary process.

Read the paper (PDF)

Jennifer Priestley, Kennesaw State University

Session 0886-2017:

Data Validation Using the SAS^® SORT Procedure and MERGE Statement

Data validation plays a key role as an organization engages in a data governance initiative. Better data leads to better decisions. This applies to public schools as well as business entities. Each Local Educational Agency (LEA) in Pennsylvania reports children with disabilities to the Pennsylvania Department of Education (PDE) in compliance with IDEA (Individuals with Disabilities Education Act). PDE provides a Comparison Report to each LEA to assist in their data validation process. This Comparison Report provides counts of various categories for the most recent and previous year. LEAs use the Comparison Report to validate data submitted to PDE. This paper discusses how the Base SAS^® SORT procedure and MERGE statement extract hidden information behind the counts to assist LEAs in their data validation process.

Read the paper (PDF)

Barry Frye, Appalachia Intermediate Unit 8
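The sort-and-merge validation idea in Session 0886-2017 can be sketched outside SAS as well. The Python example below is an analog, not the paper's code: it matches two years of records on a key and separates records present in only one year from records whose values changed, which is roughly what a MERGE with IN= flags would expose. The field names (`student_id`, `category`) and the sample rows are hypothetical.

```python
def compare_years(previous, current, key="student_id"):
    """Match two lists of record dicts on `key` and classify the differences."""
    prev_by_key = {row[key]: row for row in previous}
    curr_by_key = {row[key]: row for row in current}
    report = {"only_previous": [], "only_current": [], "changed": []}
    # Walk the keys in sorted order, mirroring a sorted match-merge.
    for k in sorted(prev_by_key.keys() | curr_by_key.keys()):
        in_prev, in_curr = k in prev_by_key, k in curr_by_key
        if in_prev and not in_curr:
            report["only_previous"].append(k)
        elif in_curr and not in_prev:
            report["only_current"].append(k)
        elif prev_by_key[k] != curr_by_key[k]:
            report["changed"].append(k)
    return report


if __name__ == "__main__":
    # Hypothetical records for two reporting years.
    last_year = [
        {"student_id": 101, "category": "A"},
        {"student_id": 102, "category": "B"},
        {"student_id": 103, "category": "C"},
    ]
    this_year = [
        {"student_id": 102, "category": "B"},
        {"student_id": 103, "category": "D"},
        {"student_id": 104, "category": "A"},
    ]
    print(compare_years(last_year, this_year))
```

A count-level comparison shows only that totals moved; a record-level match like this shows which records account for the movement.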

Session 1472-2017:

Differential Item Functioning Using SAS^®: An Item Response Theory Approach for Graded Responses

Until recently, psychometric analyses of test data within the Item Response Theory (IRT) framework were conducted using specialized, commercial software. However, with the inclusion of the IRT procedure in the suite of SAS^® statistical tools, SAS users can explore the psychometric properties of test items using modern test theory or IRT. Considering the item as the unit of analysis, the relationship between test items and the constructs they measure can be modeled as a function of an unobservable or latent variable. This latent variable or trait (for example, ability or proficiency) varies in the population. However, when examinees having the same trait level do not have the same probability of answering correctly or endorsing an item, we say that such an item might be functioning differently or exhibiting differential item functioning or DIF (Thissen, Steinberg, and Wainer, 2012). This study introduces the implementation of PROC IRT for conducting a DIF analysis for graded responses, using Samejima's graded response model (GRM; Samejima, 1969, 2010). The effectiveness of PROC IRT for evaluation of DIF items is assessed in terms of the Type I error and statistical power of the likelihood ratio test for testing DIF in graded responses.

Patricia Rodríguez de Gil, University of South Florida

Session 0957-2017:

Does Factor Indeterminacy Matter in Multidimensional Item Response Theory?

This paper illustrates proper applications of multidimensional item response theory (MIRT), which is available in SAS^® PROC IRT. MIRT combines item response theory (IRT) modeling and factor analysis when the instrument carries two or more latent traits. Although it might seem convenient to accomplish two tasks simultaneously by using one procedure, users should be cautious of misinterpretations. This illustration uses the 2012 Program for International Student Assessment (PISA) data set collected by the Organisation for Economic Co-operation and Development (OECD). Because there are two known sub-domains in the PISA test (reading and math), PROC IRT was programmed to adopt a two-factor solution. In addition, the loading plot, dual plot, item difficulty/discrimination plot, and test information function plot in JMP^® were used to examine the psychometric properties of the PISA test. When reading and math items were analyzed in SAS MIRT, seven to 10 latent factors were suggested. At first glance, these results are puzzling because ideally all items should be loaded into two factors. However, when the psychometric attributes yielded from a two-parameter IRT analysis are examined, it is evident that both the reading and math test items are well written. It is concluded that even if factor indeterminacy is present, it is advisable to evaluate the test's psychometric soundness based on IRT because content validity can supersede construct validity.

Read the paper (PDF) | View the e-poster or slides (PDF)

Chong Ho Yu, Azusa Pacific University

E

Session 1068-2017:

Establishing an Agile, Self-Service Environment to Empower Agile Analytic Capabilities

Creating an environment that enables and empowers self-service and agile analytic capabilities requires a tremendous amount of collaboration and extensive agreement between IT and the business. Business and IT users are struggling to know what version of the data is valid, where they should get the data from, and how to combine and aggregate all the data sources to apply analytics and deliver results in a timely manner. All the while, IT is struggling to supply the business with more and more data that is becoming available through many different data sources such as the Internet, sensors, the Internet of Things, and others. In addition, once they start trying to join and aggregate all the different types of data, the manual coding can be very complicated and tedious, can demand extraneous resources and processing, and can negatively impact the overhead on the system. If IT enables agile analytics in a data lab, it can alleviate many of these issues, increase productivity, and deliver an effective self-service environment for all users. This self-service environment using SAS^® analytics in Teradata has decreased the time required to prepare the data and develop the statistical data model, and delivered faster results in minutes compared to days or even weeks. This session discusses how you can enable agile analytics in a data lab and leverage SAS analytics in Teradata to increase performance, and how hundreds of organizations have adopted this concept to deliver self-service capabilities in a streamlined process.

Bob Matsey, Teradata

David Hare, SAS

Session 0986-2017:

Estimation of Student Growth Percentile Using SAS^® Procedures

Student growth percentile (SGP) is one of the most widely used score metrics for measuring a student's academic growth. Using longitudinal data, SGP describes a student's growth as the relative standing among students who had a similar level of academic achievement in previous years. Although several models for SGP estimation have been introduced, and some models have been implemented with R, no studies have yet described using SAS^®. As a result, this research describes various types of SGP models and demonstrates how practitioners can use SAS procedures to fit these models. Specifically, this study covers three types of statistical models for SGP: (1) a quantile regression-based model, (2) a conditional cumulative density function-based model, and (3) a multidimensional item response theory-based model. Each of the three models partly uses procedures in SAS, such as PROC QUANTREG, PROC LOGISTIC, PROC TRANSREG, PROC IRT, or PROC MCMC, for its computation. The program code is illustrated using a simulated longitudinal data set over two consecutive years, which is generated by SAS/IML^®. In addition, the interpretation of the estimation results and the advantages and disadvantages of implementing these three approaches in SAS are discussed.

View the e-poster or slides (PDF)

Hongwook Suh, ACT

Robert Ankenmann, The University of Iowa
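The definition of SGP in the abstract for Session 0986-2017 (relative standing among students with similar prior achievement) can be illustrated with a deliberately simple empirical sketch in Python. It is not one of the three models the paper covers: instead of quantile regression, it just ranks a student's current score among peers whose prior score lies within a fixed band. The band width, function name, and scores are all hypothetical.

```python
def growth_percentile(student, cohort, band=5):
    """Empirical stand-in for a student growth percentile.

    student: (prior_score, current_score) for the student of interest
    cohort:  list of (prior_score, current_score) pairs for comparison students
    band:    peers are those whose prior score is within +/- band points
    Returns the percentage of peers with a lower current score, or None
    when no peers fall inside the band.
    """
    prior, current = student
    peers = [c for p, c in cohort if abs(p - prior) <= band]
    if not peers:
        return None
    below = sum(c < current for c in peers)
    return round(100 * below / len(peers))


if __name__ == "__main__":
    # Hypothetical two-year scores: (last year, this year).
    cohort = [(50, 60), (52, 55), (48, 70), (51, 65), (90, 95)]
    print(growth_percentile((50, 66), cohort))
```

A model-based approach such as quantile regression replaces the fixed band with a smooth estimate of the conditional score distribution, which is why it behaves better when few students share a similar prior score.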

Session 0788-2017:

Examining Higher Education Performance Metrics with SAS^® Enterprise Miner™ and SAS^® Visual Analytics

Given the proposed budget cuts to higher education in the state of Kentucky, public universities will likely be awarded financial appropriations based on several performance metrics. The purpose of this project was to conceptualize, design, and implement predictive models that addressed two of the state's metrics: six-year graduation rate and fall-to-fall persistence for freshmen. The Western Kentucky University (WKU) Office of Institutional Research analyzed five years' worth of data on first-time, full-time bachelor's degree seeking students. Two predictive models evaluated and scored current students on their likelihood to stay enrolled and their chances of graduating on time. Following an ensemble of machine-learning assessments, the scored data were imported into SAS^® Visual Analytics, where interactive reports allowed users to easily identify which students were at a high risk for attrition or at risk of not graduating on time.

Read the paper (PDF)

Taylor Blaetz, Western Kentucky University

Tuesdi Helbig, Western Kentucky University

Gina Huff, Western Kentucky University

Matt Bogard, Western Kentucky University

F

Session 0863-2017:

Framework for Strategic Analysis in Higher Education

Higher education institutions have a plethora of analytical needs. However, the irregular and inconsistent practices in connecting those needs with appropriate analytical delivery systems have resulted in a patchwork. This patchwork sometimes overlaps unnecessarily and sometimes exposes unaddressed gaps. The purpose of this paper is to examine a framework of components for addressing institutional analytical needs, while leveraging existing institutional strengths to maximize analytical goal attainment most effectively and efficiently. The core of this paper is a focused review of components for attaining greater analytical strength and goal attainment in the institution.

Read the paper (PDF)

Glenn James, Tennessee Tech University

Session 0810-2017:

Freedom to Inspire and Achieve Excellence

Innovation in teaching and assessment has become critical for many reasons. This is especially true in the fields of data science and big data analytics. Reasons range from the need to significantly improve the development of soft skills (as reported in an e-skills UK and SAS^® joint report from November 2014), to the rapidly changing software standards of products used by students, to the rapidly increasing range of functionality and product set, to the need to develop lifelong learning skills to learn new software and functionality. And these are just a few of the reasons. In some educational institutions, it is easy to be extremely innovative. However, in many institutions and countries, there are numerous constraints on the levels of innovation that can be implemented. This presentation captures the author's developing pedagogic practice at the University of Derby. He suggests fundamental changes to the classic approaches to teaching and assessing data science and big data analytics. These changes have resulted in significant improvements in student engagement, achievement, and soft skills. Improvements are illustrated by innovations in teaching SAS to first-year students and teaching IBM Bluemix and Watson Analytics to final-year students. Students have successfully developed both technical and soft skills and experienced excellent levels of achievement.

Read the paper (PDF)

Richard Self, University of Derby

G

Session 1025-2017:

GMM Logistic Regression with Time-Dependent Covariates and Feedback Processes in SAS^®

The analysis of longitudinal data requires a model that correctly accounts for both the inherent correlation among the responses as a result of the repeated measurements and the feedback between the responses and predictors at different time points. Lalonde, Wilson, and Yin (2013) developed an approach based on generalized method of moments (GMM) for identifying and using valid moment conditions to account for time-dependent covariates in longitudinal data with binary outcomes. However, the model developed using this approach does not provide information about the specific relationships that exist across time points. We present a SAS^® macro that extends the work of Lalonde, Wilson, and Yin by using valid moment conditions to estimate and evaluate the relationships between the response and predictors at different time periods. The performance of this method is compared to previously established results.

Read the paper (PDF)

Jeffrey Wilson, Arizona State University

H

Session 0794-2017:

Hands-On Graph Template Language (GTL): Part A

Would you like to be more confident in producing graphs and figures? Do you understand the differences between the OVERLAY, GRIDDED, LATTICE, DATAPANEL, and DATALATTICE layouts? Finally, would you like to learn the fundamental Graph Template Language methods in a relaxed environment that fosters questions? Great, this topic is for you! In this hands-on workshop, you are guided through the fundamental aspects of the GTL procedure, and you can try fun and challenging SAS^® graphics exercises to enable you to more easily retain what you have learned.

Read the paper (PDF) | Download the data file (ZIP)

Kriss Harris

Session 0864-2017:

Hands-on Graph Template Language (GTL): Part B

Do you need to add annotations to your graphs? Do you need to specify your own colors on the graph? Would you like to add Unicode characters to your graph, or would you like to create templates that can also be used by non-programmers to produce the required figures? Great, then this topic is for you! In this hands-on workshop, you are guided through the more advanced features of the GTL procedure. There are also fun and challenging SAS^® graphics exercises to enable you to more easily retain what you have learned.

Read the paper (PDF) | Download the data file (ZIP)

Kriss Harris

Session 1512-2017:

Hot Topics for Analytics in Higher Education

This panel discusses a wide range of topics related to analytics in higher education. Panelists are from diverse institutions and represent academic research, information technology, and institutional research. Challenges related to data acquisition and quality, system support, and meeting customer needs are covered. Topics such as effective dashboards and reporting, big data, predictive analytics, and more are on the agenda.

Stephanie Thompson, Datamum

Glenn James, Tennessee Tech University

Robert Jackson, University of Memphis

Sean Mulvenon, University of Arkansas

Carlos Piemonti, University of Central Florida

Richard Dirmyer, Rochester Institute of Technology

I

Session 0826-2017:

Improving the Evaluation of Higher Education: Understanding the Myths, Methods, and Metrics

A growing need in higher education is the more effective use of analytics when evaluating the success of a postsecondary institution. The metrics currently used are simplistic measures of graduation rates, publications, and external funding. These measures offer a limited view of the effectiveness of postsecondary institutions. This paper provides a global perspective of the academic progress of students and the business of higher education. It presents innovative metrics that are more effective in evaluating postsecondary-institutional effectiveness.

Read the paper (PDF)

Sean Mulvenon, University of Arkansas

Session 1513-2017:

Integrating SAS^® Visual Analytics with Google Maps for Analysis and Information Visualization

Exploring, analyzing, and presenting information are strengths of SAS^® Visual Analytics. However, when we need to expand the viewing of this information to an unlimited public outside the boundaries of the organization, we must aggregate geographic information to facilitate interaction and the use of information. This application uses JavaScript and CSS, integrated with SAS^® programming, to present information about 4,239 programs of postgraduate study in Brazil. This information was evaluated by the Brazilian Federal Agency for Support and Evaluation of Graduate Education (CAPES, Brazil) with cartographic precision, enabling the visualization of the data generated in SAS Visual Analytics integrated with Google Maps and Google Street View. Users can select from Brazilian postgraduate programs, view information about the program, learn about theses and dissertations, and see the location of the institution and the campus. The application that is presented can be accessed at http://goo.gl/uAjvGw.

Read the paper (PDF)

Marcus Palheta, CAPES

Sergio Costa Cortes, CAPES

Session 0779-2017:

It All Started with a Mouse: Storytelling with SAS^® Visual Analytics

Walt Disney once said, 'Of all of our inventions for mass communication, pictures still speak the most universally understood language.' Using data visualization to tell our stories makes analytics accessible to a wider audience than we can reach through words and numbers alone. Through SAS^® Visual Analytics, we can provide insight to a wide variety of audiences, each of which sees the data through a unique lens. Determining the best data and visualizations to provide for users takes concentrated effort and thoughtful planning. This session discusses how Western Kentucky University uses SAS Visual Analytics to provide a wide variety of users across campus with the information they need to visually identify trends, even some they never expected to see, and to answer questions they might not have thought to ask.

Read the paper (PDF)

Tuesdi Helbig, Western Kentucky University

K

Session 1069-2017:

Know Your Tools Before You Use

When analyzing data with SAS^®, we often use the SAS DATA step and the SQL procedure to explore and manipulate data. Though both are useful tools in SAS, many SAS users do not fully understand their differences, advantages, and disadvantages, and thus engage in numerous unnecessary, biased debates about them. Therefore, this paper illustrates and discusses these aspects with real work examples, which give SAS users deep insights into using them. Using the right tool for a given circumstance not only provides an easier and more convenient solution, it also saves time and work in programming, thus improving work efficiency. Furthermore, the illustrated methods and advanced programming skills can be used in a wide variety of data analysis and business analytics fields.

Read the paper (PDF)

Justin Jia, TransUnion

M

Session 1447-2017:

Making SAS^® Education Relevant to the Future Workforce

SAS^® education is a mainstay across disciplines and educational levels in the United States. Along with other courses that are relevant to the jobs students want, independent SAS courses or SAS education integrated into additional courses can help a student be more interesting to a potential employer. The multitude of SAS offerings (SAS^® University Edition, Base SAS^®, SAS^® Enterprise Guide^®, SAS^® Studio, and the SAS^® OnDemand offerings) provide the tools for education, but reaching students where they are is the greatest key for making the education count. This presentation discusses several roadblocks to learning SAS^® syntax or point-and-click from the student perspective and several solutions developed jointly by students and educators in one graduate educational program.

Read the paper (PDF)

Charlotte Baker, Florida A&M University

Matthew Dutton, Florida A&M University

Session 1009-2017:

Manage Your Parking Lot! Must-Haves and Good-to-Haves for a Highly Effective Analytics Team

Every organization, from the most mature to a day-one start-up, needs to grow organically. A deep understanding of internal customer and operational data is the single biggest catalyst to develop and sustain the data. Advanced analytics and big data directly feed into this, and there are best practices that any organization (across the entire growth curve) can adopt to drive success. Analytics teams can be drivers of growth. But to be truly effective, key best practices need to be implemented. These practices include in-the-weeds details, like the approach to data hygiene, as well as strategic practices, like team structure and model governance. When executed poorly, business leadership and the analytics team are unable to communicate with each other: they talk past each other and do not work together toward a common goal. When executed well, the analytics team is part of the business solution, aligned with the needs of business decision-makers, and drives the organization forward. Through our engagements, we have discovered best practices in three key areas. All three are critical to analytics team effectiveness: 1) Data Hygiene; 2) Complex Statistical Modeling; 3) Team Collaboration.

Read the paper (PDF)

Aarti Gupta, Bain & Company

Paul Markowitz, Bain & Company

Session 1231-2017:

Modeling Machiavellianism: Predicting Scores with Fewer Factors

Prince Niccolo Machiavelli said things on the order of, "The promise given was a necessity of the past: the word broken is a necessity of the present." His utilitarian philosophy can be summed up by the phrase, "The ends justify the means." As a personality trait, Machiavellianism is characterized by the drive to pursue one's own goals at the cost of others. In 1970, Richard Christie and Florence L. Geis created the MACH-IV test to assign a MACH score to an individual, using 20 Likert-scaled questions. The purpose of this study was to build a regression model that can be used to predict the MACH score of an individual using fewer factors. Such a model could be useful in screening processes where personality is considered, such as in job screening, offender profiling, or online dating. The research was conducted on a data set from an online personality test similar to the MACH-IV test. It was hypothesized that a statistically significant model exists that can predict an average MACH score for individuals with similar factors. This hypothesis was accepted.

View the e-poster or slides (PDF)

Patrick Schambach, Kennesaw State University

Session 1400-2017:

More than a Report: Mapping the TABULATE Procedure as a Nested Data Object

The TABULATE procedure has long been a central workhorse of our organization's reporting processes, given that it offers a uniquely concise syntax for obtaining descriptive statistics on deeply grouped and nested categories within a data set. Given the diverse output capabilities of SAS^®, it often then suffices to simply ship the procedure's completed output elsewhere via the Output Delivery System (ODS). Yet there remain cases in which we want to not only obtain a formatted result, but also to acquire the full nesting tree and logic by which the computations were made. In these cases, we want to treat the details of the TABULATE statements as data, not merely as presentation. I demonstrate how we have solved this problem by parsing our TABULATE statements into a nested tree structure in JSON that can be transferred and easily queried for deep values elsewhere beyond the SAS program. Along the way, this provides an excellent opportunity to walk through the nesting logic of the procedure's statements and explain how to think about the axes, groupings, and set computations that make it tick. The source code for our syntax parser is also available on GitHub for further use.

Read the paper (PDF)

Jason Phillips, The University of Alabama

Session 0764-2017:

Multi-Group Calibration in SAS^®: The IRT Procedure and SAS/IML^®

In item response theory (IRT), the distribution of examinees' abilities is needed to estimate item parameters. However, specifying the ability distribution is difficult, if not impossible, because examinees' abilities are latent variables. Therefore, IRT estimation programs typically assume that abilities follow a standard normal distribution. When estimating item parameters using two separate computer runs, one problem with this approach is that it causes item parameter estimates obtained from two groups that differ in ability level to be on different scales. There are several methods that can be used to place the item parameter estimates on a common scale, one of which is multi-group calibration. This method is also called concurrent calibration because all items are calibrated concurrently with a single computer run. There are two ways to implement multi-group calibration in SAS^®: 1) using PROC IRT; 2) writing an algorithm from scratch using SAS/IML^®. The purpose of this study is threefold. First, the accuracy of the item parameter estimates is evaluated using a simulation study. Second, the item parameter estimates are compared to those produced by the item calibration program flexMIRT. Finally, the advantages and disadvantages of using these two approaches to conduct multi-group calibration are discussed.
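The scale problem described in this abstract can be illustrated with a minimal Python sketch. It assumes the two-parameter logistic (2PL) model, which the abstract does not name, and every function name and number below is hypothetical. It shows only that a linear rescaling of the ability scale, matched by a rescaling of the item parameters, leaves the response probabilities unchanged, which is why separately calibrated groups can end up on different scales. It is not the paper's PROC IRT or SAS/IML^® implementation.

```python
import math

def p_correct(theta, a, b):
    """2PL item response function (assumed model): P(correct | theta)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def rescale_item(a, b, A, B):
    """Item parameters after the ability scale is transformed to A*theta + B."""
    return a / A, A * b + B

# Hypothetical item and an arbitrary linear change of scale.
a, b = 1.2, 0.5
A, B = 0.8, -0.3

theta = 1.0
theta_new = A * theta + B
a_new, b_new = rescale_item(a, b, A, B)

# The two probabilities agree: the data cannot distinguish the two scales.
print(p_correct(theta, a, b), p_correct(theta_new, a_new, b_new))
```

Because both parameter sets fit the same responses equally well, each separate run fixes its own scale by assuming a standard normal ability distribution for its own group.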

View the e-poster or slides (PDF)

Kyung Yong Kim, University of Iowa

Seohee Park, University of Iowa

Jinah Choi, University of Iowa

Hongwook Seo, ACT

N

Session 1440-2017:

Need a Graphic for a Scientific Journal? No Problem!

Graphics are an excellent way to display results from multiple statistical analyses and get a visual message across to the correct audience. Scientific journals often have very precise requirements for graphs that are submitted with manuscripts. While authors often find themselves using tools other than SAS^® to create these graphs, the combination of the SGPLOT procedure and the Output Delivery System enables authors to create what they need in the same place where they conducted their analysis. This presentation focuses on two methods for creating a publication-quality graphic in SAS^® 9.4 and provides solutions for some issues encountered when doing so.

Read the paper (PDF)

Charlotte Baker, Florida A&M University

P

Session 1252-2017:

Predicting Successful Math Teachers in Secondary Schools in the United States

Are secondary schools in the United States hiring enough qualified math teachers? In which regions is there a disparity of qualified teachers? Data from an extensive survey conducted by the National Center for Education Statistics (NCES) was used for predicting qualified secondary school teachers across public schools in the US. The three criteria examined to determine whether a teacher is qualified to teach a given subject are: 1) whether the teacher has a degree in the subject he or she is teaching; 2) whether he or she has a teaching certification in the subject; 3) whether he or she has five years of experience in the subject. A qualified teacher is defined as one who has all three of the previous qualifications. The sample data included socioeconomic data at the county level, which were used as predictors for hiring a qualified teacher. Data such as the number of students on free or reduced lunch at the school was used to assign schools as high-needs or low-needs schools. Other socioeconomic factors included were the income and education levels of working adults within a given school district. Some of the results show that schools with higher-needs students (a school that has more than 40% of the students on some form of reduced lunch program) have less-qualified teachers. The resultant model is used to score other regions and is presented on a heat map of the US. SAS^® procedures such as PROC SURVEYFREQ and PROC SURVEYLOGISTIC are used.

View the e-poster or slides (PDF)

Bogdan Gadidov, Kennesaw State University

Q

Session 0998-2017:

Quick Results with SAS^® University Edition

The announcement of SAS Institute's free SAS^® University Edition is an exciting development for SAS users and learners around the world! The software bundle includes Base SAS^®, SAS/STAT^® software, SAS/IML^® software, SAS^® Studio (user interface), and SAS/ACCESS^® for Windows, with all the popular features found in the licensed SAS versions. This is an incredible opportunity for users, statisticians, data analysts, scientists, programmers, students, and academics everywhere to use (and learn) for career opportunities and advancement. Capabilities include data manipulation, data management, comprehensive programming language, powerful analytics, high-quality graphics, world-renowned statistical analysis capabilities, and many other exciting features. This paper illustrates a variety of powerful features found in the SAS University Edition. Attendees will be shown a number of tips and techniques on how to use the SAS^® Studio user interface, and they will see demonstrations of powerful data management and programming features found in this exciting software bundle.

Read the paper (PDF)

Ryan Lafler

S

Session 1005-2017:

SAS^® Macros for Computing the Mediated Effect in the Pretest-Posttest Control Group Design

Mediation analysis is a statistical technique for investigating the extent to which a mediating variable transmits the relation of an independent variable to a dependent variable. Because it is useful in many fields, there have been rapid developments in statistical mediation methods. The most cutting-edge statistical mediation analysis focuses on the causal interpretation of mediated effect estimates. Cause-and-effect inferences are particularly challenging in mediation analysis because of the difficulty of randomizing subjects to levels of the mediator (MacKinnon, 2008). The focus of this paper is how incorporating longitudinal measures of the mediating and outcome variables aids in the causal interpretation of mediated effects. This paper provides useful SAS^® tools for designing adequately powered studies to detect the mediated effect. Three SAS macros were developed using the powerful but easy-to-use REG, CALIS, and SURVEYSELECT procedures to do the following: (1) implement popular statistical models for estimating the mediated effect in the pretest-posttest control group design; (2) conduct a prospective power analysis for determining the required sample size for detecting the mediated effect; and (3) conduct a retrospective power analysis for studies that have already been conducted, where the sample size required to detect an observed effect is desired. We demonstrate the use of these three macros with an example.
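As a rough illustration of what a mediated effect estimate is, the following Python sketch computes the common product-of-coefficients estimate (a times b) for a single-mediator model. This is an assumption made for illustration: it omits the pretest measures that the pretest-posttest design adds, it is not the paper's SAS macros, and the data and function names are hypothetical.

```python
def _mean(v):
    return sum(v) / len(v)

def _sxy(u, v):
    """Sum of cross-products of deviations from the means."""
    mu, mv = _mean(u), _mean(v)
    return sum((p - mu) * (q - mv) for p, q in zip(u, v))

def slope(x, y):
    """Least-squares slope of y on a single predictor x."""
    return _sxy(x, y) / _sxy(x, x)

def two_predictor_slopes(x1, x2, y):
    """Least-squares slopes of y on x1 and x2 (intercept included)."""
    s11, s22, s12 = _sxy(x1, x1), _sxy(x2, x2), _sxy(x1, x2)
    s1y, s2y = _sxy(x1, y), _sxy(x2, y)
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

# Hypothetical noise-free data: x is a treatment indicator, m the mediator.
x = [0, 0, 0, 0, 1, 1, 1, 1]
m = [0, 2, 0, 2, 2, 4, 2, 4]                        # m = 1 + 2*x + (+/-1)
y = [0.5 + 3 * mi + 1 * xi for mi, xi in zip(m, x)]  # y = 0.5 + 3*m + 1*x

a = slope(x, m)                             # effect of x on the mediator
b, c_prime = two_predictor_slopes(m, x, y)  # mediator effect, direct effect
c_total = slope(x, y)                       # total effect of x on y

print("mediated effect a*b:", a * b)
print("direct + mediated:", c_prime + a * b, "total:", c_total)
```

In this linear setting the total effect decomposes into the direct effect plus the mediated effect, which the last line checks numerically.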

Read the paper (PDF)

David MacKinnon, Arizona State University

Session 0268-2017:

%SURVEYGENMOD Macro: An Alternative to Deal with Complex Survey Design for the GENMOD Procedure

The purpose of this paper is to show a SAS^® macro named %SURVEYGENMOD, developed in the SAS/IML^® procedure as an upgrade of the macro %SURVEYGLM developed by Silva and Silva (2014), to deal with complex survey design in generalized linear models (GLMs). The new capabilities are the inclusion of the negative binomial distribution, the zero-inflated Poisson (ZIP) model, the zero-inflated negative binomial (ZINB) model, and the possibility to get estimates for domains. The R function svyglm (Lumley, 2004) and Stata software were used as background, and the results showed that estimates generated by the %SURVEYGENMOD macro are close to those of the R function and Stata software.

Read the paper (PDF)

Alan Ricardo da Silva, University of Brasilia

Session 1340-2017:

Simplified Project Management Using a SAS^® Visual Analytics Dashboard

The University of Central Florida (UCF) Institutional Knowledge Management (IKM) office provides data analysis and reporting for all UCF divisions. These projects are logged and tracked through the Oracle PeopleSoft content management system (CMS). In the past, projects were monitored via a weekly query pulled using SAS^® Enterprise Guide^®. The output would be filtered and prioritized based on project importance and due dates. A project list would be sent to individual staff members to make updates in the CMS. As data requests were increasing, UCF IKM needed a tool to get a broad overview of the entire project list and more efficiently identify projects in need of immediate attention. A project management dashboard that all IKM staff members can access was created in SAS^® Visual Analytics. This dashboard is currently being used in weekly project management meetings and has eliminated the need to send weekly staff reports.

View the e-poster or slides (PDF)

Andre Watts, University of Central Florida

Danae Barulich, University of Central Florida

Session 0846-2017:

Spawning SAS^® Sleeper Cells and Calling Them into Action: SAS^® University Parallel Processing

With the 2014 launch of SAS^® University Edition, the reach of SAS^® was greatly expanded to educators, students, researchers, non-profits, and the curious, who for the first time could use a full version of Base SAS^® software for free. Because SAS University Edition allows a maximum of two CPUs, however, performance is curtailed sharply from more substantial SAS environments that can benefit from parallel and distributed processing, such as environments that implement SAS^® Grid Manager, Teradata, or Hadoop solutions. Even when comparing performance of SAS University Edition against the most straightforward implementation of the SAS windowing environment, the SAS windowing environment demonstrates greater performance when run on the same computer. With parallel processing and distributed computing becoming the status quo in SAS production environments, SAS University Edition will unfortunately fall behind counterpart SAS solutions if it cannot harness parallel processing best practices and performance. To curb this disparity, this session introduces groundbreaking programmatic methods that enable commodity hardware to be networked so that multiple instances of SAS University Edition can communicate and work collectively to divide and conquer complex tasks. With parallel processing facilitated, a SAS practitioner can now harness an endless number of computers to produce blitzkrieg solutions with the SAS University Edition that rival the performance of more costly, complex infrastructure.

Troy Hughes, Datmesis Analytics

Session SAS0437-2017:

Stacked Ensemble Models for Improved Prediction Accuracy

Ensemble models have become increasingly popular in boosting prediction accuracy over the last several years. Stacked ensemble techniques combine predictions from multiple machine learning algorithms and use these predictions as inputs to a second-level learning algorithm. This paper shows how you can generate a diverse set of models by various methods (such as neural networks, extreme gradient boosting, and matrix factorizations) and then combine them with popular stacking ensemble techniques, including hill-climbing, generalized linear models, gradient boosted decision trees, and neural nets, by using both the SAS^® 9.4 and SAS^® Visual Data Mining and Machine Learning environments. The paper analyzes the application of these techniques to real-life big data problems and demonstrates how using stacked ensembles produces greater prediction accuracy than individual models and naïve ensembling techniques. In addition to training a large number of models, model stacking requires the proper use of cross validation to avoid overfitting, which makes the process even more computationally expensive. The paper shows how to deal with the computational expense and efficiently manage an ensemble workflow by using parallel computation in a distributed framework.

Read the paper (PDF)

Funda Gunes, SAS

Russ Wolfinger, SAS

Pei-Yi Tan, SAS

Session 0866-2017:

Student Development and Enrollment Services Dashboard at the University of Central Florida

At the University of Central Florida (UCF), Student Development and Enrollment Services (SDES) partnered with Institutional Knowledge Management (IKM), which is the official source of data at UCF, to bring to life an electronic version of the SDES Dashboard at UCF. Previously, SDES invested over two months in a manual process to create a booklet with graphs and data that was not vetted by IKM; upon review, IKM detected many data errors plus inconsistencies in the figures that had been manually collected by multiple staff members over the years. The objective was to redesign this booklet using SAS^® Web Report Studio. The result was a collection of five major reports. IKM reports use SAS^® Business Intelligence (BI) tools to surface the official UCF data, which is provided to the State of Florida. It now takes less than an hour to refresh these reports for the next academic year cycle. Challenges in the design, implementation, usage, and performance are presented.

Read the paper (PDF)

Carlos Piemonti, University of Central Florida

T

Session 1450-2017:

The Effects of Socioeconomic, Demographic Variables on US Mortality Using SAS^® Visual Analytics

Every visualization tells a story. The effectiveness of showing data through visualization becomes clear as these visualizations tell stories about differences in US mortality using the National Longitudinal Mortality Study (NLMS) Public-Use Microdata Samples (PUMS), which contain 1.2 million cases and 122 thousand records of mortality. SAS^® Visual Analytics is a versatile and flexible tool that easily displays the simple effects of differences in mortality rates between age groups, genders, races, places of birth (native or foreign), education and income levels, and so on. Sophisticated analyses including logistic regression (with interactions), decision trees, and neural networks that are displayed in a clear, concise manner help describe more interesting relationships among variables that influence mortality. Some of the most compelling examples are: Males who live alone have a higher mortality rate than females. White men have higher rates of suicide than black men.

Read the paper (PDF) | View the e-poster or slides (PDF)

Catherine Loveless-Schmitt, U.S. Census Bureau

Session 1322-2017:

The Orange Lifestyle

As a freshman at a large university, life can be fun as well as stressful. The choices a freshman makes while in college might impact his or her overall health. In order to examine the overall health and different behaviors of students at Oklahoma State University, a survey was conducted among the freshman students. The survey focused on capturing psychological and environmental factors, diet, exercise, and alcohol and drug use among students. A total of 795 out of 1,036 freshman students completed the survey, which included around 270 questions that covered the range of issues mentioned above. An exploratory factor analysis identified 26 factors. For example, two factors that relate to the behavior of students under stress are eating and relaxing. Further understanding the variables that contribute to alcohol and drug use might help the university in planning appropriate interventions and preventions. Factor analysis with Cronbach's alpha provided insight into a more defined set of variables to help address these types of issues. We used SAS^® to do factor analysis as well as to create different clusters of students with unique characteristics and profiled these clusters.
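For readers unfamiliar with Cronbach's alpha, the following Python sketch computes it from the standard formula, alpha = k/(k-1) * (1 - sum of item variances / variance of the total score), where k is the number of items. The item responses and the function name are hypothetical and are not taken from the survey described above.

```python
from statistics import variance

def cronbach_alpha(items):
    """Cronbach's alpha for a list of items, each a list of respondent scores.

    All items must cover the same respondents in the same order.
    """
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # per-respondent total
    item_var_sum = sum(variance(item) for item in items)
    return (k / (k - 1)) * (1 - item_var_sum / variance(totals))

# Hypothetical Likert-style responses: 3 items answered by 5 respondents.
stress_items = [
    [1, 2, 3, 4, 5],
    [2, 2, 3, 5, 5],
    [1, 3, 3, 4, 4],
]
print(round(cronbach_alpha(stress_items), 3))
```

Values closer to 1 indicate that the items in a factor vary together, which is the sense in which alpha helps narrow a large questionnaire to a more defined set of variables.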

Read the paper (PDF)

Mohit Singhi, Oklahoma State University

Session 0802-2017:

The Truth Is Out There: Leveraging Census Data Using PROC SURVEYLOGISTIC

The advent of robust and thorough data collection has resulted in the term big data. With Census data becoming richer, more nationally representative, and voluminous, we need methodologies that are designed to handle the manifold survey designs that Census data sets implement. The relatively nascent PROC SURVEYLOGISTIC, an experimental procedure in SAS^®9 and fully supported in SAS 9.1, addresses some of these methodologies, including clusters, strata, and replicate weights. PROC SURVEYLOGISTIC handles data that is not a straightforward random sample. Using Census data sets, this paper provides examples highlighting the appropriate use of survey weights to calculate various estimates, as well as the calculation and interpretation of odds ratios between categorical variable interactions when predicting a binary outcome.

Read the paper (PDF)

Richard Dirmyer, Rochester Institute of Technology

Session SAS0289-2017:

The Well-Equipped Student: Using SAS^® University Edition and E-Learning to Gain SAS^® Skills

SAS^® programming skills are much in demand, and numerous free tools are available for students who want to develop those skills. This paper introduces students to SAS^® Studio and the Jupyter Notebook interface within SAS^® University Edition. To make this introduction more tangible, the paper uses a large data set of baseball statistics as an example. In particular, statistical analysis using SAS^® Studio examines the relationship between salary and performance for major leaguers. From importing text files to creating basic statistics to doing a more advanced analysis, this paper shows multiple ways to carry out tasks so that you can choose whichever method works best for you. Additional statistics that use t tests and linear regression are simple with SAS University Edition. For completeness, the paper shows the same code that is used in SAS Studio examples in the context of Jupyter Notebook in SAS University Edition. The paper also provides additional information about SAS e-learning and SAS Certification to show students how to be fully equipped in order to apply themselves to analytics and data exploration.

Read the paper (PDF)

Randy Mullis, SAS

Allison Mahaffey, SAS

U

Session 1268-2017:

Use SAS^® Enterprise Guide^® and SAS^® Add-In for Microsoft Office to Support Enrollment Forecasting

This presentation explores the steps taken by a large public research institution to develop a five-year enrollment forecasting model to support the critical enrollment management process at an institution. A key component of the process is providing university stakeholders with a self-service, secure, and flexible tool that enables them to quickly generate different enrollment projections in Microsoft Excel using the most up-to-date information possible. The presentation shows how we integrated both SAS^® Enterprise Guide^® and the SAS^® Add-In for Microsoft Office to support this critical process, which had very specific stakeholder requirements and expectations.

Read the paper (PDF)

Andre Watts, University of Central Florida

Lisa Sklar, University of Central Florida

Session 0854-2017:

Using the LOGISTIC or SURVEYLOGISTIC Procedure and Weighting of Public-Use Data in the Classroom

The rapidly evolving informatics capabilities of the past two decades have resulted in amazing new data-based opportunities. Large public-use data sets are now available for easy download and utilization in the classroom. Days of classroom exercises based on static, clean, easily maneuverable samples of 100 or fewer are over. Instead, we have large and messy real-world data at our fingertips, allowing for educational opportunities not available in years past. There are now hundreds of public-use data sets available for download and analysis in the classroom. Many of these sources are survey-based and require the understanding of weighting techniques. These techniques are necessary for proper variance estimation, allowing for sound inferences through statistical analysis. This example uses the California Health Interview Survey to present and compare weighted and non-weighted results using the SURVEYLOGISTIC procedure.
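A minimal Python sketch can show why weighted and non-weighted results differ. The records and weights below are hypothetical and are not drawn from the California Health Interview Survey. The sketch covers point estimates only; design-based variance estimation (strata, clusters, or replicate weights, as handled by the SURVEYLOGISTIC procedure) is outside its scope.

```python
def unweighted_proportion(outcomes):
    """Share of sampled respondents with outcome == 1."""
    return sum(outcomes) / len(outcomes)

def weighted_proportion(outcomes, weights):
    """Estimated population share: each respondent counts for `weight` people."""
    return sum(o * w for o, w in zip(outcomes, weights)) / sum(weights)

# Hypothetical sample: respondents with the outcome were oversampled,
# so each of them represents fewer people in the population.
outcomes = [1, 0, 1, 0, 0]
weights = [100, 300, 50, 400, 150]

print("unweighted:", unweighted_proportion(outcomes))
print("weighted:  ", weighted_proportion(outcomes, weights))
```

With equal weights the two estimates coincide, so any gap between them reflects how unevenly the sample represents the population.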

Read the paper (PDF)

Tyler Smith, National University

Besa Smith, Analydata