With the advent of the exciting new hybrid field of Data Science, programming and data management skills are in greater demand than ever and have never been easier to attain. Online resources like codecademy and w3schools offer a host of tutorials and assistance to those looking to develop their programming abilities and knowledge. Though their content is limited to languages and tools suited mostly for web developers, the value and quality of these sites are undeniable. To this end, similar tutorials for other free-to-use software applications are springing up. The interactivity of these tutorials elevates them above most, if not all, other out-of-classroom learning tools. The process of learning programming or a new language can be quite disjointed when trying to pair a textbook or similar walk-through material with matching coding tasks and problems. These sites unify these pieces for users by presenting them with a series of short, simple lessons that always require the user to demonstrate their understanding in a coding exercise before progressing. After teaching SAS® in a classroom environment, I became fascinated by the potential for a similar student-driven approach to learning SAS. This could afford me more time to provide individualized attention, as well as open up additional class time to more advanced topics. In this talk, I discuss my development of a series of SAS scripts that walk the user through learning the basics of SAS and that involve programming at every step of the process. This collection of scripts should serve as a self-contained, pseudo-interactive course in SAS basics that students could be asked to complete on their own in a few weeks, leaving the remainder of the term to be spent on more challenging, realistic tasks.
Hunter Glanz, California Polytechnic State University
The new and highly anticipated SAS® Output Delivery System (ODS) destination for Microsoft Excel is finally here! Available as a production feature in the third maintenance release of SAS® 9.4 (TS1M3), this new destination generates native Excel (XLSX) files that are compatible with Microsoft Office 2010 or later. This paper is written for anyone, from entry-level programmers to business analysts, who uses the SAS® System and Microsoft Excel to create reports. The discussion covers features and benefits of the new Excel destination, differences between the Excel destination and the older ExcelXP tagset, and functionality that exists in the ExcelXP tagset that is not available in the Excel destination. These topics are all illustrated with meaningful examples. The paper also explains how you can bridge the gap that exists as a result of differences in the functionality between the destination and the tagset. In addition, the discussion outlines when it is beneficial for you to use the Excel destination versus the ExcelXP tagset, and vice versa. After reading this paper, you should be able to make an informed decision about which tool best meets your needs.
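As a minimal sketch of the destination described above (the output path, sheet name, and reporting procedure are arbitrary choices, not taken from the paper), the Excel destination is simply opened and closed around the report code:

   ods excel file="c:\reports\cars.xlsx"
       options(sheet_name="Cars" embedded_titles="yes");
   title "Average City and Highway MPG by Vehicle Type";
   proc means data=sashelp.cars mean maxdec=1;
      class type;
      var mpg_city mpg_highway;
   run;
   ods excel close;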
Chevell Parker, SAS
This paper presents a SAS® macro for generating random numbers from the skew-normal and skew-t distributions, as well as the quantiles of these distributions. The results are similar to those generated by the sn package in R.
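The macro itself is not reproduced here, but the idea behind the skew-normal generator can be sketched in a short DATA step (the shape parameter, seed, and sample size are arbitrary): by a standard result, if U0 and U1 are independent standard normal variables and delta = alpha/sqrt(1+alpha**2), then delta*|U0| + sqrt(1-delta**2)*U1 follows a skew-normal distribution with shape parameter alpha.

   data skewnorm;
      alpha = 4;                                        /* shape parameter */
      delta = alpha / sqrt(1 + alpha**2);
      call streaminit(20160418);
      do i = 1 to 10000;
         u0 = rand('NORMAL');
         u1 = rand('NORMAL');
         z  = delta*abs(u0) + sqrt(1 - delta**2)*u1;    /* skew-normal draw */
         output;
      end;
      keep z;
   run;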
Alan Silva, University of Brasilia
Paulo Henrique Dourado da Silva, Banco do Brasil
Geographically Weighted Negative Binomial Regression (GWNBR) was developed by Silva and Rodrigues (2014). It is a generalization of the Geographically Weighted Poisson Regression (GWPR) proposed by Nakaya and others (2005) and of the Poisson and negative binomial regressions. This paper presents a SAS® macro, coded in SAS/IML® software, to estimate the GWNBR model and shows how to use the GMAP procedure to draw the maps.
Alan Silva, University of Brasilia
Thais Rodrigues, University of Brasilia
The world's capacity to store and analyze data has increased in ways that would have been inconceivable just a couple of years ago. As a result, governments now collect large-scale data, until recently for purely administrative purposes. This study used comprehensive data files on education. The purpose of this study was to examine compulsory courses for a bachelor's degree in the economics program at the University of Copenhagen. The difficulty and use of the grading scale were compared across the courses by using the new IRT procedure, which was introduced in SAS/STAT® 13.1. Further, the latent ability traits estimated for all students in the sample by PROC IRT are used as predictors in a logistic regression model. The hypothesis of interest is that students with a lower ability trait have a greater probability of dropping out of the university program than successful students. Administrative data from one cohort of students in the economics program at the University of Copenhagen was used (n=236). Three unidimensional Item Response Theory models, two dichotomous and one polytomous, were introduced. It turns out that the polytomous Graded Response model does the best job of fitting the data. The findings suggest that in order to receive the highest possible grade, the highest level of student ability is needed for the course exam in the first-year course Descriptive Economics A. In contrast, the third-year course Econometrics C is the easiest course in which to receive a top grade. In addition, this study found that as estimated student ability decreases, the probability of a student dropping out of the bachelor's degree program increases drastically. However, contrary to expectations, some students with high ability levels also end up dropping out.
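A hedged sketch of the two modeling steps described above (the data set, item names, and dropout indicator are hypothetical, and the name of the factor-score variable that PROC IRT writes to the OUT= data set should be verified before use):

   proc irt data=exam_items resfunc=graded out=scores;   /* graded response model */
      var exam1-exam12;
   run;

   proc logistic data=scores;
      /* _Factor1 is assumed here to be the estimated latent ability score */
      model dropout(event='1') = _Factor1;
   run;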
Sara Armandi, University of Copenhagen
When dealing with non-normal categorical response variables, logistic regression is a robust method for modeling the relationship between categorical outcomes and different predictors without assuming a linear relationship between them. Within such models, the categorical outcome might be binary, multinomial, or ordinal, and predictors might be continuous or categorical. Another complexity that might be added to such studies is when data is longitudinal, such as when outcomes are collected at multiple follow-up times. Learning about modeling such data within any statistical method is beneficial because it enables researchers to look at changes over time. This study looks at several methods of modeling binary and categorical response variables within regression models by using real-world data. Starting with the simplest case of binary outcomes through ordinal outcomes, this study looks at different modeling options within SAS® and includes longitudinal cases for each model. To assess binary outcomes, the current study models binary data in the absence and presence of correlated observations under regular logistic regression and mixed logistic regression. To assess multinomial outcomes, the current study uses multinomial logistic regression. When responses are ordered, ordinal logistic regression is required because it allows for interpretations based on inherent rankings. Different logit functions for this model include the cumulative logit, adjacent-category logit, and continuation ratio logit. Each of these models is also considered for longitudinal (panel) data using methods such as mixed models and Generalized Estimating Equations (GEE). The final consideration, which cannot be addressed by GEE, is the conditional logit to examine bias due to omitted explanatory variables at the cluster level. Different procedures for the aforementioned models within SAS® 9.4 are explored, and their strengths and limitations are specified for applied researchers working with similar data characteristics. These procedures include PROC LOGISTIC, PROC GLIMMIX, PROC GENMOD, PROC NLMIXED, and PROC PHREG.
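Two of the modeling options mentioned above can be sketched as follows (data set, response, and predictor names are hypothetical): a cumulative logit model for an ordinal outcome with PROC LOGISTIC, and a random-intercept (mixed) logistic model for repeated binary measurements with PROC GLIMMIX.

   /* Ordinal outcome: cumulative logit model */
   proc logistic data=visits;
      class treatment(param=ref);
      model severity = treatment age / link=clogit;
   run;

   /* Longitudinal binary outcome: subject-level random intercept */
   proc glimmix data=visits_long method=laplace;
      class patient_id treatment;
      model improved(event='1') = treatment visit_time / dist=binary link=logit solution;
      random intercept / subject=patient_id;
   run;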
Niloofar Ramezani, University of Northern Colorado
Massive Open Online Courses (MOOC) have attracted increasing attention in educational data mining research areas. MOOC platforms provide free higher education courses to Internet users worldwide. However, MOOCs have high enrollment but notoriously low completion rates. The goal of this study is to apply frequentist and Bayesian logistic regression to investigate whether and how students' engagement, intentions, education levels, and other demographics are conducive to course completion on MOOC platforms. The original data used in this study came from an online eight-week course titled Big Data in Education, taught within the Coursera platform (MOOC) by Teachers College, Columbia University. The data sets for analysis were created from three different sources--clickstream data, a pre-course survey, and homework assignment files. The SAS system provides multiple procedures to perform logistic regression, each with different features and functions. In this study, we apply two approaches--frequentist and Bayesian logistic regression--to the MOOC data. PROC LOGISTIC is used for the frequentist approach, and PROC GENMOD is used for the Bayesian analysis. The results obtained from the two approaches are compared. All the statistical analyses are conducted in SAS® 9.3. Our preliminary results show that MOOC students with higher course engagement and higher motivation are more likely to complete the MOOC course.
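A hedged sketch of the two approaches used above (variable names are illustrative, not the study's actual variables; the BAYES statement uses its default priors unless others are specified):

   /* Frequentist logistic regression */
   proc logistic data=mooc;
      class ed_level(param=ref);
      model completed(event='1') = engagement intent ed_level;
   run;

   /* Bayesian logistic regression via the BAYES statement */
   proc genmod data=mooc;
      class ed_level;
      model completed(event='1') = engagement intent ed_level / dist=binomial link=logit;
      bayes seed=20160418 nbi=2000 nmc=10000 outpost=posterior;
   run;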
Yan Zhang, Educational Testing Service
Ryan Baker, Columbia University
Yoav Bergner, Educational Testing Service
Bayesian inference for complex hierarchical models with smoothing splines is typically intractable, requiring approximate inference methods for use in practice. Markov Chain Monte Carlo (MCMC) is the standard method for generating samples from the posterior distribution. However, for large or complex models, MCMC can be computationally intensive, or even infeasible. Mean Field Variational Bayes (MFVB) is a fast deterministic alternative to MCMC. It provides an approximating distribution that has minimum Kullback-Leibler distance to the posterior. Unlike MCMC, MFVB efficiently scales to arbitrarily large and complex models. We derive MFVB algorithms for Gaussian semiparametric multilevel models and implement them in SAS/IML® software. To improve speed and memory efficiency, we use block decomposition to streamline the estimation of the large sparse covariance matrix. Through a series of simulations and real data examples, we demonstrate that the inference obtained from MFVB is comparable to that of PROC MCMC. We also provide practical demonstrations of how to estimate additional posterior quantities of interest from MFVB either directly or via Monte Carlo simulation.
Jason Bentley, The University of Sydney
Cathy Lee, University of Technology Sydney
Higher education is taking on big data initiatives to help improve student outcomes and operational efficiencies. This paper shows how one institution went from white paper to action. Challenges such as developing a common definition of big data, gaining access to certain types of data, and bringing together institutional groups that had little previous contact are highlighted, along with the stepwise, collaborative approach taken. How the data was eventually brought together and analyzed to generate actionable results is also covered. Learn how to make big data a part of your analytics toolbox.
Stephanie Thompson, Datamum
In the biopharmaceutical industry, biostatistics plays an important and essential role in the research and development of drugs, diagnostics, and medical devices. Familiarity with biostatistics combined with knowledge of SAS® software can lead to a challenging and rewarding career that also improves patients' lives. This paper provides a broad overview of the different types of jobs and career paths available, discusses the education and skill sets needed for each, and presents some ideas for overcoming entry barriers to careers in biostatistics and clinical SAS programming.
Justina Flavin, Independent Consultant
Competing-risks analyses are methods for analyzing the time to a terminal event (such as death or failure) and its cause or type. The cumulative incidence function CIF(j, t) is the probability of death by time t from cause j. New options in the LIFETEST procedure provide for nonparametric estimation of the CIF from event times and their associated causes, allowing for right-censoring when the event and its cause are not observed. Cause-specific hazard functions that are derived from the CIFs are the analogs of the hazard function when only a single cause is present. Death by one cause precludes occurrence of death by any other cause, because an individual can die only once. Incorporating explanatory variables in hazard functions provides an approach to assessing their impact on overall survival and on the CIF. This semiparametric approach can be analyzed in the PHREG procedure. The Fine-Gray model defines a subdistribution hazard function that has an expanded risk set, which consists of individuals at risk of the event by any cause at time t, together with those who experienced the event before t from any cause other than the cause of interest j. Finally, with additional assumptions a full parametric analysis is also feasible. We illustrate the application of these methods with empirical data sets.
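The LIFETEST and PHREG features described above can be sketched as follows (the data set and variable names are hypothetical; the value 0 denotes censoring and cause 1 is the cause of interest):

   /* Nonparametric CIF estimates and Gray's test across groups */
   proc lifetest data=followup plots=cif(test);
      time years*cause(0) / eventcode=1;
      strata treatment;
   run;

   /* Fine-Gray subdistribution hazard model for cause 1 */
   proc phreg data=followup;
      class treatment;
      model years*cause(0) = treatment age / eventcode=1;
   run;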
Joseph Gardiner, Michigan State University
SAS® Grid Manager, as well as other grid computing technologies, has a set of great capabilities that we, as IT professionals, love to have in our systems. This technology increases high availability, allows parallel processing, accommodates increasing demand by scaling out, and offers other features that make life better for those managing and using these environments. However, even when business users take advantage of these features, they are more concerned about the business side of the problem. Most of the time, business groups hold the budgets and are key stakeholders for any SAS Grid Manager project. Therefore, it is crucial to demonstrate to business users how they will benefit from the new technologies, how the features will improve their daily operations, help them be more efficient and productive, and help them achieve better results. This paper guides you through a process to create a strong and persuasive business plan that translates the technology features of SAS Grid Manager into business benefits.
Marlos Bosso, SAS
Metadata is an integral and critical part of any environment. Metadata facilitates resource discovery and provides unique identification of every single digital component of a system, simple to complex. SAS® Visual Analytics, one of the most powerful analytics visualization platforms, leverages the power of metadata to provide a plethora of functionalities for all types of users. The possibilities range from real-time advanced analytics and power-user reporting to advanced deployment features for a robust and scalable distributed platform to internal and external users. This paper explains the best practices and advanced approaches for designing and managing metadata for a distributed global SAS Visual Analytics environment. Designing and building the architecture of such an environment requires attention to important factors like user groups and roles, access management, data protection, data volume control, performance requirements, and so on. This paper covers how to build a sustainable and scalable metadata architecture through a top-down hierarchical approach. It helps SAS Visual Analytics Data Administrators to improve the platform benchmark through memory mapping, perform administrative data load (AUTOLOAD, Unload, Reload-on-Start, and so on), monitor artifacts of distributed SAS® LASR™ Analytic Servers on co-located Hadoop Distributed File System (HDFS), optimize high-volume access via FullCopies, build customized FLEX themes, and so on. It showcases practical approaches to managing distributed SAS LASR Analytic Servers, offering guest access for global users, managing host accounts, enabling Mobile BI, using power-user reporting features, customizing formats, enabling home page customization, using best practices for environment migration, and much more.
Ratul Saha, Kavi Associates
Vimal Raj Arockiasamy, Kavi Associates
Vignesh Balasubramanian, Kavi Global
Phishing is the attempt of a malicious entity to acquire personal, financial, or otherwise sensitive information such as user names and passwords from recipients through the transmission of seemingly legitimate emails. By quickly alerting recipients of known phishing attacks, an organization can reduce the likelihood that a user will succumb to the request and unknowingly provide sensitive information to attackers. Methods to detect phishing attacks typically require the body of each email to be analyzed. However, most academic institutions do not have the resources to scan individual emails as they are received, nor do they wish to retain and analyze message body data. Many institutions simply rely on the education and participation of recipients within their network. Recipients are encouraged to alert information security (IS) personnel of potential attacks as they are delivered to their mailboxes. This paper explores a novel and more automated approach that uses SAS® to examine email header and transmission data to determine likely phishing attempts that can be further analyzed by IS personnel. Previously, a collection of 2,703 emails from an external filtering appliance was examined with moderate success. This paper focuses on the gains from analyzing an additional 50,000 emails, with the inclusion of an additional 30 known attacks. Real-time email traffic is exported from Splunk Enterprise into SAS for analysis. The resulting model helps determine whether IS personnel can be alerted to potential phishing attempts faster than by a user simply forwarding a suspicious email to IS personnel.
Taylor Anderson, University of Alabama
Denise McManus, University of Alabama
As SAS® programmers, we often develop listings, graphs, and reports that need to be delivered frequently to our customers. We might decide to manually run the program every time we get a request, or we might easily schedule an automatic task to send a report at a specific date and time. Both scenarios have some disadvantages. If the report is manual, we have to find and run the program every time someone requests an updated version of the output. It takes some time and it is not the most interesting part of the job. If we schedule an automatic task in Windows, we still sometimes get an email from the customers because they need the report immediately. That means that we have to find and run the program for them. This paper explains how we developed an on-demand report platform using SAS® Enterprise Guide®, SAS® Web Application Server, and stored processes. We had developed many reports for different customer groups, and we were getting more and more emails from them asking for updated versions of their reports. We felt we were not using our time wisely and decided to create an infrastructure where users could easily run their programs through a web interface. The tool that we created enables SAS programmers to easily release on-demand web reports with minimum programming. It has web interfaces developed using stored processes for the administrative tasks, and it also automatically customizes the front end based on the user who connects to the website. One of the challenges of the project was that certain reports had to be available to a specific group of users only.
Romain Miralles, Genomic Health
At a community college, there was a need for college employees to quickly and easily find available classroom time slots for the purposes of course scheduling. The existing method was time-consuming and inefficient, and there were no available IT resources to implement a solution. The Office of Institutional Research, which had already been delivering reports using SAS® Enterprise BI Server, created a report called Find an Open Room to fill the need. By combining SAS® programming techniques, a scheduled SAS® Enterprise Guide® project, and a SAS® Web Report Studio report delivered within the SAS® Information Delivery Portal, a report was created that allowed college users to search for available time slots.
Nicole Jagusztyn, Hillsborough Community College
Data-driven decision making is critical for any organization to thrive in this fiercely competitive world. The decision-making process has to be accurate and fast in order to stay a step ahead of the competition. One major problem organizations face is long load times when importing or processing data. Reducing the data loading time can help organizations perform faster analysis and thereby respond quickly. In this paper, we compared the methods that can import data of a particular file type in the shortest possible time and thereby increase the efficiency of decision making. SAS® takes input from various file types (such as XLS, CSV, XLSX, ACCESS, and TXT) and converts that input into SAS data sets. To perform this task, SAS provides multiple solutions (such as the IMPORT procedure, the INFILE statement, and the LIBNAME engine) to import the data. We observed the processing times taken by each method for different file types with a data set containing 65,535 observations and 11 variables. We executed the procedure multiple times to check for variation in processing time. From these tests, we recorded the minimum processing time for the combination of procedure and file type. From our analysis of the processing times taken by each importing technique, we observed that the fastest methods are the INFILE statement for CSV and TXT files, the LIBNAME engine for XLS and XLSX files, and PROC IMPORT for ACCESS files.
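For reference, the three importing techniques compared above look roughly like the examples below (file paths and the INPUT statement are hypothetical and must match the layout of the actual file; the XLSX LIBNAME engine requires SAS/ACCESS® Interface to PC Files):

   /* 1. PROC IMPORT for a CSV file */
   proc import datafile="c:\data\sample.csv" out=work.via_import
               dbms=csv replace;
      guessingrows=max;
   run;

   /* 2. DATA step with an INFILE statement for the same CSV file */
   data work.via_infile;
      infile "c:\data\sample.csv" dsd firstobs=2;
      input id name :$20. amount date :yymmdd10.;
      format date yymmdd10.;
   run;

   /* 3. LIBNAME engine for an XLSX file */
   libname xl xlsx "c:\data\sample.xlsx";
   data work.via_libname;
      set xl.Sheet1;
   run;
   libname xl clear;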
Divya Dadi, Oklahoma State University
Rahul Jhaver, Oklahoma State University
As Data Management professionals, you have to comply with new regulations and controls. One such regulation is Basel Committee on Banking Supervision (BCBS) 239. To respond to these new demands, you have to put processes and methods in place to automate metadata collection and analysis, and to provide rigorous documentation around your data flows. You also have to deal with many aspects of data management including data access, data manipulation (ETL and other), data quality, data usage, and data consumption, often from a variety of toolsets that are not necessarily from a single vendor. This paper shows you how to use SAS® technologies to support data governance requirements, including third party metadata collection and data monitoring. It highlights best practices such as implementing a business glossary and establishing controls for monitoring data. Attend this session to become familiar with the SAS tools used to meet the new requirements and to implement a more managed environment.
Jeff Stander, SAS
The experimental item response theory procedure (PROC IRT), included in the recently released SAS/STAT® 13.1 and 13.2, enables item response modeling and latent trait estimation in SAS®. PROC IRT performs item parameter calibration and latent trait estimation for a wide spectrum of educational and psychological research. This paper evaluates the performance of PROC IRT in terms of item parameter recovery under various testing conditions. The pros and cons of PROC IRT versus BILOG-MG 3.0 are presented. For practitioners of IRT models, the development of IRT-related analysis in SAS is inspiring and offers a great choice to the growing population of IRT users. A shift to SAS can be beneficial based on several features of SAS: its flexibility in data management, its power in data analysis, its convenient output delivery, and its increasing richness in graphical presentation. It is critical to ensure the quality of item parameter calibration and trait estimation before you can continue with other components, such as test scoring, test form construction, IRT equating, and so on.
Yi-Fang Wu, ACT, Inc.
The use of logistic models for independent binary data has relied first on asymptotic theory and later on exact distributions for small samples, as discussed by Troxler, Lalonde, and Wilson (2011). Exact analysis of logistic models for dependent data is less common and is usually presented only for the case of one-stage clustering. We present a SAS® macro that allows the testing of hypotheses using exact methods in the case of one-stage and two-stage clustering for small samples. The accuracy of the method and the results are compared to results obtained using an R program.
Kyle Irimata, Arizona State University
Jeffrey Wilson, Arizona State University
Do you create complex reports using PROC REPORT? Are you confused by the COMPUTE BLOCK feature of PROC REPORT? Are you even aware of it? Maybe you already produce reports using PROC REPORT, but suddenly your boss needs you to modify some of the values in one or more of the columns. Maybe your boss needs to see the values of some rows in boldface and others highlighted in a stylish yellow. Perhaps one of the columns in the report needs to display a variety of fashionable formats (some with varying decimal places and some without any decimals). Maybe the customer needs to see a footnote in specific cells of the report. Well, if this sounds familiar, then come take a look at the COMPUTE BLOCK of PROC REPORT. This paper shows a few tips and tricks for using the COMPUTE block with conditional IF/THEN logic to make your reports stylish and fashionable. The COMPUTE BLOCK allows you to use DATA step code within PROC REPORT to provide customization and style to your reports. We'll see how the Census Bureau produces a stylish demographic profile for customers of its Special Census program using PROC REPORT with the COMPUTE BLOCK. The paper focuses on how to use the COMPUTE BLOCK to create this stylish Special Census profile. The paper shows quick tips and simple code for handling multiple formats within the same column, making the values in the Total rows boldface, applying traffic lighting, and adding footnotes to any cell based on the column or row. The Special Census profile report is an Excel table created with ODS tagsets.ExcelXP that is stylish and fashionable, thanks in part to the COMPUTE BLOCK.
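A minimal sketch of the COMPUTE block techniques described above (this example uses the SASHELP.PRDSALE sample table rather than Census data, and the highlighting threshold is arbitrary):

   ods tagsets.excelxp file="profile.xml";
   proc report data=sashelp.prdsale nowd;
      column country actual;
      define country / group;
      define actual  / analysis sum 'Actual Sales' format=dollar12.;
      compute actual;
         /* traffic lighting: highlight and bold large totals */
         if actual.sum > 200000 then
            call define(_col_, 'style',
                        'style=[backgroundcolor=yellow fontweight=bold]');
      endcomp;
      rbreak after / summarize style=[fontweight=bold];   /* boldface Total row */
   run;
   ods tagsets.excelxp close;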
Chris Boniface, Census Bureau
If your organization already deploys one or more software solutions via Amazon Web Services (AWS), you know the value of the public cloud. AWS provides a scalable public cloud with a global footprint, allowing users access to enterprise software solutions anywhere at any time. Although SAS® began long before AWS was even imagined, many organizations that rely on SAS are moving their local SAS analytics into the public AWS cloud, alongside other software hosted by AWS. SAS® Solutions OnDemand has assisted organizations in this transition. In this paper, we describe how we extended our enterprise hosting business to AWS. We describe the open-source automation framework on which SAS Solutions OnDemand built our automation stack, which simplified the process of migrating a SAS implementation. We'll provide the technical details of our automation and network footprint, a discussion of the technologies we chose along the way, and a list of lessons learned.
Ethan Merrill, SAS
Bryan Harkola, SAS
Project management is a hot topic across many industries, and multiple commercial software applications for managing projects are available. The reality, however, is that the majority of project management software is not practical for daily use. SAS® has a solution for this issue that can be used for managing projects graphically in real time. This paper introduces a new paradigm for project management using the SAS® Graph Template Language (GTL). Using GTL, SAS clients can visualize, in real time, resource assignments, task plans, delivery tracking, and project status across multiple project levels for more efficient project management.
Zhouming(Victor) Sun, Medimmune
Have you ever wondered how to get the most from Web 2.0 technologies in order to visualize SAS® data? How to make those graphs dynamic, so that users can explore the data in a controlled way, without needing prior knowledge of SAS products or data science? Wonder no more! In this session, you learn how to turn basic sashelp.stocks data into a snazzy HighCharts stock chart in which a user can review any time period, zoom in and out, and export the graph as an image. All of these features are achieved with only two DATA steps and one SORT procedure, in 57 lines of SAS code.
Vasilij Nevlev, Analytium Ltd
Microsoft Visual Basic Scripting Edition (VBScript) and SAS® software are each powerful tools in their own right. These two technologies can be combined so that SAS code can call a VBScript program or vice versa. This gives a programmer the ability to automate SAS tasks; traverse the file system; send emails programmatically via Microsoft Outlook or SMTP; manipulate Microsoft Word, Microsoft Excel, and Microsoft PowerPoint files; get web data; and more. This paper presents example code to demonstrate each of these capabilities.
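A small example in the spirit of the paper (the path and pop-up message are hypothetical, and the XCMD system option must be enabled): a DATA step writes a VBScript file with PUT statements, and the X statement then runs it through the Windows script host.

   data _null_;
      file "c:\temp\notify.vbs";
      put 'Dim objShell';
      put 'Set objShell = CreateObject("WScript.Shell")';
      put 'objShell.Popup "SAS job finished", 5, "Status"';
   run;

   options noxwait noxsync;
   x 'cscript //nologo "c:\temp\notify.vbs"';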
Christopher Johnson, BrickStreet Insurance
In studies where randomization is not possible, imbalance in baseline covariates (confounding by indication) is a fundamental concern. Propensity score matching (PSM) is a popular method to minimize this potential bias, matching individuals who received treatment to those who did not, to reduce the imbalance in pre-treatment covariate distributions. PSM methods continue to advance, as computing resources expand. Optimal matching, which selects the set of matches that minimizes the average difference in propensity scores between mates, has been shown to outperform less computationally intensive methods. However, many find the implementation daunting. SAS/IML® software allows the integration of optimal matching routines that execute in R, for example, the R optmatch package. This presentation walks through performing optimal PSM in SAS® by implementing R functions, assessing whether covariate trimming is necessary prior to PSM. It covers the propensity score analysis in SAS, the matching procedure, and the post-matching assessment of covariate balance using SAS/STAT® 13.2 and SAS/IML procedures.
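A hedged outline of calling the optmatch package from SAS/IML (the data set, covariates, and R workflow are illustrative only; SAS/IML must be started with the RLANG option and R must be installed on the same machine):

   proc iml;
      call ExportDataSetToR("work.cohort", "cohort");      /* send SAS data to R */
      submit / R;
         library(optmatch)
         ps <- glm(treated ~ age + sex + comorbidity, family = binomial, data = cohort)
         cohort$pair <- pairmatch(ps, data = cohort)        # optimal 1:1 matching
      endsubmit;
      call ImportDataSetFromR("work.matched", "cohort");    /* bring matched data back */
   quit;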
Lucy D'Agostino McGowan, Vanderbilt University
Robert Greevy, Department of Biostatistics, Vanderbilt University
Considering that SAS® Grid Manager is becoming more and more popular, it is important to fulfill users' needs for a successful migration to a SAS® Grid environment. This paper focuses on key requirements and common issues for new SAS Grid users, especially those coming from a traditional environment. It describes a few common requirements, such as the need for a current working directory, changes to file system navigation in SAS® Enterprise Guide® with user-specified locations, and receiving job execution summary emails. The GRIDWORK directory introduced in SAS Grid Manager is a bit different from the traditional SAS WORK location; this paper explains how you can use the GRIDWORK location in a more user-friendly way. Sometimes users experience data set size differences during grid migration, and a few important reasons for these differences are demonstrated. We also demonstrate how to create new custom scripts to meet business needs and how to incorporate them with the SAS Grid Manager engine.
Piyush Singh, TATA Consultancy Services Ltd
Tanuj Gupta, TATA Consultancy Services
Prasoon Sangwan, Tata consultancy services limited
This paper presents the use of latent class analysis (LCA) to identify a set of mutually exclusive latent classes of individuals based on responses to a set of categorical observed variables. The LCA procedure, a user-defined SAS® procedure for conducting LCA and LCA with covariates, is demonstrated using follow-up data on substance use from Monitoring the Future panel data, a nationally representative sample of high school seniors who are followed at selected time points during adulthood. The demonstration includes guidance on data management prior to analysis, PROC LCA syntax requirements and options, and interpretation of output.
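Basic PROC LCA syntax looks roughly like the following (the item names, number of classes, and category counts are hypothetical; the procedure must be downloaded and installed separately because it is not shipped with SAS):

   proc lca data=mtf_followup;
      nclass 4;                        /* number of latent classes */
      items alc mj cig binge othdrg;   /* observed categorical indicators */
      categories 2 2 2 2 2;            /* response categories per item */
      seed 61242;
      nstarts 20;                      /* multiple random starts to avoid local maxima */
   run;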
Patricia Berglund, University of Michigan
Business Intelligence users analyze business data in a variety of ways. Seventy percent of business data contains location information. For in-depth analysis, it is essential to combine location information with mapping. New analytical capabilities are added to SAS® Visual Analytics, leveraging the new partnership with Esri, a leader in location intelligence and mapping. The new capabilities enable users to enhance the analytical insights from SAS Visual Analytics. This paper demonstrates and discusses the new partnership with Esri and the new capabilities added to SAS Visual Analytics.
Murali Nori, SAS
Himesh Patel, SAS
For SAS® Enterprise Guide® users, sometimes macro variables and their values need to be brought over to the local workspace from the server, especially when multiple data sets or outputs need to be written to separate files in a local drive. Manually retyping the macro variables and their values in the local workspace after they have been created on the server workspace would be time-consuming and error-prone, especially when we have quite a number of macro variables and values to bring over. Instead, this task can be achieved in an efficient manner by using dictionary tables and the CALL SYMPUT routine, as illustrated in more detail below. The same approach can also be used to bring macro variables and their values from the local to the server workspace.
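One hedged way to implement this: on the server session, snapshot the global macro variables from the SASHELP.VMACRO view into a data set; after copying that data set to the local session (for example, with SAS Enterprise Guide's upload and download tasks), re-create the macro variables with CALL SYMPUTX.

   /* Server session: capture global macro variables */
   data work.server_macros;
      set sashelp.vmacro;
      where scope = 'GLOBAL';
      /* note: values longer than 200 characters span multiple rows (see the OFFSET column) */
   run;

   /* Local session: re-create each macro variable and its value */
   data _null_;
      set work.server_macros;
      call symputx(name, value, 'G');
   run;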
Khoi To, Office of Planning and Decision Support, Virginia Commonwealth University
When analyzing data with SAS®, we often encounter missing or null values in data. Missing values can arise from the availability, collectibility, or other issues with the data. They represent the imperfect nature of real data. Under most circumstances, we need to clean, filter, separate, impute, or investigate the missing values in data. These processes can take up a lot of time, and they are annoying. For these reasons, missing values are usually unwelcome and need to be avoided in data analysis. There are two sides to every coin, however. If we can think outside the box, we can take advantage of the negative features of missing values for positive uses. Sometimes, we can create and use missing values to achieve our particular goals in data manipulation and analysis. These approaches can make data analyses convenient and improve work efficiency for SAS programming. This kind of creative and critical thinking is the most valuable quality for data analysts. This paper exploits real-world examples to demonstrate the creative uses of missing values in data analysis and SAS programming, and discusses the advantages and disadvantages of these methods and approaches. The illustrated methods and advanced programming skills can be used in a wide variety of data analysis and business analytics fields.
Justin Jia, Trans Union Canada
Shan Shan Lin, CIBC
You've heard all the talk about SAS® Visual Analytics--but maybe you are still confused about how the product would work in your SAS® environment. Many customers have the same points of confusion about what they need to do with their data, how to get data into the product, how SAS Visual Analytics would benefit them, and even whether they should be considering Hadoop or the cloud. In this paper, we cover the questions we are asked most often about implementation, administration, and usage of SAS Visual Analytics.
Tricia Aanderud, Zencos Consulting LLC
Ryan Kumpfmiller, Zencos Consulting
Nick Welke, Zencos Consulting
Inspired by Christianna Williams' paper on transitioning to PROC SQL from the DATA step, this paper aims to help SQL programmers transition to SAS® by using PROC SQL. SAS adapted the Structured Query Language (SQL) by means of PROC SQL back in SAS® 6. PROC SQL syntax closely resembles SQL. However, there are some SQL features that are not available in SAS. Throughout this paper, we outline common SQL tasks and how they might differ in PROC SQL. We also introduce useful SAS features that are not available in SQL. Topics covered are appropriate for novice SAS users.
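Two SAS-specific conveniences worth knowing when coming from SQL are sketched below: the CALCULATED keyword, which reuses a column alias within the same query, and the INTO clause, which stores query results in macro variables. Neither has a direct ANSI SQL equivalent.

   proc sql;
      /* CALCULATED references an alias defined earlier in the same SELECT */
      select name,
             height * 2.54               as height_cm,
             calculated height_cm / 100  as height_m
         from sashelp.class;

      /* INTO : stores a result in a macro variable for later use */
      select count(*) into :nstudents trimmed
         from sashelp.class;
   quit;

   %put NOTE: The class table has &nstudents students.;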
Barbara Ross, NA
Jessica Bennett, Snap Finance
The North Carolina Community College System office can quickly and easily enable colleges to compare their programs' success to that of programs at other colleges. Institutional researchers can now spend their days looking at trends, abnormalities, and comparisons with other colleges instead of digging for data to load into a Microsoft Excel spreadsheet. We look at performance measures and how programs are being graded using SAS® Visual Analytics.
Bill Schneider, North Carolina Community College System
SAS® software provides many DATA step functions that search and extract patterns from a character string, such as SUBSTR, SCAN, INDEX, TRANWRD, etc. Using these functions to perform pattern matching often requires you to use many function calls to match a character position. However, using the Perl regular expression (PRX) functions or routines in the DATA step improves pattern-matching tasks by reducing the number of function calls and making the program easier to maintain. This talk, in addition to discussing the syntax of Perl regular expressions, demonstrates many real-world applications.
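A small illustration of the PRX functions in a DATA step (the text and pattern are arbitrary examples):

   data _null_;
      text = 'Call (919) 555-1234 or (301) 555-9876 after 5 PM';

      /* PRXMATCH returns the position of the first phone-number pattern */
      if prxmatch('/\(\d{3}\)\s?\d{3}-\d{4}/', text) > 0 then
         put 'Phone number found in: ' text;

      /* PRXCHANGE masks every digit in the string */
      masked = prxchange('s/\d/#/', -1, text);
      put masked=;
   run;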
Arthur Li, City of Hope
Many inquisitive minds are filled with excitement and anticipation of a response every time they post a question on a forum. This paper explores the factors that impact the time to first response for questions posted in the SAS® Community forum. The factors are contributors' availability, the nature of the topic, and the number of contributors knowledgeable about that particular topic. The results from this project help SAS® users receive an estimated response time, and the SAS Community forum can use this information to answer several business questions, such as the following: What time of the year is likely to have an overflow of questions? Do specific topics receive delayed responses? On which days of the week is the community most active? To answer such questions, we built a web crawler using Python and Selenium to fetch data from the SAS Community forum, one of the largest analytics groups. We scraped over 13,443 queries and solutions from January 2014 to the present. We also captured several query-related attributes such as the number of replies, likes, views, bookmarks, and the number of people conversing on the query. Using different tools, we analyzed this data set after clustering the queries into 22 subtopics and found interesting patterns that can help the SAS Community forum in several ways, as presented in this paper.
Praveen Kumar Kotekal, Oklahoma State University
The Oklahoma State Department of Health (OSDH) conducts home visiting programs with families that need parental support. Domestic violence is one of the many screenings performed on these visits. The home visiting personnel are trained to do initial screenings; however, they do not have the extensive information required to treat or serve the participants in this arena. Understanding how demographics such as age, level of education, and household income, among others, are related to domestic violence might help home visiting personnel better serve their clients by modifying their questions based on these demographics. The objective of this study is to better understand the demographic characteristics of those in the home visiting programs who are identified with domestic violence. We also developed predictive models, such as logistic regression and decision trees, based on the influence of demographics on domestic violence. The study population consists of all the women who participated in the Children First Program of the OSDH from 2012 to 2014. The data set contains 1,750 observations collected during screening by the home visiting personnel over the two-year period. In addition, participants must have completed the Demographic form as well as the Relationship Assessment form at the time of intake. Univariate and multivariate analyses have been performed to discover the influence that age, education, and household income have on domestic violence. From the initial analysis, we can see that women who are younger than 25 years old, who haven't completed high school, and who are somewhat dependent on their husbands or partners for money are most vulnerable. We have even segmented the clients based on the likelihood of domestic violence.
Soumil Mukherjee, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Miriam McGaugh, Oklahoma State Department of Health
In a data warehousing system, change data capture (CDC) plays an important part, not just in making the data warehouse (DWH) aware of the change but also in providing a means of flowing the change to the DWH marts and reporting tables so that we see the current and latest version of the truth. This and slowly changing dimensions (SCD) create a cycle that runs the DWH and provides valuable insights into history and for future decision making. What if the source has no CDC? It would be an ETL nightmare to identify the exact change and report the absolute truth. If these two processes can be combined into a single process, where one transform does both jobs of identifying the change and applying the change to the DWH, then we can save significant processing time and valuable system resources. Hence, I came up with a hybrid SCD-with-CDC approach. My paper focuses on sources that DO NOT have CDC and shows how to perform SCD Type 2 on such records without worrying about data duplication or increased processing times.
Vishant Bhat, University of Newcastle
Tony Blanch, SAS Consultant
Horizontal data sorting is a very useful SAS® technique in advanced data analysis when you are using SAS programming. Two years ago (SAS® Global Forum Paper 376-2013), we presented and illustrated various methods and approaches to perform horizontal data sorting, and we demonstrated its valuable application in strategic data reporting. However, this technique can also be used as a creative analytic method in advanced business analytics. This paper presents and discusses its innovative and insightful applications in product purchase sequence analyses such as product opening sequence analysis, product affinity analysis, next best offer analysis, time-span analysis, and so on. Compared to other analytic approaches, the horizontal data sorting technique has the distinct advantages of being straightforward, simple, and convenient to use. This technique also produces easy-to-interpret analytic results. Therefore, the technique can have a wide variety of applications in customer data analysis and business analytics fields.
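One hedged way to sort values horizontally within each observation is the CALL SORTN routine applied to an array, sketched below with hypothetical purchase-amount variables:

   data sorted_wide;
      set purchases;            /* one row per customer with purchase amounts p1-p6 */
      array p {6} p1-p6;
      call sortn(of p{*});      /* sorts the six values in ascending order within the row */
   run;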
Justin Jia, Trans Union Canada
Shan Shan Lin, CIBC
DonorsChoose.org is a nonprofit organization that allows individuals to donate directly to public school classroom projects. Teachers from public schools post a request for funding a project with a short essay describing it. Donors all around the world can look at these projects when they log in to DonorsChoose.org and donate to projects of their choice. The idea is to have a personalized recommendation webpage for all donors, showing them the projects they would prefer, like, and love to donate to. Implementing a recommender system for the DonorsChoose.org website will improve the user experience and help more projects meet their funding goals. It will also help us understand donors' preferences and deliver to them what they want or value. One type of recommendation system can be designed by predicting projects that will be less likely to meet funding goals, segmenting and profiling the donors, and using that information to recommend the right projects when donors log in to DonorsChoose.org.
Heramb Joshi, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Sandeep Chittoor, Student
Vignesh Dhanabal, Oklahoma State University
Sharat Dwibhasi, Oklahoma State University
Representational State Transfer (REST) is being used across the industry for designing networked applications to provide lightweight and powerful alternatives to web services such as SOAP and Web Services Description Language (WSDL). Since REST is based entirely on HTTP, SAS® provides everything you need to make REST calls and to process structured and unstructured data alike. This paper takes a look at how some enhancements in the third maintenance release of SAS® 9.4 can benefit you in this area. Learn how the HTTP procedure and other SAS language features provide everything you need to simply and securely use REST.
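A minimal PROC HTTP call using the HEADERS statement added in the third maintenance release of SAS 9.4 (the URL is a public test endpoint used only for illustration):

   filename resp temp;

   proc http
      url="https://httpbin.org/get"
      method="GET"
      out=resp;
      headers "Accept"="application/json";
   run;

   /* echo the JSON response to the log */
   data _null_;
      infile resp;
      input;
      put _infile_;
   run;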
Joseph Henry, SAS
Longitudinal data with time-dependent covariates is not readily analyzed as there are inherent, complex correlations due to the repeated measurements on the sampling unit and the feedback process between the covariates in one time period and the response in another. A generalized method of moments (GMM) logistic regression model (Lalonde, Wilson, and Yin 2014) is one method for analyzing such correlated binary data. While GMM can account for the correlation due to both of these factors, it is imperative to identify the appropriate estimating equations in the model. Cai and Wilson (2015) developed a SAS® macro using SAS/IML® software to fit GMM logistic regression models with extended classifications. In this paper, we expand the use of this macro to allow for continuous responses and as many repeated time points and predictors as possible. We demonstrate the use of the macro through two examples, one with binary response and another with continuous response.
Katherine Cai, Arizona State University
Jeffrey Wilson, Arizona State University
Mobile devices are an integral part of a business professional's life. These mobile devices are getting increasingly powerful in terms of processor speeds and memory capabilities. Business users can benefit from a more analytical visualization of the data along with their business context. The new SAS® Mobile BI contains many enhancements that facilitate the use of SAS® Analytics in the newest version of SAS® Visual Analytics. This paper demonstrates how to use the new analytical visualization that has been added to SAS Mobile BI from SAS Visual Analytics, for a richer and more insightful experience for business professionals on the go.
Murali Nori, SAS
Coronal mass ejections (CMEs) are massive explosions of magnetic field and plasma from the Sun. While responsible for the northern lights, these eruptions can cause geomagnetic storms and cataclysmic damage to Earth's telecommunications systems and power grid infrastructures. Hence, it is imperative to construct highly accurate predictive processes to determine whether an incoming CME will produce devastating effects on Earth. One such process, called stacked generalization, trains a variety of models, or base-learners, on a data set. Then, using the predictions from the base-learners, another model is trained to learn from the metadata. The goal of this meta-learner is to deduce information about the biases from the base-learners to make more accurate predictions. Studies have shown success in using linear methods, especially within regularization frameworks, at the meta-level to combine the base-level predictions. Here, SAS® Enterprise Miner™ 13.1 is used to reinforce the advantages of regularization via the Least Absolute Shrinkage and Selection Operator (LASSO) on this type of metadata. This work compares the LASSO model selection method to other regression approaches when predicting the occurrence of strong geomagnetic storms caused by CMEs.
Taylor Larkin, The University of Alabama
Denise McManus, University of Alabama
Teaching online courses is very different from teaching in the classroom setting. Developing and delivering an effective online class entails more than just transferring traditional course materials into written documents and posting them in a course shell. This paper discusses the author's experience in converting a traditional hands-on introductory SAS® programming class into an online course and presents some ideas for promoting successful learning and knowledge transfer when teaching online.
Justina Flavin, Independent Consultant
Big data, small data--but what about when you have no data? Survey data commonly include missing values due to nonresponse. Adjusting for nonresponse in the analysis stage might lead different analysts to use different, and inconsistent, adjustment methods. To provide the same complete data to all the analysts, you can impute the missing values by replacing them with reasonable nonmissing values. Hot-deck imputation, the most commonly used survey imputation method, replaces the missing values of a nonrespondent unit by the observed values of a respondent unit. In addition to performing traditional cell-based hot-deck imputation, the SURVEYIMPUTE procedure, new in SAS/STAT® 14.1, also performs more modern fully efficient fractional imputation (FEFI). FEFI is a variation of hot-deck imputation in which all potential donors in a cell contribute their values. Filling in missing values is only a part of PROC SURVEYIMPUTE. The real trick is to perform analyses of the filled-in data that appropriately account for the imputation. PROC SURVEYIMPUTE also creates a set of replicate weights that are adjusted for FEFI. Thus, if you use the imputed data from PROC SURVEYIMPUTE along with the replicate methods in any of the survey analysis procedures--SURVEYMEANS, SURVEYFREQ, SURVEYREG, SURVEYLOGISTIC, or SURVEYPHREG--you can be confident that inferences account not only for the survey design, but also for the imputation. This paper discusses different approaches for handling nonresponse in surveys, introduces PROC SURVEYIMPUTE, and demonstrates its use with real-world applications. It also discusses connections with other ways of handling missing values in SAS/STAT.
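A rough sketch of invoking FEFI is shown below; it is based on my reading of the SAS/STAT 14.1 syntax and should be verified against the documentation, and the data set, imputation cells, and analysis variables are hypothetical.

   proc surveyimpute data=survey method=fefi varmethod=jackknife;
      var income hours_worked;     /* variables with missing values to impute */
      cells region;                /* imputation cells */
      weight samp_weight;
      output out=survey_imputed outjkcoefs=jkcoefs;
   run;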
Pushpal Mukhopadhyay, SAS
Universities, government agencies, and non-profits all require various levels of data analysis, data manipulation, and data management. This requires a workforce that knows how to use statistical packages. The server-based SAS® OnDemand for Academics: Studio offerings are excellent tools for teaching SAS in both postsecondary education and in professional continuing education settings. SAS OnDemand for Academics: Studio can be used to share resources; teach users using Windows, UNIX, and Mac OS computers; and teach in various computing environments. This paper discusses why one might use SAS OnDemand for Academics: Studio instead of SAS® University Edition, SAS® Enterprise Guide®, or Base SAS® for education and provides examples of how SAS OnDemand for Academics: Studio has been used for in-person and virtual training.
Charlotte Baker, Florida A&M University
C. Perry Brown, Florida Agricultural and Mechanical University
This paper discusses a set of practical recommendations for optimizing the performance and scalability of your Hadoop system using SAS®. Topics include recommendations gleaned from actual deployments from a variety of implementations and distributions. Techniques cover tips for improving performance and working with complex Hadoop technologies such as Kerberos, techniques for improving efficiency when working with data, methods to better leverage the SAS in Hadoop components, and other recommendations. With this information, you can unlock the power of SAS in your Hadoop system.
Nancy Rausch, SAS
Wilbram Hazejager, SAS
All public schools in the United States require health and safety education for their students. Furthermore, almost all states require driver education before minors can obtain a driver's license. Through extensive analysis of the Fatality Analysis Reporting System data, we have concluded that from 2011-2013 an average of 12.1% of all individuals killed in a motor vehicle accident in the United States, District of Columbia, and Puerto Rico were minors (18 years or younger). Our goal is to offer insight from our analysis in order to improve road safety education and prevent future premature deaths involving motor vehicles.
Molly Funk, Bryant University
Max Karsok, Bryant University
Michelle Williams, Bryant University
The HPSUMMARY procedure provides data summarization tools to compute basic descriptive statistics for variables in a SAS® data set. It is a high-performance version of the SUMMARY procedure in Base SAS®. Though PROC SUMMARY is popular with data analysts, PROC HPSUMMARY is still a new kid on the block. The purpose of this paper is to provide an introduction to PROC HPSUMMARY by comparing it with its well-known counterpart, PROC SUMMARY. The comparison focuses on differences in syntax, options, and performance in terms of execution time and memory usage. Sample code, outputs, and SAS log snippets are provided to illustrate the discussion. Simulated large data sets are used to observe the performance of the two procedures. Using SAS® 9.4 installed on a single-user machine with four cores available, preliminary experiments that examine performance of the two procedures show that HPSUMMARY is more memory-efficient than SUMMARY when the data set is large (for example, SUMMARY failed due to insufficient memory, whereas HPSUMMARY finished successfully). However, there is no evidence of a performance advantage of PROC HPSUMMARY over PROC SUMMARY on this single-user machine.
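For orientation, the two procedures share essentially the same syntax; PROC HPSUMMARY adds a PERFORMANCE statement for controlling threading (the thread count below is arbitrary and the data set is hypothetical):

   /* Base SAS */
   proc summary data=work.big nway;
      class region;
      var sales;
      output out=work.stats_summary mean=avg_sales sum=tot_sales;
   run;

   /* High-performance counterpart */
   proc hpsummary data=work.big nway;
      performance nthreads=4 details;
      class region;
      var sales;
      output out=work.stats_hpsummary mean=avg_sales sum=tot_sales;
   run;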
Anh Kellermann, University of South Florida
Jeff Kromrey, University of South Florida
For marketers who are responsible for identifying the best customer to target in a campaign, it is often daunting to determine which media channel, offer, or campaign program is the one the customer is more apt to respond to, and therefore, is more likely to increase revenue. This presentation examines the components of designing campaigns to identify promotable segments of customers and to target the optimal customers using SAS® Marketing Automation integrated with SAS® Marketing Optimization.
Pamela Dixon, SAS
Specifying colors based on group values is a popular practice in visualizing data, but it is not so easy to do, especially when there are multiple group values. This paper explores three different methods to dynamically assign colors to plots based on their group values: combining the EVAL and IFN functions in the plot statements; bringing the DISCRETEATTRMAP block into the template and referencing it in the plot statements; and using the macro from SAS® Sample 40255.
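A sketch of the DISCRETEATTRMAP approach in the Graph Template Language, using the SASHELP.IRIS sample data (the color choices are arbitrary):

   proc template;
      define statgraph groupcolors;
         begingraph;
            discreteattrmap name="speciescolors";
               value "Setosa"     / markerattrs=(color=blue);
               value "Versicolor" / markerattrs=(color=green);
               value "Virginica"  / markerattrs=(color=red);
            enddiscreteattrmap;
            discreteattrvar attrvar=speciesmarkers var=species attrmap="speciescolors";
            layout overlay;
               scatterplot x=petallength y=petalwidth / group=speciesmarkers;
            endlayout;
         endgraph;
      end;
   run;

   proc sgrender data=sashelp.iris template=groupcolors;
   run;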
Amos Shu, MedImmune
This paper shows how SAS® can be used to obtain a Time Series Analysis of data regarding World War II. This analysis tests whether Truman's justification for the use of atomic weapons was valid. Truman believed that by using the atomic weapons, he would prevent unacceptable levels of U.S. casualties that would be incurred in the course of a conventional invasion of the Japanese islands.
Rachael Becker, University of Central Florida
Educational systems at the district, state, and national levels all report possessing amazing student-level longitudinal data systems (LDS). Are the LDS systems improving educational outcomes for students? Are they guiding development of effective instructional practices? Are the standardized exams measuring student knowledge relative to the learning expectations? Many questions exist about the effective use of the LDS system and educational data, but data architecture and analytics (including the products developed by SAS®) are not designed to answer any of these questions. However, the ability to develop more effective educational interfaces, improve use of data to the classroom level, and improve student outcomes, might only be available through use of SAS. The purpose of this session and paper is to demonstrate an integrated use of SAS tools to guide the transformation of data to analytics that improve educational outcomes for all students.
Sean Mulvenon, University of Arkansas
At the University of Central Florida (UCF), we recently invested in SAS® Visual Analytics, along with the updated SAS® Business Intelligence platform (from 9.2 to 9.4), a project that took over a year to be completed. This project was undertaken to give our users the best and most updated tools available. This paper introduces the SAS Visual Analytics environment at UCF and includes projects created using this product. It answers why we selected SAS Visual Analytics for development over other SAS® applications. It explains the technical environment for our non-distributed SAS Visual Analytics: RAM, servers, benchmarking, sizing, and scaling. It discusses why we chose the non-distributed mode versus distributed mode. Challenges in the design, implementation, usage, and performance are also presented, including the reasons why Hadoop was not adopted.
Scott Milbuta, University of Central Florida
Ulf Borjesson, University of Central Florida
Carlos Piemonti, University of Central Florida
The Statistical Graphics (SG) procedures and the Graph Template Language (GTL) are capable of generating powerful individual data displays. What if one wanted to visualize how a distribution changes with different parameters, or to view multiple aspects of a three-dimensional plot, to give just two examples? By using macros to generate a graph for each frame, combined with the ODS PRINTER destination, it is possible to create animated GIF files that make for effective animated data displays. This paper outlines the syntax and strategy necessary to generate these displays and provides a handful of examples. Intermediate knowledge of PROC SGPLOT, PROC TEMPLATE, and the SAS® macro language is assumed.
Jesse Pratt, Cincinnati Children's Hospital Medical Center
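A hedged sketch of the frame-by-frame technique follows; it animates a normal density whose mean shifts across frames. The animation-related system options shown (PRINTERPATH=GIF, ANIMATION=, ANIMDURATION=, ANIMLOOP=) are standard SAS 9.4 options, but the specific plot and file name are illustrative only.

options printerpath=gif animation=start animduration=0.5
        animloop=yes noanimoverlay;
ods printer file='shifting_normal.gif';

%macro frames;
   %do mu = 0 %to 4;
      data frame;
         do x = -10 to 10 by 0.1;
            density = pdf('normal', x, &mu, 2);
            output;
         end;
      run;
      proc sgplot data=frame;
         series x=x y=density;
         yaxis min=0 max=0.25;
      run;
   %end;
%mend frames;
%frames

options animation=stop;
ods printer close;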
Many SAS® procedures can be used to analyze large amounts of correlated data. This study was a secondary analysis of data obtained from the South Carolina Revenue and Fiscal Affairs Office (RFA). The data include medical claims from all health care systems in South Carolina (SC). This study used the SAS procedure GENMOD to analyze a large amount of correlated data about Military Health System (MHS) beneficiaries who received behavioral health care in South Carolina health care systems from 2005 to 2014. Behavioral health (BH) was defined by Major Diagnostic Categories (MDCs) 19 (mental disorders and diseases) and 20 (alcohol/drug use). MDCs are formed by dividing all possible principal diagnoses from the International Classification of Diseases (ICD-9) codes into 25 mutually exclusive diagnostic categories. The sample included a total of 6,783 BH visits and 4,827 unique adult and child patients, including military service members, veterans, and their adult and child dependents who have MHS insurance coverage. PROC GENMOD fit a multivariable GEE model with type of BH visit (mental health or substance abuse) as the dependent variable and with gender, race group, age group, and discharge year as predictors. Hospital ID was used in the REPEATED statement with different correlation structures. Gender was significant with both the independent correlation structure (p = .0001) and the exchangeable structure (p = .0003). However, age group was significant using the independent correlation structure (p = .0160) but non-significant using the exchangeable correlation structure (p = .0584). SAS is a powerful statistical program for analyzing large amounts of correlated data with categorical outcomes.
Abbas Tavakoli, University of South Carolina
Jordan Brittingham, USC/Arnold School of Public Health
Nikki R. Wooten, USC/College of Social Work
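A minimal sketch of a GEE model of the kind described in the abstract above is shown here; the data set BH_VISITS and all variable names are hypothetical, and the model would be rerun with TYPE=IND to compare correlation structures.

proc genmod data=bh_visits descending;
   class hospital_id gender race_grp age_grp dschrg_yr;
   model visit_type = gender race_grp age_grp dschrg_yr
         / dist=binomial link=logit;
   repeated subject=hospital_id / type=exch;   /* exchangeable working correlation */
run;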
Sensitive data requires elevated security, along with the flexibility to apply logic that subsets data based on user privileges. Following the instructions in SAS® Visual Analytics: Administration Guide gives you the ability to apply row-level permission conditions. After you have set the permissions, you have to be able to prove, through audits, who has access and which row-level security conditions apply. This paper shows you how to easily apply, validate, report on, and audit all tables that have row-level permissions, along with the groups, users, and conditions that are applied. Take the hours of maintenance and lack of visibility out of row-level secured data and build confidence in the data and analytics that are provided to the enterprise.
Brandon Kirk, SAS
For SAS® users, PROC TABULATE and PROC REPORT (and its compute blocks) are probably among the most common procedures for calculating and displaying data. It is, however, quite difficult to calculate and display changes from one column to another using data from other rows with just these two procedures. Compute blocks in PROC REPORT can calculate additional columns, but it would be challenging to pick up values from other rows as inputs. This presentation shows how PROC TABULATE can work with the LAG(n) function to calculate rates of change from one period of time to another, offering the flexibility of feeding data retrieved from other rows of the report into the calculations. PROC REPORT is then used to produce the desired output. The same approach can also be used in a variety of scenarios to produce customized reports.
Khoi To, Office of Planning and Decision Support, Virginia Commonwealth University
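Because LAG is a DATA step function, one hedged reading of the approach is to summarize into PROC TABULATE's OUT= data set, apply LAG in a short DATA step, and report with PROC REPORT. The data set ENROLLMENT and the variables TERM and STUDENTS are hypothetical, and the output variable name STUDENTS_SUM follows PROC TABULATE's naming convention and should be verified against the actual OUT= data set.

proc tabulate data=enrollment out=term_totals;
   class term;
   var students;
   table term, students*sum;
run;

proc sort data=term_totals;
   by term;
run;

data term_change;
   set term_totals;
   prior = lag(students_Sum);                     /* value from the previous row */
   if not missing(prior) then
      pct_change = (students_Sum - prior) / prior;
run;

proc report data=term_change nowd;
   columns term students_Sum pct_change;
   define term         / display 'Term';
   define students_Sum / display 'Enrollment';
   define pct_change   / display format=percent8.1 'Change vs Prior Term';
run;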
Stephen Curry, James Harden, and LeBron James are considered to be three of the most gifted professional basketball players in the National Basketball Association (NBA). Each year the Kia Most Valuable Player (MVP) award is given to the best player in the league. Stephen Curry currently holds this title, followed by James Harden and LeBron James, the first two runners-up. The decision for MVP was made by a panel of judges composed of 129 sportswriters and broadcasters, along with fans who were able to cast their votes through NBA.com. Did the judges make the correct decision? Is there statistical evidence that indicates that Stephen Curry is indeed deserving of this prestigious title over James Harden and LeBron James? Is there a significant difference between the two runners-up? These are some of the questions addressed in this project. Using data collected from NBA.com for the 2014-2015 season, a variety of parametric and nonparametric k-sample methods were used to test 20 quantitative variables. In an effort to determine which of the three players is the most deserving of the MVP title, post-hoc comparisons were also conducted on the variables that were shown to be significant. The time-dependent variables were standardized because there was a significant difference in the number of minutes each athlete played. These variables were then tested and compared with those that had not been standardized, which led to significantly different outcomes, indicating that the results of the tests could be misleading if the time variable is not taken into consideration. Using the standardized variables, the results of the analyses indicate that there is a statistically significant difference in the overall performances of the three athletes, with Stephen Curry outplaying the other two players. However, the difference between James Harden and LeBron James is not so clear.
Sherrie Rodriguez, Kennesaw State University
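As one hedged example of the nonparametric k-sample methods mentioned above, a Kruskal-Wallis comparison of a minutes-standardized statistic could be run with PROC NPAR1WAY; the data set NBA_GAMES and its variables are hypothetical.

data per36;
   set nba_games;
   pts_per36 = points / minutes * 36;   /* standardize scoring to 36 minutes played */
run;

proc npar1way data=per36 wilcoxon;
   class player;    /* Curry, Harden, James */
   var pts_per36;
run;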
The increasing popularity and affordability of wearable devices, together with their ability to provide granular physical activity data down to the minute, have enabled researchers to conduct advanced studies on the effects of physical activity on health and disease. This presents statistical programmers with the challenge of processing the data and translating it into analyzable measures. One such measure is the number of time-specific bouts of moderate to vigorous physical activity (MVPA) (similar to exercise), which is needed to determine whether the participant meets current physical activity guidelines (for example, 150 minutes of MVPA per week performed in bouts of at least 20 minutes). In this paper, we illustrate how we used SAS® arrays to calculate the number of 20-minute bouts of MVPA per day. We provide working code showing how we processed Fitbit Flex data from 63 healthy volunteers whose physical activities were monitored daily for a period of 12 months.
Faith Parsons, Columbia University Medical Center
Keith M Diaz, Columbia University Medical Center
Jacob E Julian, Columbia University Medical Center
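A minimal sketch of the array logic described above follows; it assumes one record per person-day with 1,440 minute-level MVPA indicator variables (MIN1-MIN1440), which are hypothetical names, and counts runs of at least 20 consecutive MVPA minutes.

data bouts;
   set fitbit_minutes;
   array mvpa{1440} min1-min1440;

   n_bouts = 0;
   run_len = 0;
   do i = 1 to 1440;
      if mvpa{i} = 1 then run_len + 1;
      else do;
         if run_len >= 20 then n_bouts + 1;   /* completed bout of 20+ minutes */
         run_len = 0;
      end;
   end;
   if run_len >= 20 then n_bouts + 1;         /* bout still in progress at midnight */

   drop i run_len;
run;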
The Coordination for the Improvement of Higher Education Personnel (CAPES) is a foundation within the Ministry of Education in Brazil whose central purpose is to coordinate efforts to promote high standards for postgraduate programs in the country. Structured in a SAS® data warehouse, vast amounts of information about the National Postgraduate System (SNPG) are collected and analyzed daily. This data must be accessible to different operational and managerial profiles, on desktops and mobile devices (in the latter case, using SAS® Mobile BI). Therefore, accurate and fresh data must be maintained so that it is possible to calculate statistics and indicators about programs, courses, teachers, students, and intellectual production. Using SAS programming within SAS® Enterprise Guide®, all statistical calculations are performed, and the results become available for exploration and presentation in SAS® Visual Analytics. Using the report design tool, an excellent user experience is created by integrating the reports into the Sucupira Platform, an online tool designed to provide greater data transparency for the academic community and the general public. This integration is made possible through the creation of public-access reports with automatic authentication of guest users, presented within iframes inside the Foundation's platform. The content of the reports is grouped by scope, which makes it possible to view the indicators in different forms of presentation, to apply filters (including from URL GET parameters), and to execute stored processes.
Leonardo de Lima Aguirre, Coordination for the Improvement of Higher Education Personnel
Sergio da Costa Cortes, Coordination for the Improvement of Higher Education Personnel
Marcus Vinicius de Olivera Palheta, CAPES
Multivariate statistical analysis plays an increasingly important role in educational research as the number of variables being measured increases. In both cognitive and noncognitive assessments, many instruments that researchers aim to study contain a large number of variables, with each measured variable assigned to a specific factor of the larger construct. Based on educational theory or prior empirical research, the factor structure of each instrument usually emerges in a similar way. Two types of factor analysis are widely used to understand the latent relationships among these variables, depending on the scenario. (1) Exploratory factor analysis (EFA), performed by using the SAS® procedure PROC FACTOR, is an advanced statistical method used to probe deeply into the relationship between the variables and the larger construct and then develop a customized model for the specific assessment. (2) Once a model is established, confirmatory factor analysis (CFA) is conducted by using the SAS procedure PROC CALIS to examine how well the model fits specific data and then adjust the model as needed. This paper presents the application of SAS to conduct these two types of factor analysis to fulfill various research purposes. Examples using real noncognitive assessment data are demonstrated, and the interpretation of the fit statistics is discussed.
Jun Xu, Educational Testing Service
Steven Holtzman, Educational Testing Service
Kevin Petway, Educational Testing Service
Lili Yao, Educational Testing Service
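A hedged sketch of the two analyses follows; the data set SURVEY, the items ITEM1-ITEM12, and the three-factor structure are hypothetical, not the authors' instrument.

/* Exploratory factor analysis */
proc factor data=survey method=ml rotate=promax nfactors=3 scree;
   var item1-item12;
run;

/* Confirmatory factor analysis of a hypothesized three-factor model */
proc calis data=survey;
   factor
      Engagement ===> item1-item4,
      Motivation ===> item5-item8,
      SelfEff    ===> item9-item12;
run;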
The objective of this study is to use the GLM procedure in SAS® to solve a complex linkage problem with multiple test forms in educational research. Typically, the ABSORB statement in the GLM procedure makes this task relatively easy to implement. Note that for educational assessments, applying one-dimensional combinations of two-parameter logistic (2PL) models (Hambleton, Swaminathan, and Rogers 1991, ch. 1) and generalized partial credit models (Muraki 1997) to a large-scale, high-stakes testing program with very frequent administrations requires a practical approach to linking test forms. Haberman (2009) suggested a pragmatic solution of simultaneous linking to solve this challenging linking problem, in which many separately calibrated test forms are linked by the use of least-squares methods. In SAS, the GLM procedure can be used to implement this algorithm by means of the ABSORB statement for the variable that specifies administrations, as long as the data are sorted by order of administration. This paper presents the use of SAS to examine the application of this proposed methodology to a simple case of real data.
Lili Yao, Educational Testing Service
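A minimal sketch of the absorption idea is given below; the data set CALIBRATIONS and the variables ADMIN, ITEM, and B_ESTIMATE are hypothetical. The key points are that the data are sorted by administration and that the ABSORB statement removes the administration effects before the item effects are estimated.

proc sort data=calibrations;
   by admin;
run;

proc glm data=calibrations;
   absorb admin;                    /* absorb administration (form) effects */
   class item;
   model b_estimate = item / solution;
run;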
In health care and epidemiological research, there is a growing need for basic graphical output that is clear, easy to interpret, and easy to create. SAS® 9.3 has a very clear and customizable graphic called a radar graph, yet it can display only the unique responses of one variable and is not useful for multiple binary variables. In this paper we describe a way to display multiple binary variables for a single population on a single radar graph. We then convert our method into a macro with as few parameters as possible to make this procedure available to everyday users.
Kevin Sundquist, Columbia University Medical Center
Jacob E Julian, Columbia University Medical Center
Faith Parsons, Columbia University Medical Center
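As a hedged sketch of the data preparation involved, the steps below convert several binary indicators into one long data set of proportions, one row per indicator, which can then be charted as the spokes of a single radar graph; the data set COHORT and the indicator names are hypothetical.

proc means data=cohort noprint;
   var diabetes htn copd asthma obesity;
   output out=pcts mean=;          /* mean of a 0/1 variable is the proportion */
run;

proc transpose data=pcts out=long(rename=(col1=pct)) name=condition;
   var diabetes htn copd asthma obesity;
run;

/* WORK.LONG now has one row per condition with its prevalence,
   ready to feed the radar graph procedure or the paper's macro. */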
The SURVEYSELECT procedure is useful for sample selection in a wide variety of applications. This paper presents the application of PROC SURVEYSELECT to a complex scenario involving a state-wide educational testing program, namely, drawing interdependent stratified cluster samples of schools for the field-testing of test questions, which we call items. These stand-alone field tests are given to only small portions of the testing population, so a stratified procedure is used to ensure the representativeness of the field-test samples. Because the field-test statistics for these items are evaluated for use in future operational tests, an efficient procedure is needed to sample schools while satisfying predefined sampling criteria and targets. This paper provides an adaptive sampling application and then generalizes the methodology, as much as possible, for potential use in other industries.
Chuanjie Liao, Pearson
Brian Patterson, Pearson
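A hedged sketch of a stratified school sample drawn with PROC SURVEYSELECT follows; the data set SCHOOLS, the strata REGION and SIZE_BAND, the size measure ENROLLMENT, and the per-stratum sample size are all hypothetical.

proc sort data=schools;
   by region size_band;
run;

proc surveyselect data=schools out=fieldtest_sample
                  method=pps sampsize=12 seed=20160315;
   strata region size_band;
   size enrollment;    /* selection probability proportional to enrollment */
run;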
The latest releases of SAS® Data Integration Studio, SAS® Data Management Studio and SAS® Data Integration Server, SAS® Data Governance, and SAS/ACCESS® software provide a comprehensive and integrated set of capabilities for collecting, transforming, and managing your data. The latest features in the product suite include capabilities for working with data from a wide variety of environments and types, including Hadoop, cloud, RDBMS, files, unstructured data, streaming, and others, and the ability to perform ETL and ELT transformations in diverse run-time environments, including SAS®, database systems, Hadoop, Spark, SAS® Analytics, cloud, and data virtualization environments. There are also new capabilities for lineage, impact analysis, clustering, and other data governance features, along with enhancements to master data management and support for metadata management. This paper provides an overview of the latest features of the SAS® Data Management product suite and includes use cases and examples for leveraging product capabilities.
Nancy Rausch, SAS
Each night on the news we hear the level of the Dow Jones Industrial Average along with the 'first difference,' which is today's price-weighted average minus yesterday's. It is that series of first differences that excites or depresses us each night, as it reflects whether stocks made or lost money that day. Furthermore, the differences form the data series that has the most addressable statistical features. In particular, the differences satisfy the stationarity requirement, which justifies standard distributional results such as asymptotically normal distributions of parameter estimates. Differencing arises in many practical time series because they seem to have what are called 'unit roots,' which mathematically indicate the need to take differences. In 1976, Dickey and Fuller developed the first well-known tests to decide whether differencing is needed. These tests are part of the ARIMA procedure in SAS/ETS®, in addition to many other time series analysis products. I'll review a little of what it was like to do the development and the required computing back then, say a little about why this is an important issue, and focus on examples.
David Dickey, NC State University
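As a brief, hedged illustration, the augmented Dickey-Fuller tests can be requested through the STATIONARITY= option of the IDENTIFY statement in PROC ARIMA, after which the series can be differenced if the tests fail to reject a unit root; the data set DJIA and the variable CLOSE are hypothetical.

proc arima data=djia;
   identify var=close stationarity=(adf=(0,1,2));   /* ADF tests at several lag orders */
   identify var=close(1);                           /* work with the first differences */
   estimate p=1 q=1;
run;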
For many organizations, the answer to whether to manage their data and analytics in a public or private cloud is going to be both. Both can be the answer for many different reasons: the common-sense logic of not replacing a system that already works just to incorporate something new; legal or corporate regulations that require some data, but not all data, to remain in place; and even a desire to provide local employees with a traditional data center experience while providing remote or international employees with cloud-based analytics easily managed through software deployed via Amazon Web Services (AWS). In this paper, we discuss some of the unique technical challenges of managing a hybrid environment, including how to monitor system performance simultaneously for two different systems that might not share the same infrastructure or even provide comparable system monitoring tools; how to manage authorization when access and permissions might be driven by two different security technologies that make implementation of a single protocol problematic; and how to ensure overall automation of two platforms that might be independently automated but were not originally designed to work together. We also share lessons learned from a decade of experience implementing hybrid cloud environments.
Ethan Merrill, SAS
Bryan Harkola, SAS
This paper shows how to create maps from the output of the SAS/STAT® SPP procedure using the same map layer as the data. The clipping can be done with PROC GINSIDE, and the final plot can be drawn with PROC GMAP. The result is a map that is nicer than the plot produced by PROC SPP.
Alan Silva, University of Brasilia
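A hedged sketch of the workflow follows; the point data set SPP_POINTS (which must contain X and Y in the map's coordinate system), the map data set MAPS_BR.MUNICIP, and the variables MUNI_ID and INTENSITY are hypothetical.

/* Tag each point with the polygon it falls inside */
proc ginside data=spp_points map=maps_br.municip out=points_tagged;
   id muni_id;
run;

/* Keep only points inside a polygon, then summarize per polygon */
data inside;
   set points_tagged;
   if not missing(muni_id);
run;

proc means data=inside noprint nway;
   class muni_id;
   var intensity;
   output out=muni_intensity mean=mean_intensity;
run;

proc gmap data=muni_intensity map=maps_br.municip;
   id muni_id;
   choro mean_intensity / levels=5;
run;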
Do you write reports that sometimes have missing categories across all class variables? Some programmers write all sorts of additional DATA step code in order to show the zeros for the missing rows or columns. Did you ever wonder whether there is an easier way to accomplish this? PROC MEANS and PROC TABULATE, in conjunction with PROC FORMAT, can handle this situation with a couple of powerful options. With PROC TABULATE, we can use the PRELOADFMT and PRINTMISS options in conjunction with a user-defined format in PROC FORMAT to accomplish this task. With PROC MEANS or PROC SUMMARY, we can use the COMPLETETYPES option to get all the rows with zeros. This paper uses examples from Census Bureau tabulations to illustrate the use of these procedures and options to preserve missing rows or columns.
Chris Boniface, U.S. Census Bureau
Janet Wysocki, U.S. Census Bureau
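A hedged sketch of the two options follows; the data set RESPONSES, the variable REGION, and the format $REGFMT. are hypothetical.

proc format;
   value $regfmt 'NE'='Northeast' 'MW'='Midwest' 'S'='South' 'W'='West';
run;

/* PROC TABULATE: PRELOADFMT on the CLASS statement plus PRINTMISS on the TABLE statement */
proc tabulate data=responses;
   class region / preloadfmt;
   format region $regfmt.;
   table region, n / printmiss misstext='0';
run;

/* PROC SUMMARY: COMPLETETYPES with PRELOADFMT keeps zero-count rows (_FREQ_=0) */
proc summary data=responses completetypes nway;
   class region / preloadfmt;
   format region $regfmt.;
   output out=region_counts(drop=_type_);
run;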