Technology plays an integral role in every aspect of daily life. As a result, educators should leverage technology-based learning to ensure that students are provided with authentic, engaging, and meaningful learning experiences (Pringle, Dawson, and Ritzhaupt, 2015). The significance and value of computer science understanding continue to increase. A major resource that can be credited with spreading support for computer science is the site Code.org, whose mission is to enable every student in every school to have the opportunity to learn computer science (https://code.org/about). Two years ago, our mentor partnered with Code.org to conduct workshops in the Charlotte, NC area to educate teachers on how to teach computer science activities and concepts in their classrooms. We had the opportunity to assist during the workshops by providing student perspectives and opinions. Looking back on the workshops, we wondered: how are the teachers who attended implementing the concepts they were taught? After each workshop, a survey was distributed to the attendees to gather feedback and to follow up. We collected the data from these surveys and analyzed it using SAS® University Edition. The survey results indicated that the workshops were beneficial and that the educators had implemented a concept that they learned. We believe that implementing computer science activities will benefit students across the curriculum.
Lauren Cook, University of North Carolina at Charlotte
Talazia Moore, North Carolina State University
Data from an extensive survey conducted by the National Center for Education Statistics (NCES) is used to predict qualified secondary school teachers across public schools in the U.S. The sample data includes socioeconomic data at the county level, which is used as a predictor for hiring a qualified teacher. The resultant model is used to score other regions and is presented on a heat map of the U.S. The survey family of procedures that SAS® offers, such as PROC SURVEYFREQ and PROC SURVEYLOGISTIC, is used in the analyses because the data involves replicate weights. Because all the outcomes (observed values) in a logistic regression are either 0 or 1, the residuals do not necessarily follow the normal distribution so often assumed in residual analysis. Furthermore, in dealing with survey data, the weights of the observations must be accounted for, as these affect the variance of the observations. To adjust for this, rather than looking at the difference between the observed and predicted values, the difference between the expected and actual counts is calculated using the weight on each observation and the predicted probability from the logistic model for that observation. Three types of residuals are analyzed: Pearson, deviance, and studentized residuals. The purpose is to identify which type of residual best satisfies the assumption of normality when investigating residuals from a logistic regression.
Bogdan Gadidov, Kennesaw State University
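A minimal sketch of the weighted-residual computation described above, assuming a jackknife design and hypothetical variable names (qualified, x1, x2, sampwt, and replicate weights repwt_1-repwt_80):

    proc surveylogistic data=nces_schools varmethod=jackknife;
       weight sampwt;                          /* full-sample survey weight */
       repweights repwt_1-repwt_80;            /* replicate weights supplied with the survey */
       model qualified(event='1') = x1 x2;
       output out=preds p=phat;                /* predicted probability per observation */
    run;

    data resids;
       set preds;
       expected = sampwt * phat;               /* expected weighted count */
       actual   = sampwt * qualified;          /* actual weighted count */
       pearson  = (actual - expected) / sqrt(sampwt * phat * (1 - phat));
       if qualified = 1 then deviance =  sqrt(-2 * sampwt * log(phat));
       else                  deviance = -sqrt(-2 * sampwt * log(1 - phat));
    run;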
Designing a survey is a meticulous process involving a number of steps and many complex choices. For most survey researchers, the choice between a probability and a non-probability sample is fairly simple. Throughout the sample design process, however, there are more complex choices, each of which can introduce bias into the survey estimates. For example, the sampling statistician must decide whether to stratify the frame and, if so, how many strata to use, whether to stratify explicitly, and how a stratum should be defined. The statistician must also decide whether to use clusters and, if so, how to define a cluster and what the ideal cluster size should be. The factors affecting these choices, along with the impact of different sample designs on survey estimates, are explored in this paper. The SURVEYSELECT procedure in SAS/STAT® 14.1 is used to select a number of samples based on different designs using data from Jamaica's 2011 Population and Housing Census. Census results are assumed to be equal to the true population parameter, and the estimates from each selected sample are evaluated against this parameter to assess the impact of different sample designs on point estimates. Design-adjusted survey estimates are computed using the SURVEYMEANS and SURVEYFREQ procedures in SAS/STAT 14.1. The resultant variances are evaluated to determine the sample design that yields the most precise estimates.
Leesha Delatie-Budair, Statistical Institute of Jamaica
Jessica Campbell, Statistical Institute of Jamaica
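A minimal sketch of two of the competing designs, assuming a hypothetical census frame with strata (parish), clusters (ed, enumeration districts), and a size measure (dwellings):

    proc surveyselect data=census_frame out=strat_sample
                      method=srs rate=0.05 seed=20111;
       strata parish;                 /* explicit stratification by parish */
    run;

    proc surveyselect data=census_frame out=clus_sample
                      method=pps sampsize=10 seed=20111;
       strata parish;                 /* stratified, clustered alternative */
       cluster ed;                    /* whole enumeration districts are selected */
       size dwellings;                /* selection proportional to dwelling count */
    run;

    proc surveymeans data=strat_sample mean clm;
       strata parish;
       weight samplingweight;         /* created by PROC SURVEYSELECT */
       var hh_income;                 /* hypothetical analysis variable */
    run;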
In the modern Internet age, A/B testing is a form of statistical hypothesis testing on two business options (A and B) to determine which is more effective. The challenge for startups or new-product businesses leveraging A/B testing is twofold: a small number of customers and a poor understanding of their responses. This paper shows you how to use the IML and POWER procedures to reassess sample size for adaptive, multi-stage business designs based on conditional power arguments, using the data observed at the previous business stage.
Bo Zhang, IBM
Liwei Wang, Pharmaceutical Product Development Inc
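A minimal sketch of an interim reassessment in this spirit, with all rates and counts hypothetical:

    proc power;
       twosamplefreq test=pchi
          groupproportions = (0.10 0.13)   /* conversion rates observed at stage 1 */
          alpha = 0.05
          power = 0.80
          npergroup = .;                   /* solve for the per-group sample size */
    run;

    proc iml;
       /* rough normal-approximation power assuming the observed trend holds */
       p1 = 0.10;  p2 = 0.13;              /* stage-1 conversion rates */
       n  = 400;                           /* customers per arm so far */
       m  = 600;                           /* additional customers per arm planned */
       se = sqrt(p1*(1-p1)/(n+m) + p2*(1-p2)/(n+m));
       z  = (p2 - p1) / se;
       condPower = 1 - cdf("Normal", quantile("Normal", 0.975) - z);
       print condPower;
    quit;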
SAS® has a very efficient and powerful way to compute distances between an event and a customer. Using the tables and code located at http://support.sas.com/rnd/datavisualization/mapsonline/html/geocode.html#street, you can assign latitude and longitude to the addresses that you have for your events and customers. Once you have downloaded the tables from SAS and run the code to load them into SAS data sets, this paper guides you through the rest using PROC GEOCODE and the GEODIST function. This can help you determine to whom to market an event, and you can see how far a client is from one of your facilities.
Jason O'Day, US Bank
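A minimal sketch of the two steps, assuming the street-level lookup data has been downloaded from the URL above into a library named LOOKUP (the lookup data set name is assumed) and that the event's coordinates are known:

    proc geocode method=street data=customers out=customers_geo
                 lookupstreet=lookup.usm;   /* street lookup from SAS Maps Online (name assumed) */
    run;

    data distances;
       set customers_geo;                   /* PROC GEOCODE adds X (longitude) and Y (latitude) */
       event_lat  =  35.2271;               /* hypothetical event location (Charlotte, NC) */
       event_long = -80.8431;
       miles_to_event = geodist(y, x, event_lat, event_long, 'DM');  /* degrees in, miles out */
    run;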
Pooling two or more cross-sectional survey data sets (that is, stacking the data sets on top of one another) is a strategy often used by researchers for one of two purposes: (1) to conduct significance tests more efficiently on point estimate changes observed over time, or (2) to increase the sample size in hopes of improving the precision of a point estimate. The latter purpose is especially common when making inferences about a subgroup, or domain, of the target population insufficiently represented by a single survey data set. Using data from the National Survey of Family Growth (NSFG), the aim of this paper is to walk through a series of practical estimation objectives that can be tackled by analyzing data from two or more pooled survey data sets. Where applicable, we comment on the resulting interpretive nuances.
Taylor Lewis, George Mason University
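A minimal sketch of pooling two NSFG files; the design variable names (sest, secu) follow NSFG conventions, while the weight and outcome names are hypothetical, and the weight adjustment shown is one common convention that should be verified against the survey's pooling guidance:

    data pooled;
       set nsfg_a (in=ina) nsfg_b;
       cycle = ifn(ina, 1, 2);       /* flag the source file */
       pooledwgt = wgt / 2;          /* divide by the number of pooled files so
                                        totals are not double-counted */
    run;

    proc surveymeans data=pooled mean clm;
       strata sest;                  /* design strata */
       cluster secu;                 /* sampling error computation units */
       weight pooledwgt;
       domain subgroup;              /* analyze a domain without subsetting the data */
       var outcome;                  /* hypothetical analysis variable */
    run;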
Confidence intervals are critical to understanding your survey data. If your intervals are too narrow, you might inadvertently judge a result to be statistically significant when it is not. While many familiar SAS® procedures, such as PROC MEANS and PROC REG, provide statistical tests, they rely on the assumption that the data comes from a simple random sample. However, almost no real-world survey uses such sampling. Learn how to use the SURVEYMEANS procedure and its SURVEY cousins to estimate confidence intervals and perform significance tests that account for the structure of the underlying survey, including the replicate weights now supplied by some statistical agencies. Learn how to extract the results you need from the flood of output that these procedures deliver.
David Vandenbroucke, U.S. Department of Housing and Urban Development
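A minimal sketch of a confidence interval computed with agency-supplied replicate weights; the weight names and the jackknife coefficient follow the American Community Survey convention (an assumption here), so check your own survey's documentation:

    proc surveymeans data=housing varmethod=jackknife mean clm;
       weight wgtp;                             /* full-sample weight */
       repweights wgtp1-wgtp80 / jkcoefs=0.05;  /* 80 replicates, coefficient 4/80 */
       var monthly_rent;                        /* hypothetical analysis variable */
       ods output statistics=rent_ci;           /* keep only the table you need */
    run;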
Are secondary schools in the United States hiring enough qualified math teachers? In which regions is there a disparity of qualified teachers? Data from an extensive survey conducted by the National Center for Education Statistics (NCES) was used to predict qualified secondary school teachers across public schools in the US. Three criteria were examined to determine whether a teacher is qualified to teach a given subject: (1) whether the teacher has a degree in the subject he or she is teaching; (2) whether he or she has a teaching certification in the subject; and (3) whether he or she has five years of experience in the subject. A qualified teacher is defined as one who meets all three criteria. The sample data included socioeconomic data at the county level, which was used as a set of predictors for hiring a qualified teacher. Data such as the number of students on free or reduced-price lunch at the school was used to classify schools as high-needs or low-needs. Other socioeconomic factors included the income and education levels of working adults within a given school district. Some of the results show that schools with higher-needs students (schools where more than 40% of students are on some form of reduced-price lunch program) have less-qualified teachers. The resultant model is used to score other regions and is presented on a heat map of the US. SAS® procedures such as PROC SURVEYFREQ and PROC SURVEYLOGISTIC are used.
Bogdan Gadidov, Kennesaw State University
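A minimal sketch of deriving the qualified flag and the high-needs classification, with all variable names hypothetical and the BRR variance method assumed:

    data teachers;
       set nces_survey;
       qualified  = (subject_degree = 1) and
                    (subject_cert   = 1) and
                    (subject_years >= 5);       /* all three criteria required */
       high_needs = (pct_reduced_lunch > 40);   /* the 40% reduced-lunch cutoff */
    run;

    proc surveyfreq data=teachers varmethod=brr;
       repweights repwt_1-repwt_88;             /* replicate weights (names assumed) */
       weight sampwt;
       tables high_needs * qualified / row chisq;
    run;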
The purpose of this paper is to present %SURVEYGENMOD, a SAS® macro developed with the SAS/IML® procedure as an upgrade of the %SURVEYGLM macro of Silva and Silva (2014) for handling complex survey designs in generalized linear models (GLMs). The new capabilities are the inclusion of the negative binomial distribution, the zero-inflated Poisson (ZIP) model, and the zero-inflated negative binomial (ZINB) model, as well as the ability to obtain estimates for domains. The R function svyglm (Lumley, 2004) and Stata software were used as benchmarks, and the results showed that the estimates generated by the %SURVEYGENMOD macro are close to those from the R function and Stata.
Alan Ricardo da Silva, University of Brasilia
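A hypothetical invocation to illustrate the kind of call the macro supports; the actual keyword parameters must be taken from the macro's documentation, as every name below is an assumption:

    %surveygenmod(data=hh_survey,      /* analysis data set (hypothetical)   */
                  y=doctor_visits,     /* count response                     */
                  x=age income,        /* predictors                         */
                  dist=negbin,         /* negative binomial family           */
                  weight=sampwt,       /* survey weight                      */
                  strata=region,       /* design strata                      */
                  cluster=psu,         /* primary sampling units             */
                  domain=urban);       /* request domain estimates           */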
For a freshman at a large university, life can be fun as well as stressful, and the choices a freshman makes while in college might impact his or her overall health. In order to examine the overall health and behaviors of students at Oklahoma State University, a survey was conducted among the freshman students. The survey focused on capturing the psychological, environmental, diet, exercise, and alcohol and drug use characteristics of students. A total of 795 out of 1,036 freshman students completed the survey, which included around 270 questions covering the range of issues mentioned above. An exploratory factor analysis identified 26 factors; for example, two factors that relate to the behavior of students under stress are eating and relaxing. Further understanding of the variables that contribute to alcohol and drug use might help the university in planning appropriate interventions and prevention programs. Factor analysis with Cronbach's alpha provided insight into a more defined set of variables to help address these types of issues. We used SAS® to perform the factor analysis, to create clusters of students with unique characteristics, and to profile these clusters.
Mohit Singhi, Oklahoma State University
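A minimal sketch of this workflow, with item and score names hypothetical:

    proc factor data=freshman_survey method=principal priors=smc
                rotate=varimax nfactors=26 out=fscores;
       var q1-q270;                   /* survey items */
    run;

    proc corr data=freshman_survey alpha nomiss;
       var eat1-eat5;                 /* items loading on the stress-eating factor */
    run;

    proc fastclus data=fscores maxclusters=5 out=clusters;
       var factor1-factor26;          /* factor scores written by PROC FACTOR */
    run;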
The advent of robust and thorough data collection has given rise to the term big data. With Census data becoming richer, more nationally representative, and more voluminous, we need methodologies designed to handle the manifold survey designs that Census data sets implement. PROC SURVEYLOGISTIC, introduced as an experimental procedure in SAS®9 and fully supported since SAS 9.1, addresses some of these design features, including clusters, strata, and replicate weights, and handles data that is not a straightforward random sample. Using Census data sets, this paper provides examples highlighting the appropriate use of survey weights to calculate various estimates, as well as the calculation and interpretation of odds ratios for interactions between categorical variables when predicting a binary outcome.
Richard Dirmyer, Rochester Institute of Technology
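A minimal sketch of a design-adjusted model with an interaction, with all data set, design, and variable names hypothetical:

    proc surveylogistic data=census_persons;
       strata strat;                  /* design strata */
       cluster psu;                   /* primary sampling units */
       weight perwt;                  /* person-level survey weight */
       class sex(ref='M') disab(ref='N') / param=ref;
       model employed(event='1') = sex disab sex*disab;
    run;
    /* exp() of the sex*disab estimate is a ratio of odds ratios: how the
       association between sex and employment differs across disability status */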
Hash tables are powerful tools when building an electronic code book, which often requires a lot of match-merging between SAS® data sets. In projects that span multiple years (e.g., longitudinal studies), thousands of new variables are usually introduced at the end of every year or at the end of each phase of the project. These variables usually have the same stem or core as the previous year's variables, differing only in a digit or two that signifies the year or phase of the project. So, every year, there is the extensive task of comparing thousands of new variables to older variables in order to carry forward key database elements corresponding to the previously defined variables. These elements can include the length of the variable, data type, format, discrete or continuous flag, and so on. In our SAS program, hash objects are used to cut down not only run time but also the number of DATA and PROC steps needed to accomplish the task, and the resulting clean, lean code is much easier to understand. A macro is used to create the data set containing new and older variables. For each new variable, the FIND method of the hash object is used in a loop to find the match to the most recent older variable. What used to take around a dozen PROC SQL steps is now a single DATA step using hash tables.
Raghav Adimulam, Westat
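A minimal sketch of the lookup step; data set and variable names are hypothetical, and the loop assumes the project year is a small integer:

    data matched;
       if 0 then set old_dictionary;            /* puts the code-book attributes in the PDV */
       if _n_ = 1 then do;
          declare hash old(dataset: 'old_dictionary');
          old.defineKey('stem', 'year');        /* variable core plus project year */
          old.defineData('varlen', 'vartype', 'varfmt');
          old.defineDone();
       end;
       set new_vars;                            /* this year's variables */
       rc = 1;
       /* search backward from the most recent prior year until a match is found */
       do prior_year = year - 1 to 1 by -1 while (rc ne 0);
          rc = old.find(key: stem, key: prior_year);
       end;
       if rc = 0;                               /* keep variables whose attributes were found */
    run;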
One of the research goals in public health is to estimate the burden of diseases on the US population. We describe the burden of disease by analyzing the statistical association of various diseases with hospitalizations, emergency department (ED) visits, ambulatory/outpatient (doctor's office) visits, and deaths. In this short paper, we discuss the use of large, nationally representative databases, such as those offered by the National Center for Health Statistics (NCHS) or the Agency for Healthcare Research and Quality (AHRQ), to produce reliable disease estimates for such studies. As an example, we use SAS® and SUDAAN to analyze the Nationwide Emergency Department Sample (NEDS), offered by AHRQ, to estimate ED visits for hand, foot, and mouth disease (HFMD) in children under five years old.
Jessica Rudd, Kennesaw State University
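A minimal sketch of the SAS side of such an estimate; the design variables (neds_stratum, hosp_ed, discwt) follow HCUP's NEDS documentation, while the data set name and the simplified, first-listed-diagnosis coding are assumptions to verify:

    data neds_flags;
       set neds_core;                           /* NEDS core file (name assumed) */
       hfmd_u5 = (dx1 = '0743' and age < 5);    /* ICD-9-CM 074.3, first-listed
                                                   diagnosis, children under five */
    run;

    proc surveymeans data=neds_flags sum std;
       strata neds_stratum;           /* HCUP sampling strata */
       cluster hosp_ed;               /* hospital-based EDs are the clusters */
       weight discwt;                 /* discharge weights yield national estimates */
       var hfmd_u5;                   /* weighted sum = estimated national ED visits */
    run;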