SAS Global Forum 2014 Proceedings

SAS^® Grid Computing is a scale-out SAS^® solution that enables SAS applications, which are often extremely I/O and compute intensive, to better utilize computing resources. It requires high-performance shared storage (SS) that allows all servers to access the same file systems. SS may be implemented via traditional NFS NAS or clustered file systems (CFS) like GPFS. This paper uses the Lustre* file system, a parallel, distributed CFS, for a case study of performance scalability of SAS Grid Computing nodes on SS. The paper qualifies the performance of a standardized SAS workload running on Lustre at scale. Lustre has traditionally been used for large and sequential I/O. We will record and present the tuning changes necessary to optimize Lustre for SAS applications. In addition, results from the scaling of SAS Cluster jobs running on Lustre will be presented.

One of the first lessons that SAS^® programmers learn on the job is that numeric and character variables do not play well together, and that type mismatches are one of the more common sources of errors in their otherwise flawless SAS programs. Luckily, converting variables from one type to another in SAS (that is, casting) is not difficult, requiring only the judicious use of either the input() or put() function. There remains, however, the danger of data being lost in the conversion process. This type of error is most likely to occur in cases of character-to-numeric variable conversion, especially when the user does not fully understand the data contained in the data set. This paper will review the basics of data storage for character and numeric variables in SAS, the use of formats and informats for conversions, and how to ensure accurate type conversion of even high-precision numeric values.
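The risk of losing data in a character-to-numeric conversion can be illustrated with a minimal sketch. It is written in Python rather than SAS, and it assumes only that the numeric type is an 8-byte floating-point value, which is how SAS stores numeric variables. The helper name survives_double is hypothetical and does not come from the paper.

```python
def survives_double(text):
    """Return True if a string of decimal digits keeps its exact value
    after being converted to an 8-byte floating-point number."""
    # float() plays the role of a character-to-numeric conversion;
    # int() on both sides lets us compare the values exactly.
    return int(float(text)) == int(text)


if __name__ == "__main__":
    # 2**53 is the point beyond which not every integer has an exact
    # double-precision representation.
    for digits in ("9007199254740992", "9007199254740993", "1234567890123456789"):
        status = "exact" if survives_double(digits) else "precision lost"
        print(digits, status)
```

A check of this kind, applied before conversion, flags identifiers or high-precision values that are safer to keep as character data.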

This paper shows users how they can use a SAS^® macro named %SURVEYGLM to incorporate information about survey design into Generalized Linear Models (GLM). The R function svyglm (Lumley, 2004) was used to verify the suitability of the %SURVEYGLM macro estimates. The results show that the macro's estimates are close to those of the R function and that new distributions can be easily added to the algorithm.

Influence analysis in statistical modeling looks for observations that unduly influence the fitted model. Cook's distance is a standard tool for influence analysis in regression. It works by measuring the difference in the fitted parameters as individual observations are deleted. You can apply the same idea to examining the influence of groups of observations (for example, the multiple observations for subjects in longitudinal or clustered data), but you need to adapt it to the fact that different subjects can have different numbers of observations. Such an adaptation is discussed by Zhu, Ibrahim, and Cho (2012), who generalize the subject size factor as the so-called degree of perturbation, and correspondingly generalize Cook's distance as the scaled Cook's distance. This paper presents the %SCDMixed SAS^® macro, which implements these ideas for analyzing influence in mixed models for longitudinal or clustered data. The macro calculates the degree of perturbation and scaled Cook's distance measures of Zhu et al. (2012) and presents the results with useful tabular and graphical summaries. The underlying theory is discussed, as well as some of the programming tricks useful for computing these influence measures efficiently. The macro is demonstrated using both simulated and real data to show how you can interpret its results for analyzing influence in your longitudinal modeling.

SAS^® and SAS^® Enterprise Miner^™ have provided advanced data mining and machine learning capabilities for years, beginning long before the current buzz. Moreover, SAS has continually incorporated advances in machine learning research into its classification, prediction, and segmentation procedures. SAS Enterprise Miner now includes many proven machine learning algorithms in its high-performance environment and is introducing new leading-edge scalable technologies. This paper provides an overview of machine learning and presents several supervised and unsupervised machine learning examples that use SAS Enterprise Miner. So, come back to the future to see machine learning in action with SAS!

Inference of variance components in linear mixed effect models (LMEs) is not always straightforward. I introduce and describe a flexible SAS^® macro (%COVTEST) that uses the likelihood ratio test (LRT) to test covariance parameters in LMEs by means of the parametric bootstrap. Users must supply the null and alternative models (as macro strings), and a data set name. The macro calculates the observed LRT statistic and then simulates data under the null model to obtain an empirical p-value. The macro also creates graphs of the distribution of the simulated LRT statistics. The program takes advantage of processing accomplished by PROC MIXED and some SAS/IML^® functions. I demonstrate the syntax and mechanics of the macro using three examples.

Your company's chronically overloaded SAS^® environment, adversely impacted user community, and the resultant lackluster productivity have finally convinced your upper management that it is time to upgrade to a SAS^® grid to eliminate all the resource problems once and for all. But after the contract is signed and implementation begins, you as the SAS administrator suddenly realize that your company-wide standard mode of SAS operations, that is, using the traditional SAS^® Display Manager on a server machine, runs counter to the expectation of the SAS grid: your users are now supposed to switch to SAS^® Enterprise Guide^® on a PC. This is utterly unacceptable to the user community because almost everything has to change in a big way. If you like to play a hero in your little world, this is your opportunity. There are a number of things you can do to make the transition to the SAS grid as smooth and painless as possible, and your users get to keep their favorite SAS Display Manager.

Do you need a statistic that is not computed by any SAS^® procedure? Reach for the SAS/IML^® language! Many statistics are naturally expressed in terms of matrices and vectors. For these, you need a matrix-vector language. This hands-on workshop introduces the SAS/IML language to experienced SAS programmers. The workshop focuses on statements that create and manipulate matrices, read and write data sets, and control the program flow. You will learn how to write user-defined functions, interact with other SAS procedures, and recognize efficient programming techniques. Programs are written using the SAS/IML^® Studio development environment. This course covers Chapters 2-4 of Statistical Programming with SAS/IML Software (Wicklin, 2010).

Predicting loss given default (LGD) is playing an increasingly crucial role in quantitative credit risk modeling. In this paper, we propose to apply mixed effects models, as well as other widely used LGD models, to predict the LGD of corporate bonds. The empirical results show that mixed effects models are able to explain the unobservable heterogeneity and to make better predictions compared with linear regression and fractional response regression. All the statistical models are fitted in SAS/STAT^®, SAS^® 9.2, specifically using PROC REG and PROC NLMIXED, and the model evaluation metrics are calculated in PROC IML. This paper gives a detailed description of how to use PROC NLMIXED to build and estimate generalized linear models and mixed effects models.

This paper considers the %MRE macro for estimating multivariate ratio estimates. Also, we use PROC REG to estimate multivariate regression estimates and to show that regression estimates are superior to the ratio estimates.

The linear logistic test model (LLTM), which incorporates cognitive task characteristics into the Rasch model, has been widely used for various purposes in educational contexts. However, the LLTM assumes that the variance of item difficulties is completely accounted for by cognitive attributes. To overcome this limitation of the LLTM, Janssen and colleagues (2004) proposed the crossed random-effects (CRE) LLTM, which adds an error term on item difficulty. This study examines the accuracy and precision of the CRE-LLTM in terms of parameter estimation for cognitive attributes. The effect of different factors (for example, sample size, population distributions, sparse or dense matrices, and test length) is examined. PROC GLIMMIX was used to do the analysis and SAS/IML^® software was used to generate data.

Predicting the news articles that customers are likely to view or read next provides a distinct advantage to news sites. Collaborative filtering is a widely used technique for making such predictions. This paper details an approach within collaborative filtering that uses the cosine similarity function to achieve this purpose. The paper further details two different approaches, customized targeting and article level targeting, that can be used in marketing campaigns. Please note that this presentation connects with Session ID 1887. Session ID 1887 happens immediately following this session.
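Collaborative filtering with cosine similarity can be sketched as follows. This is a minimal illustration in Python, not the paper's SAS implementation. It assumes binary view data (a user either viewed an article or did not) and an item-based scoring rule, and the names cosine_similarity and recommend_next are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a vector with no views is similar to nothing
    return dot / (norm_u * norm_v)


def recommend_next(views, user, top_n=1):
    """Rank unseen articles for `user`.

    `views` maps each user id to the set of article ids that user viewed.
    Each unseen article is scored by summing its cosine similarity to the
    articles the user has already viewed.
    """
    users = sorted(views)
    articles = sorted({a for seen in views.values() for a in seen})
    # One 0/1 vector per article, with one position per user.
    vectors = {a: [1 if a in views[u] else 0 for u in users] for a in articles}
    seen = views[user]
    scores = {
        candidate: sum(cosine_similarity(vectors[candidate], vectors[s]) for s in seen)
        for candidate in articles
        if candidate not in seen
    }
    # Highest score first; ties are broken by article id for determinism.
    ranked = sorted(scores, key=lambda a: (-scores[a], a))
    return ranked[:top_n]


if __name__ == "__main__":
    views = {
        "u1": {"a", "b"},
        "u2": {"a", "b", "c"},
        "u3": {"a"},
        "u4": {"d"},
    }
    print(recommend_next(views, "u3", top_n=2))
```

In this toy data, article b is viewed by the same users as article a more often than c or d, so it ranks first for a user who has only seen a.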

Big data is all the rage these days, with the proliferation of data-accumulating electronic gadgets and instrumentation. At the heart of big data analytics is the MapReduce programming model. As a framework for distributed computing, MapReduce uses a divide-and-conquer approach to allow large-scale parallel processing of massive data. As the name suggests, the model consists of a Map function, which first splits data into key-value pairs, and a Reduce function, which then carries out the final processing of the mapper outputs. It is not hard to see how these functions can be simulated with the SAS^® hash objects technique, and in reality, implemented in the new SAS^® DS2 language. This paper demonstrates how hash object programming can handle data in a MapReduce fashion and shows some potential applications in physics, chemistry, biology, and finance.
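The Map and Reduce steps described above can be sketched in a few lines. The example is in Python rather than SAS, with a dictionary standing in for the hash object. The word-count task and all function names are assumptions chosen for illustration and are not taken from the paper.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group values by key; the dictionary acts as the hash table."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer once per key to that key's list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}


def map_reduce(records, mapper, reducer):
    """Run the full pipeline in a single process."""
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)


def word_count_mapper(line):
    """Emit (word, 1) for every word in a line of text."""
    for word in line.lower().split():
        yield word, 1


def sum_reducer(key, values):
    """Total the counts collected for one word."""
    return sum(values)


if __name__ == "__main__":
    lines = ["a b a", "b c"]
    print(map_reduce(lines, word_count_mapper, sum_reducer))
```

This sketch runs in one process; in a distributed setting the records would be split across workers before the map step and the grouped keys distributed before the reduce step.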

One of the most striking features separating SAS^® from other statistical languages is that SAS has native SQL (Structured Query Language) capacity. In addition to the merging or the querying that a SAS user commonly applies in daily practice, SQL significantly enhances the power of SAS in descriptive statistics and data management. In this paper, we show reproducible examples to introduce 10 useful tips for the SQL procedure in the BASE module.

The independent means t-test is commonly used for testing the equality of two population means. However, this test is very sensitive to violations of the population normality and homogeneity of variance assumptions. In such situations, Yuen's (1974) trimmed t-test is recommended as a robust alternative. The purpose of this paper is to provide a SAS^® macro that allows easy computation of Yuen's symmetric trimmed t-test. The macro output includes a table with trimmed means for each of two groups, Winsorized variance estimates, degrees of freedom, and obtained value of t (with two-tailed p-value). In addition, the results of a simulation study are presented and provide empirical comparisons of the Type I error rates and statistical power of the independent samples t-test, Satterthwaite's approximate t-test, and the trimmed t-test when the assumptions of normality and homogeneity of variance are violated.
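The quantities reported by such a macro (trimmed means, Winsorized variances, degrees of freedom, and t) can be sketched using the standard textbook form of Yuen's statistic. The code below is a Python illustration, not the paper's SAS macro. The function names and the default 20% trimming proportion are assumptions, and the p-value is omitted because it requires a t-distribution function that the Python standard library does not provide.

```python
import math
from statistics import mean, variance


def trimmed_stats(sample, trim=0.2):
    """Return (trimmed mean, Winsorized variance, h, d) for one group.

    g = floor(trim * n) observations are trimmed from each tail and
    h = n - 2g observations remain.
    """
    x = sorted(sample)
    n = len(x)
    g = math.floor(trim * n)
    h = n - 2 * g
    if h < 2:
        raise ValueError("too few observations left after trimming")
    core = x[g:n - g]
    trimmed_mean = mean(core)
    # Winsorize: replace each trimmed value by the nearest retained value.
    winsorized = [core[0]] * g + core + [core[-1]] * g
    win_var = variance(winsorized)  # sample variance, n - 1 denominator
    d = (n - 1) * win_var / (h * (h - 1))  # squared standard error term
    return trimmed_mean, win_var, h, d


def yuen_t(sample1, sample2, trim=0.2):
    """Return (t, df) for Yuen's symmetric trimmed t-test."""
    m1, _, h1, d1 = trimmed_stats(sample1, trim)
    m2, _, h2, d2 = trimmed_stats(sample2, trim)
    t = (m1 - m2) / math.sqrt(d1 + d2)
    df = (d1 + d2) ** 2 / (d1 ** 2 / (h1 - 1) + d2 ** 2 / (h2 - 1))
    return t, df


if __name__ == "__main__":
    group1 = [1, 2, 3, 4, 5]
    group2 = [2, 4, 6, 8, 10]
    print(yuen_t(group1, group2, trim=0.2))
```

With trim=0 nothing is trimmed, and the statistic and degrees of freedom reduce to those of Satterthwaite's approximate t-test, which is a convenient sanity check.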

The new Markov chain Monte Carlo (MCMC) procedure introduced in SAS/STAT^® 9.2 and further exploited in SAS/STAT^® 9.3 enables Bayesian computations to run efficiently with SAS^®. The MCMC procedure allows one to carry out complex statistical modeling within Bayesian frameworks across a wide spectrum of scientific research; in psychometrics, for example, the estimation of item and ability parameters is one such application. This paper describes how to use PROC MCMC for Bayesian inference of item and ability parameters under a variety of popular item response models. This paper also covers how the results from SAS PROC MCMC are different from or similar to the results from WinBUGS. For those who are interested in the Bayesian approach to item response modeling, it is exciting and beneficial to shift to SAS, based on its flexibility in data management and its power in data analysis. Using the resulting item parameter estimates, one can continue on to test form construction, test equating, etc., with all these test development processes being accomplished with SAS!

Over the last year, the SAS^® Enterprise Miner^™ development team has made numerous and wide-ranging enhancements and improvements. New utility nodes that save data, integrate better with open-source software, and register models make your routine tasks easier. The area of time series data mining has three new nodes. There are also new models for Bayesian network classifiers, generalized linear models (GLMs), support vector machines (SVMs), and more.