Nowadays, most corporations build and maintain their own data warehouse, and an ETL (Extract, Transform, and Load) process plays a critical role in managing the data. Some people might create a large program and execute this program from top to bottom. Others might generate a SAS® driver with several programs included, and then execute this driver. If some programs can be run in parallel, then developers must write extra code to handle these concurrent processes. If one program fails, then users can either rerun the entire process or comment out the successful programs and resume the job from where the program failed. Usually the programs are deployed in production with read and execute permission only. Users do not have the privilege of modifying code on the fly. In this case, how do you comment out the programs if the job terminated abnormally? This paper illustrates an approach for managing ETL process flows. The approach uses a framework based on SAS, on a UNIX platform. This is a high-level infrastructure discussion with some explanation of the SAS code that is used to implement the framework. The framework supports the rerun or partial run of the entire process without changing any source code. It also supports concurrent processes, so no extra code is needed.
Kevin Chung, Fannie Mae
The current study looks at recent health trends and behavior analyses of youth in America. Data used in this analysis was provided by the Centers for Disease Control and Prevention and gathered using the Youth Risk Behavior Surveillance System (YRBSS). A factor analysis was performed to identify and define latent mental health and risk behavior variables. A series of logistic regression analyses was then performed using the risk behavior and demographic variables as potential contributing factors to each of the mental health variables. Mental health variables included disordered eating and depression/suicidal ideation data, while the risk behavior variables included smoking, consumption of alcohol and drugs, violence, vehicle safety, and sexual behavior data. Implications derived from the results of this research are a primary focus of this study. Risks and benefits of using a factor analysis with logistic regression in social science research will also be discussed in depth. Results included reporting differences between the years of 1991 and 2011. All results are discussed in relation to current youth health trend issues. Data was analyzed using SAS® 9.3.
Deanna Schreiber-Gregory, North Dakota State University
Have you ever wished that with one click you could copy any SAS® data set, including variable names, so that you could paste the text into a Microsoft Word file, Microsoft PowerPoint slide, or spreadsheet? You can and, with just Base SAS®, there are some little-known but easy-to-use methods that are available for automating many of your (or your users') common tasks.
Tom Abernathy, Pfizer, Inc.
Matthew Kastin, I-Behavior, Inc.
Arthur Tabachneck, myQNA, Inc.
Influence analysis in statistical modeling looks for observations that unduly influence the fitted model. Cook's distance is a standard tool for influence analysis in regression. It works by measuring the difference in the fitted parameters as individual observations are deleted. You can apply the same idea to examining influence of groups of observations (for example, the multiple observations for subjects in longitudinal or clustered data), but you need to adapt it to the fact that different subjects can have different numbers of observations. Such an adaptation is discussed by Zhu, Ibrahim, and Cho (2012), who generalize the subject size factor as the so-called degree of perturbation, and correspondingly generalize Cook's distances as the scaled Cook's distance. This paper presents the %SCDMixed SAS® macro, which implements these ideas for analyzing influence in mixed models for longitudinal or clustered data. The macro calculates the degree of perturbation and scaled Cook's distance measures of Zhu et al. (2012) and presents the results with useful tabular and graphical summaries. The underlying theory is discussed, as well as some of the programming tricks useful for computing these influence measures efficiently. The macro is demonstrated using both simulated and real data to show how you can interpret its results for analyzing influence in your longitudinal modeling.
Grant Schneider, The Ohio State University
Randy Tobias, SAS Institute
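The deletion idea behind Cook's distance, as described in the abstract above, can be illustrated with a minimal Python sketch. It covers only the classical single-observation case for a straight-line fit, not the scaled Cook's distance or the %SCDMixed macro; the function names `fit_line` and `cooks_distances` are hypothetical and chosen for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def cooks_distances(xs, ys):
    """Classical Cook's distance for each observation, computed by
    refitting the line with that observation deleted."""
    n, p = len(xs), 2  # p = number of fitted parameters (intercept, slope)
    a, b = fit_line(xs, ys)
    fitted = [a + b * x for x in xs]
    mse = sum((y - f) ** 2 for y, f in zip(ys, fitted)) / (n - p)
    distances = []
    for i in range(n):
        # Refit without observation i.
        a_i, b_i = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        # Squared shift in all n fitted values caused by the deletion.
        shift = sum((f - (a_i + b_i * x)) ** 2 for x, f in zip(xs, fitted))
        distances.append(shift / (p * mse))
    return distances


# Illustrative data: the last point has high leverage and is off-trend.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 10.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 2.0]
distances = cooks_distances(xs, ys)
most_influential = max(range(len(xs)), key=lambda i: distances[i])
```

Refitting once per deleted observation is the brute-force form of the idea; it is shown here because it mirrors the definition directly.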
SAS® functions provide amazing power to your DATA step programming. Some of these functions are essential; some of them save you writing volumes of unnecessary code. This paper covers some of the most useful SAS functions. Some of these functions might be new to you, and they will change the way you program and approach common programming tasks.
Ron Cody, Camp Verde Associates
Accelerated testing is an effective tool for predicting when systems fail, where the system can be as simple as an engine gasket or as complex as a magnetic resonance imaging (MRI) scanner. In particular, you might conduct an experiment to determine how factors such as temperature and voltage impose enough stress on the system to cause failure. Because system components usually meet nominal quality standards, it can take a long time to obtain failure data under normal-use conditions. An effective strategy is to accelerate the experiment by testing under abnormally stressful conditions, such as higher temperatures. Following that approach, you obtain data more quickly, and you can then analyze the data by using the RELIABILITY procedure in SAS/QC® software. The analysis is a three-step process: you establish a probability model, explore the relationship between stress and failure, and then extrapolate to normal-use conditions. Graphs are a key component of all three stages: you choose a model by comparing residual plots from candidate models, use graphs to examine the stress-failure relationship, and then use an appropriately scaled graph to extrapolate along a straight line. This paper guides you through the process, and it highlights features added to the RELIABILITY procedure in SAS/QC 13.1.
Bobby Gutierrez, SAS
Cluster (group) randomization in trials is increasingly used over patient-level randomization. There are many reasons for this, including more pragmatic trials associated with comparative effectiveness research. Examples of clusters that could be randomized for study are clinics or hospitals, counties within a state, and other geographical areas such as communities. In many of these trials, the number of clusters is relatively small. This can be a problem if there are important covariates at the cluster level that are not balanced across the intervention and control groups. For example, if we randomize eight counties, a simple randomization could put all counties with high socioeconomic status in one group or the other, leaving us without good comparison data. There are strategies to prevent an unlucky cluster randomization. These include matching, stratification, minimization and covariate-constrained randomization. Each method is discussed, and a county-level Health Economics example of covariate-constrained randomization is shown for intermediate SAS® users working with SAS® Foundation for Release 9.2 and SAS/STAT® on a Windows operating system.
Brenda Beaty, University of Colorado
L. Miriam Dickinson, University of Colorado
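A minimal Python sketch of covariate-constrained randomization, the last strategy named in the abstract above: enumerate every possible allocation, keep only those whose cluster-level covariate means are close enough across arms, and pick one of the acceptable allocations at random. The function name, the single-covariate balance rule, and the county data are assumptions made for illustration; they are not taken from the paper.

```python
import itertools
import random


def constrained_allocation(covariate, n_treat, max_diff, rng):
    """Randomly choose a treatment arm among allocations whose arms differ
    in mean covariate value by at most max_diff.

    covariate: dict mapping cluster name -> cluster-level covariate value
    Returns (chosen treatment clusters, number of acceptable allocations).
    """
    clusters = sorted(covariate)
    acceptable = []
    for treat in itertools.combinations(clusters, n_treat):
        control = [c for c in clusters if c not in treat]
        mean_treat = sum(covariate[c] for c in treat) / len(treat)
        mean_control = sum(covariate[c] for c in control) / len(control)
        if abs(mean_treat - mean_control) <= max_diff:
            acceptable.append(treat)
    if not acceptable:
        raise ValueError("no allocation satisfies the balance constraint")
    return rng.choice(acceptable), len(acceptable)


# Hypothetical example: eight counties, 1 = high socioeconomic status.
counties = {"A": 1, "B": 1, "C": 1, "D": 1, "E": 0, "F": 0, "G": 0, "H": 0}
treated, n_acceptable = constrained_allocation(
    counties, n_treat=4, max_diff=0.0, rng=random.Random(2014)
)
```

With `max_diff=0.0`, every acceptable allocation places two high-status counties in each arm, which rules out the unlucky split described in the abstract. Full enumeration is practical only because the number of clusters is small.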
Finding groups with similar attributes is at the core of knowledge discovery. To this end, Cluster Analysis automatically locates groups of similar observations. Despite successful applications, many practitioners are uncomfortable with the degree of automation in Cluster Analysis, which causes intuitive knowledge to be ignored. This is more true in text mining applications since individual words have meaning beyond the data set. Discovering groups with similar text is extremely insightful. However, blind applications of clustering algorithms ignore intuition and hence are unable to group similar text categories. The challenge is to integrate the power of clustering algorithms with the knowledge of experts. We demonstrate how SAS/STAT® 9.2 procedures and the SAS® Macro Language are used to ensemble the opinion of domain experts with multiple clustering models to arrive at a consensus. The method has been successfully applied to a large data set with structured attributes and unstructured opinions. The result is the ability to discover observations with similar attributes and opinions by capturing the wisdom of the crowds, whether man or model.
Masoud Charkhabi, Canadian Imperial Bank of Commerce
Ling Zhu, CIBC
Overdispersion (extra variation) arises in binomial, multinomial, or count data when variances are larger than those allowed by the binomial, multinomial, or Poisson model. This phenomenon is caused by clustering of the data, lack of independence, or both. As pointed out by McCullagh and Nelder (1989), "Overdispersion is not uncommon in practice. In fact, some would maintain that over-dispersion is the norm in practice and nominal dispersion the exception." Several approaches are found for handling overdispersed data, namely quasi-likelihood and likelihood models, generalized estimating equations, and generalized linear mixed models. Some classical likelihood models are presented. Among them are the beta-binomial, binomial cluster (a.k.a. random clumped binomial), negative-binomial, zero-inflated Poisson, zero-inflated negative-binomial, hurdle Poisson, and the hurdle negative-binomial. We focus on how these approaches or models can be implemented in a practical way using, when appropriate, the procedures GLIMMIX, GENMOD, FMM, COUNTREG, NLMIXED, and SURVEYLOGISTIC. Some real data set examples are discussed in order to illustrate these applications. We also provide some guidance on how to analyze generalized linear overdispersion mixed models and possible scenarios where we might encounter them.
Jorge Morel, Procter and Gamble
In randomized experiments, it is generally assumed that the hierarchical structures and variances are the same in the treatment and control groups. In some situations, however, these structures and variance components can differ. Consider a randomized experiment in which individuals randomized to the treatment condition are further assigned to clusters in which the intervention is administered, but no such clustering occurs in the control condition. Such a structure can occur, for example, when the individuals in the treatment condition are randomly assigned to group therapy sessions or to mathematics tutoring groups; individuals in the control condition do not receive group therapy or mathematics tutoring and therefore do not have that level of clustering. In this example, individuals in the treatment condition have a hierarchical structure, but individuals in the control condition do not. If the therapists or tutors differ in efficacy, the clustering in the treatment condition induces an extra source of variability in the data that needs to be accounted for in the analysis. We show how special features of SAS® PROC MIXED and PROC GLIMMIX can be used to analyze data in which one or more treatment groups have a hierarchical structure that differs from that in the control group. We also discuss how to code variables in order to increase the computational efficiency for estimating parameters from these designs.
Sharon Lohr, Westat
Peter Schochet, Mathematica Policy Research
SAS/STAT® 13.1 includes the new ICLIFETEST procedure, which is specifically designed for analyzing interval-censored data. This type of data is frequently found in studies where the event time of interest is known to have occurred not at a specific time but only within a certain time period. PROC ICLIFETEST performs nonparametric survival analysis of interval-censored data and is a counterpart to PROC LIFETEST, which handles right-censored data. With similar syntax, you use PROC ICLIFETEST to estimate the survival function and to compare the survival functions of different populations. This paper introduces you to the ICLIFETEST procedure and presents examples that illustrate how you can use it to perform analyses of interval-censored data.
Changbin Guo, SAS
Gordon Johnston, SAS
Ying So, SAS
Hierarchical data are common in many fields, from pharmaceuticals to agriculture to sociology. As data sizes and sources grow, information is likely to be observed on nested units at multiple levels, calling for the multilevel modeling approach. This paper describes how to use the GLIMMIX procedure in SAS/STAT® to analyze hierarchical data that have a wide variety of distributions. Examples are included to illustrate the flexibility that PROC GLIMMIX offers for modeling within-unit correlation, disentangling explanatory variables at different levels, and handling unbalanced data. Also discussed are enhanced weighting options, new in SAS/STAT 13.1, for both the MODEL and RANDOM statements. These weighting options enable PROC GLIMMIX to handle weights at different levels. PROC GLIMMIX uses a pseudolikelihood approach to estimate parameters, and it computes robust standard error estimators. This new feature is applied to an example of complex survey data that are collected from multistage sampling and have unequal sampling probabilities.
Min Zhu, SAS
A central component of discussions of healthcare reform in the U.S. is the estimation of healthcare cost and use at the national or state level, as well as for subpopulation analyses for individuals with certain demographic properties or medical conditions. For example, a striking but persistent observation is that just 1% of the U.S. population accounts for more than 20% of total healthcare costs, and 5% account for almost 50% of total costs. In addition to descriptions of specific data sources underlying this type of observation, we demonstrate how to use SAS® to generate these estimates and to extend the analysis in various ways; that is, to investigate costs for specific subpopulations. The goal is to provide SAS programmers and healthcare analysts with sufficient data-source background and analytic resources to independently conduct analyses on a wide variety of topics in healthcare research. For selected examples, such as the estimates above, we concretely show how to download the data from federal web sites, replicate published estimates, and extend the analysis. An added plus is that most of the data sources we describe are available as free downloads.
Paul Gorrell, IMPAQ International
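The kind of concentration estimate quoted in the abstract above (the share of total cost accounted for by the top 1% or 5% of spenders) reduces to a short calculation. The Python sketch below is an unweighted illustration on made-up numbers; the function name `top_share` is hypothetical, and estimates from real survey data would also need sampling weights, which this sketch ignores.

```python
def top_share(costs, top_fraction):
    """Share of total cost attributable to the highest-cost fraction of
    individuals (unweighted; at least one individual is always included)."""
    ordered = sorted(costs, reverse=True)
    k = max(1, round(len(ordered) * top_fraction))
    return sum(ordered[:k]) / sum(ordered)


# Made-up data: one very high spender among 100 individuals.
costs = [100.0] + [1.0] * 99
share_top_1_percent = top_share(costs, 0.01)
share_top_5_percent = top_share(costs, 0.05)
```

Restricting `costs` to a subpopulation before calling the function corresponds to the subpopulation extensions the abstract mentions.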
Sampling is widely used in different fields for quality control, population monitoring, and modeling. However, the purposes of sampling might be justified by the business scenario, such as legal or compliance needs. This paper uses one probability sampling method, stratified sampling, combined with the business cost of quality control review to determine an optimized sampling procedure that satisfies both statistical selection criteria and business needs. The first step is to determine the total number of strata by grouping the strata with a small number of sample units, using box-and-whisker plot outliers as a whole. Then, the cost to review the sample in each stratum is justified by a corresponding business counter-party, which is the human working hour. Lastly, using the determined number of strata and sample review cost, optimal allocation of the predetermined total sample population is applied to allocate the sample into different strata.
Yi Du, Freddie Mac
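The final allocation step described in the abstract above can be sketched in Python under one stated assumption: that "optimal allocation" means the textbook cost-weighted rule, in which a stratum's share of the fixed total sample is proportional to N_h * S_h / sqrt(c_h) (stratum size, standard deviation, and per-unit review cost). The abstract does not give the paper's exact formula, and the function name and stratum values below are hypothetical.

```python
import math


def optimal_allocation(total_n, strata):
    """Split a fixed total sample size across strata.

    strata: dict mapping stratum name -> (N_h, S_h, c_h), where
            N_h is the stratum size, S_h its standard deviation, and
            c_h the per-unit review cost (for example, working hours).
    Returns unrounded sample sizes proportional to N_h * S_h / sqrt(c_h).
    """
    weights = {
        name: size * sd / math.sqrt(cost)
        for name, (size, sd, cost) in strata.items()
    }
    total_weight = sum(weights.values())
    return {name: total_n * w / total_weight for name, w in weights.items()}


# Hypothetical strata: same size and spread, but B costs four times as
# much to review per unit, so it receives half the sample that A does.
allocation = optimal_allocation(
    300, {"A": (1000, 2.0, 1.0), "B": (1000, 2.0, 4.0)}
)
```

The sizes are left unrounded on purpose; how to round them while keeping the total fixed is a separate decision.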
For IT professionals, saving time is critical. Delivering timely and quality-looking reports and information to management, end users, and customers is essential. SAS® provides numerous 'canned' PROCedures for generating quick results to take care of these needs ... and more. In this hands-on workshop, attendees acquire basic insights into the power and flexibility offered by SAS PROCedures using PRINT, FORMS, and SQL to produce detail output; FREQ, MEANS, and UNIVARIATE to summarize and create tabular and statistical output; and data sets to manage data libraries. Additional topics include techniques for informing SAS which data set to use as input to a procedure, how to subset data using a WHERE statement (or WHERE= data set option), and how to perform BY-group processing.
Kirk Paul Lafler, Software Intelligence Corporation
There is an ever-increasing number of study designs and analyses of clinical trials using Bayesian frameworks to interpret treatment effects. Many research scientists prefer to understand the power and probability of taking a new drug forward across the whole range of possible true treatment effects, rather than focusing on one particular value to power the study. Examples are used in this paper to show how to compute Bayesian probabilities using the SAS/STAT® MIXED procedure and UNIVARIATE procedure. Particular emphasis is given to the application to efficacy analysis, including the comparison of new drugs to placebos and to standard drugs on the market.
Howard Liang, inVentiv health Clinical
As complicated as the macro language is to learn, there are very strong reasons for doing so. At its heart, the macro language is a code generator. In its simplest uses, it can substitute simple bits of code like variable names and the names of data sets that are to be analyzed. In more complex situations, it can be used to create entire statements and steps based on information that may even be unavailable to the person writing or even executing the macro. At the time of execution, it can be used to make queries of the SAS® environment as well as the operating system, and utilize the gathered information to make informed decisions about how it is to further function and execute.
Art Carpenter, California Occidental Consultants
Because the macro language is primarily a code generator, it makes sense that the code that it creates must be generated before it can be executed. This implies that execution of the macro language comes first. Simple as this is in concept, timing issues and conflicts are often not so simple to recognize in application. As we use the macro language to take on more complex tasks, it becomes even more critical that we have an understanding of these issues.
Art Carpenter, California Occidental Consultants
Macro variables and their values are stored in symbol tables, which in turn are held in memory. Not only are there a number of ways to create macro variables, but they can be created in a wide variety of situations. How they are created and under what circumstances affects the variable's scope: how and where the macro variable is stored and retrieved. There are a number of misconceptions about macro variable scope and about how the macro variables are assigned to symbol tables. These misconceptions can cause problems that the new, and sometimes even the experienced, macro programmer does not anticipate. Understanding the basic rules for macro variable assignment can help the macro programmer solve some of these problems that are otherwise quite mystifying.
Art Carpenter, California Occidental Consultants
This paper provides an overview of how to create a SAS® Enterprise Guide® process that is well designed, simple, documented, automated, modular, efficient, reliable, and easy to maintain. Topics include how to organize a SAS Enterprise Guide process, how to best document in SAS Enterprise Guide, when to leverage point-and-click functionality, and how to automate and simplify SAS Enterprise Guide processes. This paper has something for any SAS Enterprise Guide user, new or experienced!
Steven First, Systems Seminar Consultants
Jennifer First-Kluge, Systems Seminar Consultants
The emerging discipline of data governance encompasses data quality assurance, data access and use policy, security risks and privacy protection, and longitudinal management of an organization's data infrastructure. In the interests of forestalling another bureaucratic solution to data governance issues, this presentation features database programming tools that provide rapid access to big data and make selective access to and restructuring of metadata practical.
Sigurd Hermansen, Westat
Digital data has manifested into a classic BIG DATA challenge for marketers who want to push past the retroactive analysis limitations of traditional web analytics. The current groundswell of digital device adoption and variety of digital interactions grows larger year after year. The opportunity for 'digital intelligence' has arrived, as traditional web analytic techniques were not designed for the breadth of channels, devices, and pace that fuels consumer experiences. In parallel, today's landscape for data visualization, advanced analytics, and our ability to process very large amounts of multi-channel information is changing. The democratization of analytics for the masses is upon us, and marketers have the opportunity to take advantage of descriptive, predictive, and (most importantly) prescriptive data-driven insights. This presentation describes how organizations can use SAS® products, specifically SAS® Visual Analytics and SAS® Adaptive Customer Experience, to overcome the limitations of web analytics, and support data-driven integrated marketing objectives.
Suneel Grover, SAS
The Affordable Care Act (ACA) contains provisions that have stimulated interest in analytics among health care providers, especially those provisions that address quality of outcomes. High Impact Technologies (HIT) has been addressing these issues since before passage of the ACA and has a Health Care Data Model recognized by Gartner and implemented at several health care providers. Recently, HIT acquired SAS® Visual Analytics, and this paper reports our successful efforts to use SAS Visual Analytics for visually exploring Big Data for health care providers. Health care providers can suffer significant financial penalties for readmission rates above a certain threshold and other penalties related to quality of care. We have been able to use SAS Visual Analytics, coupled with our experience gained from implementing the HIT Healthcare Data Model at a number of Healthcare providers, to identify clinical measures that are significant predictors for readmission. As a result, we can help health care providers reduce the rate of 30-day readmissions.
Diane Hatcher, SAS
Joe Whitehurst, High Impact Technologies, Inc.
Inference of variance components in linear mixed effect models (LMEs) is not always straightforward. I introduce and describe a flexible SAS® macro (%COVTEST) that uses the likelihood ratio test (LRT) to test covariance parameters in LMEs by means of the parametric bootstrap. Users must supply the null and alternative models (as macro strings), and a data set name. The macro calculates the observed LRT statistic and then simulates data under the null model to obtain an empirical p-value. The macro also creates graphs of the distribution of the simulated LRT statistics. The program takes advantage of processing accomplished by PROC MIXED and some SAS/IML® functions. I demonstrate the syntax and mechanics of the macro using three examples.
Peter Ott, BC Ministry of Forests, Lands & NRO
The use of Cohen's kappa has enjoyed a growing popularity in the social sciences as a way of evaluating rater agreement on a categorical scale. The kappa statistic can be calculated as Cohen first proposed it in his 1960 paper or by using any one of a variety of weighting schemes. The most popular among these are the linear weighted kappa and the quadratic weighted kappa. Currently, SAS® users can produce the kappa statistic of their choice through PROC FREQ and the use of relevant AGREE options. Complications arise, however, when the data set does not contain a completely square cross-tabulation of data. That is, this method requires that both raters have at least one data point for every available category. There have been many solutions offered for this predicament. Most suggested solutions include the insertion of dummy records into the data and then assigning a weight of zero to those records through an additional class variable. The result is a multi-step macro, extraneous variable assignments, and potential data integrity issues. The author offers a much more elegant solution by producing a segment of code which uses brute force to calculate Cohen's kappa as well as all popular variants. The code uses nested PROC SQL statements to provide a single conceptual step which generates kappa statistics of all types, even those that the user wishes to define for themselves.
Matthew Duchnowski, Educational Testing Service (ETS)
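The direct calculation that the abstract above describes can be illustrated in Python rather than PROC SQL. The sketch below is not the author's code: it computes unweighted, linear weighted, and quadratic weighted kappa from a list of rating pairs, and it takes the full category list as an argument so that a category never used by one rater (a non-square cross-tabulation) needs no dummy records. The function name and arguments are hypothetical.

```python
def cohen_kappa(ratings, categories, weighting="none"):
    """Cohen's kappa for two raters.

    ratings:    list of (rater_a, rater_b) category pairs
    categories: ordered list of all possible categories (at least two),
                including any that a rater never used
    weighting:  "none", "linear", or "quadratic"
    """
    if weighting not in ("none", "linear", "quadratic"):
        raise ValueError("weighting must be 'none', 'linear', or 'quadratic'")
    k = len(categories)
    index = {category: i for i, category in enumerate(categories)}
    n = len(ratings)

    # Observed proportions in the full k-by-k table.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in ratings:
        observed[index[a]][index[b]] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def disagreement(i, j):
        # Weight is 0 on the diagonal and grows with category distance.
        if weighting == "none":
            return 0.0 if i == j else 1.0
        distance = abs(i - j) / (k - 1)
        return distance if weighting == "linear" else distance ** 2

    cells = [(i, j) for i in range(k) for j in range(k)]
    observed_disagreement = sum(disagreement(i, j) * observed[i][j] for i, j in cells)
    expected_disagreement = sum(disagreement(i, j) * row[i] * col[j] for i, j in cells)
    return 1.0 - observed_disagreement / expected_disagreement


# Illustrative ratings on a three-point scale where category 3 is never used.
pairs = [(1, 1)] * 20 + [(1, 2)] * 5 + [(2, 1)] * 10 + [(2, 2)] * 15
kappa_unweighted = cohen_kappa(pairs, [1, 2, 3])
kappa_quadratic = cohen_kappa(pairs, [1, 2, 3], weighting="quadratic")
```

A user-defined weighting scheme would amount to swapping in a different `disagreement` function.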
This paper demonstrates the new case-level residuals in the CALIS procedure and how they differ from classic residuals in structural equation modeling (SEM). Residual analysis has a long history in statistical modeling for finding unusual observations in the sample data. However, in SEM, case-level residuals are considerably more difficult to define because of 1) latent variables in the analysis and 2) the multivariate nature of these models. Historically, residual analysis in SEM has been confined to residuals obtained as the difference between the sample and model-implied covariance matrices. Enhancements to the CALIS procedure in SAS/STAT® 12.1 enable users to obtain case-level residuals as well. This enables a more complete residual and influence analysis. Several examples showing mean/covariance residuals and case-level residuals are presented.
Catherine Truxillo, SAS
Often in a clinical trial, measures are needed to describe pain, discomfort, or physical constraints that are visible but not measurable through lab tests or other vital signs. In these cases, researchers turn to questionnaires to provide documentation of improvement or statistically meaningful change in support of safety and efficacy hypotheses. For example, in studies (like Parkinson's studies) where pain or depression are serious non-motor symptoms of the disease, these questionnaires provide primary endpoints for analysis. Questionnaire data presents unique challenges in both collection and analysis in the world of CDISC standards. The questions are usually aggregated into scale scores, as the underlying questions by themselves provide little additional usefulness. SAS® is a powerful tool for extraction of the raw data from the collection databases and transposition of columns into a basic data structure in SDTM, which is vertical. The data is then processed further as per the instructions in the Statistical Analysis Plan (SAP). This involves translation of the originally collected values into sums, and the values of some questions need to be reversed. Missing values can be computed as means of the remaining questions. These scores are then saved as new rows in the ADaM (analysis-ready) data sets. This paper describes the types of questionnaires, how data collection takes place, the basic CDISC rules for storing raw data in SDTM, and how to create analysis data sets with derived records using ADaM standards, while maintaining traceability to the original question.
Karin LaPann, PRA International
Terek Peterson
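The scoring rules mentioned in the abstract above (summing items, reversing some questions, and filling a missing answer with the mean of the remaining questions) can be sketched in Python. The function name, the 1-to-5 response range, and the item names are assumptions for illustration; an actual Statistical Analysis Plan defines its own ranges and its own limits on how many answers may be missing.

```python
def scale_score(responses, reverse_items=(), min_value=1, max_value=5):
    """Total scale score for one questionnaire.

    responses:     dict mapping item name -> answered value, or None if missing
    reverse_items: items whose scale must be flipped before summing
    Missing items are replaced by the mean of the answered (already
    reversed) items. Returns None when nothing was answered.
    """
    answered = {}
    for item, value in responses.items():
        if value is None:
            continue
        if item in reverse_items:
            value = min_value + max_value - value  # flip, e.g. 2 -> 4 on 1..5
        answered[item] = value
    if not answered:
        return None
    subtotal = sum(answered.values())
    mean_answered = subtotal / len(answered)
    n_missing = len(responses) - len(answered)
    return subtotal + mean_answered * n_missing


# Hypothetical four-item scale: Q2 is reverse-worded and Q3 was skipped.
score = scale_score({"Q1": 4, "Q2": 2, "Q3": None, "Q4": 5}, reverse_items={"Q2"})
```

Keeping the item-level values in `answered` before summing is what allows a derived score to be traced back to the original questions.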
Usually, log files are checked by users only when SAS® completes the execution of programs. If SAS finds any errors in the current line, it skips the current step and executes the next line. The process completes only when the entire program has finished executing. Some programs take more than a day to complete. In this case, the user frequently opens the log file in read-only mode to check for errors, warnings, and unexpected notes, and terminates the execution of the program manually if any such messages are identified. Otherwise, the user is notified of the errors in the log file only at the end of the execution. Our suggestion is to run a parallel utility program along with the production program to check the log file of the currently running program and to notify the user through an e-mail when an error, warning, or unexpected note is found in the log file. Also, the execution can be terminated automatically and the user can be notified when such messages are identified.
Harun Rasheed, Cognizant Technology Solutions
Amarnath Vijayarangan, Genpact
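The log check at the core of the utility described in the abstract above can be sketched in Python; this is an illustration of the idea, not the authors' SAS program. It assumes the usual SAS log convention that messages start with `ERROR`, `WARNING`, or `NOTE:` at the beginning of a line, and the list of "unexpected note" phrases in `SUSPECT_NOTES` is a hypothetical starting point that each site would extend.

```python
import re

# Matches "ERROR:", "WARNING:", and numbered forms such as "ERROR 180-322:".
ISSUE = re.compile(r"^(ERROR|WARNING)(\s+\d+-\d+)?:")

# Illustrative phrases that make a NOTE worth flagging (an assumption).
SUSPECT_NOTES = ("uninitialized", "repeats of BY values")


def scan_log(lines):
    """Return (line_number, text) for every error, warning, or suspect note."""
    findings = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if ISSUE.match(text):
            findings.append((number, text))
        elif text.startswith("NOTE:") and any(p in text for p in SUSPECT_NOTES):
            findings.append((number, text))
    return findings


# Example log excerpt.
sample_log = [
    "NOTE: The data set WORK.A has 10 observations and 3 variables.",
    "WARNING: Variable X already exists on file WORK.A.",
    "ERROR 180-322: Statement is not valid or it is used out of proper order.",
    "NOTE: Variable y is uninitialized.",
]
problems = scan_log(sample_log)
```

A monitoring job would call `scan_log` on the growing log file at intervals and, when the result is non-empty, send the e-mail or stop the production run; those steps are site-specific and are left out of the sketch.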
Can clustering discharge records improve a predictive model's overall fit statistic? Do the predictors differ across segments? This talk describes the methods and results of data mining pediatric IBD patient records from my Analytics 2013 poster. SAS® Enterprise Miner™ 12.1 was used to segment patients and model important predictors for the length of hospital stay using discharge records from the national Kids' Inpatient Database. Profiling revealed that patient segments were differentiated by primary diagnosis, operating room procedure indicator, comorbidities, and factors related to admission and disposition of patient. Cluster analysis of patient discharges improved the overall average square error of predictive models and distinguished predictors that were unique to patient segments.
Goutam Chakraborty, Oklahoma State University
Linda Schumacher, Oklahoma State University
The graphical display of the individual data is important in understanding the raw data and the relationship between the variables in the data set. You can explore your data to ensure statistical assumptions hold by detecting and excluding outliers if they exist. Since you can visualize what actually happens to individual subjects, you can make your conclusions more convincing in statistical analysis and interpretation of the results. SAS® provides many tools for creating graphs of individual data. In some cases, multiple tools need to be combined to make a specific type of graph that you need. Examples are used in this paper to show how to create graphs of individual data using the SAS® ODS Graphics procedures (SG procedures).
Howard Liang, inVentiv health Clinical
Missing data commonly occur in medical, psychiatric, and social research. The SAS® MI and MIANALYZE procedures are often used to generate multiple imputations and then provide valid statistical inferences based on them. However, MIANALYZE cannot be used to combine type-III analyses obtained from multiply imputed data sets. In this manuscript, we write a macro to combine the type-III analyses generated from the SAS MIXED procedure based on multiple imputations. The proposed method can be extended to other procedures reporting type-III analyses, such as GENMOD and GLM.
Yixin Fang, New York University School of Medicine
Man Jin, Forest Research Institute
Binhuan Wang, New York University School of Medicine
The CDISC Study Data Tabulation Model (SDTM) provides a standardized structure and specification for a broad range of human and animal study data in pharmaceutical research, and is widely adopted in the industry for the submission of clinical trial data. Because SDTM requires additional variables and datasets that are not normally available in the clinical database, further programming is required to convert the clinical database into the SDTM datasets. This presentation introduces the concept and general requirements of SDTM, and the different approaches in the SDTM data conversion process. The author discusses database design considerations, implementation procedures, and SAS® macros that can be used to maximize the efficiency of the process. The creation of the metadata DEFINE.XML and the final SDTM dataset validation are also discussed.
Hong Chen, McDougall Scientific Ltd.
The first table in many research journal articles is a statistical comparison of demographic traits across study groups. It might not be exciting, but it's necessary. And although SAS® calculates these numbers with ease, it is a time-consuming chore to transfer these results into a journal-ready table. Introducing the time-saving deluxe %MAKETABLE SAS macro: it does the heavy work for you. It creates a Microsoft Word table of up to four comparative groups reporting t-tests, chi-square, ANOVA, or median test results, including a p-value. You specify only a one-line macro call for each line in the table, and the macro takes it from there. The result is a tidily formatted journal-ready Word table that you can easily include in a manuscript, report, or Microsoft PowerPoint presentation. For statisticians and researchers needing to summarize group comparisons in a table, this macro saves time and relieves you from the drudgery of trying to make your output neat and pretty. And after all, isn't that what we want computing to do for us?
Alan Elliott, Southern Methodist University
Merging or joining data sets is an integral part of the data consolidation process. Within SAS®, there are numerous methods and techniques that can be used to combine two or more data sets. We commonly think that within the DATA step the MERGE statement is the only way to join these data sets, while in fact, the MERGE is only one of numerous techniques available to us to perform this process. Each of these techniques has advantages, and some have disadvantages. The informed programmer needs to have a grasp of each of these techniques if the correct technique is to be applied. This paper covers basic merging concepts and options within the DATA step, as well as a number of techniques that go beyond the traditional MERGE statement. These include fuzzy merges, double SET statements, and the use of key indexing. The discussion will include the relative efficiencies of these techniques, especially when working with large data sets.
Art Carpenter, California Occidental Consultants
Cross-visit checks are a vital part of data cleaning for longitudinal studies. The nature of longitudinal studies encourages repeatedly collecting the same information. Sometimes, these variables are expected to remain static, go away, increase, or decrease over time. This presentation reviews the naive and the better approaches at handling one-variable and two-variable consistency checks. For a single-variable check, the better approach features the new ALLCOMB function, introduced in SAS® 9.2. For a two-variable check, the better approach uses the .first pseudo-class to flag inconsistencies. This presentation will provide you the tools to enhance your longitudinal data cleaning process.
Lauren Parlett, Johns Hopkins University
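The single-variable check described in the abstract above can be illustrated with a minimal sketch. It is written in Python rather than SAS and is not the author's code: the record layout, the variable names, and the use of `itertools.combinations` to enumerate visit pairs are all assumptions made for illustration. The idea is that a variable expected to stay static (here, a hypothetical `sex` field) is compared across every pair of visits for the same subject.

```python
from itertools import combinations

# Hypothetical longitudinal records: one row per subject per visit.
visits = [
    {"subject": "A", "visit": 1, "sex": "F"},
    {"subject": "A", "visit": 2, "sex": "F"},
    {"subject": "A", "visit": 3, "sex": "M"},
    {"subject": "B", "visit": 1, "sex": "M"},
    {"subject": "B", "visit": 2, "sex": "M"},
]

def static_variable_conflicts(records, key, variable):
    """Return (subject, earlier_visit, later_visit) for every pair of
    visits where a supposedly static variable disagrees."""
    # Group the rows by subject.
    by_subject = {}
    for record in records:
        by_subject.setdefault(record[key], []).append(record)

    conflicts = []
    for subject, rows in by_subject.items():
        rows = sorted(rows, key=lambda r: r["visit"])
        # Compare every pair of visits, not just adjacent ones.
        for earlier, later in combinations(rows, 2):
            if earlier[variable] != later[variable]:
                conflicts.append((subject, earlier["visit"], later["visit"]))
    return conflicts

conflicts = static_variable_conflicts(visits, "subject", "sex")
```

Comparing all pairs rather than only consecutive visits makes it easier to see which visit is the odd one out when a value changes and then changes back.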
Your company's chronically overloaded SAS® environment, adversely impacted user community, and the resultant lackluster productivity have finally convinced your upper management that it is time to upgrade to a SAS® grid to eliminate all the resource problems once and for all. But after the contract is signed and implementation begins, you as the SAS administrator suddenly realize that your company-wide standard mode of SAS operations, that is, using the traditional SAS® Display Manager on a server machine, runs counter to the expectation of the SAS grid: your users are now supposed to switch to SAS® Enterprise Guide® on a PC. This is utterly unacceptable to the user community because almost everything has to change in a big way. If you like to play a hero in your little world, this is your opportunity. There are a number of things you can do to make the transition to the SAS grid as smooth and painless as possible, and your users get to keep their favorite SAS Display Manager.
Houliang Li, HL SASBIPros Inc
This paper outlines the techniques that I have used with my clients over the last five years to build powerful applications that run from a web browser. The user interface is presented using HTML and JavaScript, which is generated by SAS® Stored Processes. A JavaScript framework called Ext JS is used to build components such as tables and graphs, which have a lot of functionality built in. A range of SAS® macros are used for building HTML and JavaScript, so the generation of the user interface is simplified. This technique has been used to create a medical monitoring system, the UK Census MIS, and a bank's risk management application. I also discuss some techniques involved with integrating a system like this with SAS® Portal, cubes, and web reports.
Philip Mason, Wood Street Consultants
Particle swarm optimization is a heuristic global optimization method that was introduced by James Kennedy and Russell C. Eberhart in 1995. The purpose of this paper is to develop code for particle swarm optimization in SAS® 9.2.
Rupesh Agarwal, Decision Quotient
Sangita Kumbharvadiya, Decision Quotient
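For readers unfamiliar with the method named in the abstract above, the following is a minimal sketch of particle swarm optimization. It is written in Python rather than SAS and is not the authors' implementation. It assumes the commonly used inertia-weight variant of the update rule, and the function name, parameter values, and the sphere test function are all illustrative choices.

```python
import random

def particle_swarm_minimize(objective, bounds, n_particles=30, n_iterations=200,
                            inertia=0.7, cognitive=1.5, social=1.5, seed=0):
    """Minimize `objective` over box `bounds` (a list of (low, high) pairs)."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    dim = len(bounds)

    # Random starting positions inside the box; zero starting velocities.
    positions = [[rng.uniform(lo, hi) for lo, hi in bounds]
                 for _ in range(n_particles)]
    velocities = [[0.0] * dim for _ in range(n_particles)]

    # Each particle remembers its own best point; the swarm shares one best.
    personal_best = [p[:] for p in positions]
    personal_best_value = [objective(p) for p in positions]
    best_index = min(range(n_particles), key=lambda i: personal_best_value[i])
    global_best = personal_best[best_index][:]
    global_best_value = personal_best_value[best_index]

    for _ in range(n_iterations):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity blends momentum, pull toward the particle's own
                # best, and pull toward the swarm's best.
                velocities[i][d] = (
                    inertia * velocities[i][d]
                    + cognitive * r1 * (personal_best[i][d] - positions[i][d])
                    + social * r2 * (global_best[d] - positions[i][d])
                )
                lo, hi = bounds[d]
                # Move, then clamp the coordinate back into the box.
                positions[i][d] = min(max(positions[i][d] + velocities[i][d], lo), hi)

            value = objective(positions[i])
            if value < personal_best_value[i]:
                personal_best[i] = positions[i][:]
                personal_best_value[i] = value
                if value < global_best_value:
                    global_best = positions[i][:]
                    global_best_value = value

    return global_best, global_best_value

def sphere(x):
    """Simple test function with its minimum of 0 at the origin."""
    return sum(v * v for v in x)

best_position, best_value = particle_swarm_minimize(
    sphere, [(-5.0, 5.0), (-5.0, 5.0)])
```

Because the method is a heuristic, it does not guarantee the global optimum; in practice the swarm size, iteration count, and the three coefficients are tuned to the problem.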
APP is an unofficial collective abbreviation for the SAS® functions ADDR, PEEK, PEEKC, the CALL POKE routine, and their so-called LONG 64-bit counterparts: the SAS tools designed to directly read from and write to physical memory in the DATA step. APP functions have long been a SAS dark horse. First, the examples of APP usage in SAS documentation amount to a few technical report tidbits intended for mainframe system programming, with nary a hint how the functions can be used for data management programming. Second, the documentation note on the CALL POKE routine is so intimidating in tone that many potentially receptive folks might decide to avoid the allegedly precarious route altogether. However, little can stand in the way of an inquisitive SAS programmer daring to take a close look, and it turns out that APP functions are very simple and useful tools! They can be used to explore how things really work, to make code more concise, to implement en masse data movement, and they can often dramatically improve execution efficiency. The author and many other SAS experts (notably Peter Crawford, Koen Vyverman, Richard DeVenezia, Toby Dunn, and the fellow masked by his 'Puddin' Man' sobriquet) have been poking around the SAS APP realm on SAS-L and in their own practices since 1998, occasionally letting the SAS community at large peek at their findings. This opus is an attempt to circumscribe the results in a systematic manner. Welcome to the APP world! You are in for a few glorious surprises.
Paul Dorfman, Dorfman Consulting
The implicit loop refers to the DATA step repetitively reading data and creating observations, one at a time. The explicit loop, which uses the iterative DO, DO WHILE, or DO UNTIL statements, is used to repetitively execute certain SAS® statements within each iteration of the DATA step execution. Explicit loops are often used to simulate data and to perform a certain computation repetitively. However, when an explicit loop is used along with array processing, the applications are extended widely, which includes transposing data, performing computations across variables, and so on. To be able to write a successful program that uses loops and arrays, one needs to know the contents in the program data vector (PDV) during the DATA step execution, which is the fundamental concept of DATA step programming. This workshop covers the basic concepts of the PDV, which is often ignored by novice programmers, and then illustrates how to use loops and arrays to transform lengthy code into more efficient programs.
Arthur Li, City of Hope
The availability of specialized programming and analysis resources in academic medical centers is often limited, creating a significant challenge for clinical research. The current work describes how Base SAS® and SAS® Enterprise Guide® are being used to empower research staff so that they are less reliant on these scarce resources.
Chris Schacherer, Clinical Data Management Systems, LLC
SAS® is an outstanding suite of software, but not everyone in the workplace speaks SAS. However, almost everyone speaks Excel. Often, the data you are analyzing, the data you are creating, and the report you are producing are a form of a Microsoft Excel spreadsheet. Every year at SAS® Global Forum, there are SAS and Excel presentations, not just because Excel is so pervasive in the workplace, but because there's always something new to learn (or re-learn)! This paper summarizes and references (and pays homage to!) previous SAS Global Forum presentations, as well as examines some of the latest Excel capabilities with the latest versions of SAS® 9.4 and SAS® Visual Analytics.
Andrew Howell, ANJ Solutions
Explore the various DATA step merge and PROC SQL join processes. This presentation examines the similarities and differences between merges and joins, and provides examples of effective coding techniques. Attendees examine the objectives and principles behind merges and joins, one-to-one merges (joins), and match-merge (equi-join), as well as the coding constructs associated with inner and outer merges (joins) and PROC SQL set operators.
Kirk Paul Lafler, Software Intelligence Corporation
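The distinction between inner and outer joins mentioned in the abstract above can be illustrated with a minimal sketch. It is written in Python rather than as a DATA step merge or PROC SQL join, and the table contents, column names, and the `join` helper are hypothetical. It follows SQL-style join semantics on a single key, which agree with a match-merge for the one-to-many case shown here.

```python
# Hypothetical tables: one row per patient, and zero or more visits each.
patients = [
    {"id": 1, "name": "Ann"},
    {"id": 2, "name": "Bob"},
    {"id": 3, "name": "Cy"},
]
visit_rows = [
    {"id": 1, "visit": "2014-01-05"},
    {"id": 1, "visit": "2014-02-10"},
    {"id": 3, "visit": "2014-03-01"},
]

def join(left, right, key, how="inner"):
    """Equi-join two lists of dicts on `key`; `how` is 'inner' or 'left'."""
    # Index the right-hand table by key so each lookup is a dict access.
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)

    result = []
    for row in left:
        matches = index.get(row[key], [])
        if matches:
            # One output row per matching right-hand row.
            for match in matches:
                result.append({**row, **match})
        elif how == "left":
            # Left outer join keeps unmatched left rows without right columns.
            result.append(dict(row))
    return result

inner = join(patients, visit_rows, "id")
left_outer = join(patients, visit_rows, "id", how="left")
```

The inner join drops the patient with no visits, while the left outer join keeps that patient with the visit column absent.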
The growing adoption of electronic systems for keeping medical records provides an opportunity for health care practitioners and biomedical researchers to access traditionally unstructured data in a new and exciting way. Pathology reports, progress notes, and many other sections of the patient record that are typically written in a narrative format can now be analyzed by employing natural language processing contextual extraction techniques to identify specific concepts contained within the text. Linking these concepts to a standardized nomenclature (for example, SNOMED CT, ICD-9, ICD-10, and so on) frees analysts to explore and test hypotheses using these observational data. Using SAS® software, we have developed a solution in order to extract data from the unstructured text found in medical pathology reports, link the extracted terms to biomedical ontologies, join the output with more structured patient data, and view the results in reports and graphical visualizations. At its foundation, this solution employs SAS® Enterprise Content Categorization to perform entity extraction using both manually and automatically generated concept definition rules. Concept definition rules are automatically created using technology developed by SAS, and the unstructured reports are scored using the DS2/SAS® Content Categorization API. Results are post-processed and added to tables compatible with SAS® Visual Analytics, thus enabling users to visualize and explore data as required. We illustrate the interrelated components of this solution with examples of appropriate use cases and describe manual validation of performance and reliability with metrics such as precision and recall. We also provide examples of reports and visualizations created with SAS Visual Analytics.
Eric Brinsfield, SAS
Greg Massey, SAS
Adrian Mattocks, SAS
Radhikha Myneni, SAS
Paper SAS2203-2014:
Getting Started with Mixed Models
This introductory presentation is intended for an audience new to mixed models who wants to get an overview of this useful class of models. Learn about mixed models as an extension of ordinary regression models, and see several examples of mixed models in social, agricultural, and pharmaceutical research.
Catherine Truxillo, SAS
It is not uncommon to find models with random components like location, clinic, teacher, etc., not just the single error term we think of in ordinary regression. This paper uses several examples to illustrate the underlying ideas. In addition, the response variable might be Poisson or binary rather than normal, thus taking us into the realm of generalized linear mixed models. These too will be illustrated with examples.
David Dickey, NC State University
Beginning with SAS® 9.2, ODS Graphics introduces a whole new way of generating graphs using SAS®. With just a few lines of code, you can create a wide variety of high-quality graphs. This paper covers the three basic ODS Graphics procedures: SGPLOT, SGPANEL, and SGSCATTER. SGPLOT produces single-celled graphs. SGPANEL produces multi-celled graphs that share common axes. SGSCATTER produces multi-celled graphs that might use different axes. This paper shows how to use each of these procedures in order to produce different types of graphs, how to send your graphs to different ODS destinations, how to access individual graphs, and how to specify properties of graphs, such as format, name, height, and width.
Lora Delwiche, University of California, Davis
Susan Slaughter, Avocet Solutions
SAS® High-Performance Analytics is a significant step forward in the area of high-speed, analytic processing in a scalable clustered environment. However, Big Data problems generally come with data from lots of data sources, at varying levels of maturity. Teradata's innovative Unified Data Architecture (UDA) represents a significant improvement in the way that large companies can think about Enterprise Data Management, including the Teradata Database, Hortonworks Hadoop, and Aster Data Discovery platform in a seamless integrated platform. Together, the two platforms provide business users, analysts, and data scientists with the ideally suited data management platforms, targeted specifically to their analytic needs, based upon analytic use cases, managed in a single integrated enterprise data management environment. The paper will focus on how several companies today are using Teradata's Integrated Hardware and Software UDA Platform to manage a single enterprise analytic environment, fight the ongoing proliferation of analytic data marts, and speed their operational analytic processes.
John Cunningham, Teradata Corporation
Controlled vocabularies define a common set of concepts that retain their meaning across contexts, supporting consistent use of terms to annotate, integrate, retrieve, and interpret information. Controlled vocabularies are large hierarchical structures that cannot be represented using typical SAS® practices (e.g., SAS format statements and hash objects). This paper compares and contrasts three models for representing hierarchical structures using SAS data sets: adjacency list, path enumeration, and nested set (Celko, 2004; Mackey, 2002). Specific controlled vocabularies include a university organizational structure and several biological vocabularies (MeSH, NCBI Taxonomy, and GO). The paper presents data models and SAS code for populating tables and performing queries. The paper concludes with a discussion of implications for data warehouse implementation and future work related to efficiency of update and delete operations.
Glenn Colby, University of Colorado Boulder
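The three hierarchy models named in the abstract above can be illustrated with a minimal sketch. It is written in Python rather than SAS and is not the author's code: the small organizational tree and all function names are hypothetical. It starts from an adjacency list and derives the path enumeration and nested set representations from it.

```python
# Adjacency list: each node stores only its parent (None marks the root).
parent_of = {
    "University": None,
    "Engineering": "University",
    "Arts": "University",
    "Civil": "Engineering",
    "Electrical": "Engineering",
}

def children_map(parent_of):
    """Invert the parent links into a parent -> list of children map."""
    children = {name: [] for name in parent_of}
    for name, parent in parent_of.items():
        if parent is not None:
            children[parent].append(name)
    return children

def path_enumeration(parent_of):
    """Path enumeration: each node stores its full path from the root."""
    paths = {}

    def path(name):
        if name not in paths:
            parent = parent_of[name]
            paths[name] = name if parent is None else path(parent) + "/" + name
        return paths[name]

    for name in parent_of:
        path(name)
    return paths

def nested_sets(parent_of):
    """Nested set: each node stores (left, right) numbers from a depth-first
    walk, so a node's descendants lie strictly inside its interval."""
    children = children_map(parent_of)
    root = next(n for n, p in parent_of.items() if p is None)
    bounds = {}
    counter = 0

    def visit(name):
        nonlocal counter
        counter += 1
        left = counter
        for child in children[name]:
            visit(child)
        counter += 1
        bounds[name] = (left, counter)

    visit(root)
    return bounds

def descendants(bounds, name):
    """All descendants of `name`, found by interval containment alone."""
    left, right = bounds[name]
    return sorted(n for n, (l, r) in bounds.items() if left < l and r < right)

paths = path_enumeration(parent_of)
bounds = nested_sets(parent_of)
```

The trade-off the sketch hints at is that the adjacency list is cheap to update but needs recursion to find a subtree, whereas the nested set answers subtree queries with a simple range comparison but must be renumbered when nodes are inserted or deleted.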
The use, limits, and misuse of statistical models in different industries are propelling new techniques and best practices in forecasting. Until recently, many factors such as data collection and storage constraints, poor data synchronization capabilities, technology limitations, and limited internal analytical expertise have made it impossible to forecast intermittent demand. In addition, integrating consumer demand data (that is, point-of-sale [POS]/syndicated scanner data from ACNielsen/Information Resources Inc. [IRI]/Intercontinental Marketing Services [IMS]) to shipment forecasts was a challenge. This presentation gives practical how-to advice on intermittent forecasting and outlines a framework, using multi-tiered causal analysis (MTCA), that links demand to supply. The framework uses a process of nesting causal models together by using data and analytics.
Edward Katz, SAS
This session introduces frailty models and their use in biostatistics to model time-to-event or survival data. The session uses examples to review situations in which a frailty model is a reasonable modeling option, to describe which SAS® procedures can be used to fit frailty models, and to discuss the advantages and disadvantages of frailty models compared to other modeling options.
John Amrhein, McDougall Scientific Ltd.
Report automation and scheduling are very hot topics in many industries. They confer many advantages including reduced workload, elimination of repetitive tasks, generation of accurate results, and better performance. This paper illustrates how to design an appropriate program to automate and schedule reports in SAS® 9.1 and SAS® Enterprise Guide® 5.1 using a SAS® server as well as the Windows Scheduler. The automation part includes good aspects of formatting Microsoft Excel tables using XML or VBA coding or any other formats, and conditional auto e-mailing with file attachments. We systematically walk through each step with a clear flow diagram from the data source to the final destination. We also discuss details of server-side and PC-side schedulers and how these schedulers involve invoking batch programs.
Anjan Matlapudi, AmerihealthCaritas
The soaring number of publicly available data sets across disciplines has allowed for increased access to real-life data for use in both research and educational settings. These data often leverage cost-effective complex sampling designs including stratification and clustering, which allow for increased efficiency in survey data collection and analyses. Weighting becomes a necessary component in these survey data in order to properly calculate variance estimates and arrive at sound inferences through statistical analysis. Generally speaking, these weights are included with the variables provided in the public use data, though an explanation for how and when to use these weights is often lacking. This paper presents an analysis using the California Health Interview Survey to compare weighted and non-weighted results using SAS® PROC LOGISTIC and PROC SURVEYLOGISTIC.
Besa Smith, Analydata
Tyler Smith, National University
Effective graphs are indispensable for modern statistical analysis. They reveal tendencies that are not readily apparent in simple tables and add visual clarity to reports. My client is a big graph fan; he always shows me a lot of high-quality and complex sample graphs that were created by other software and asks me, "Can SAS® duplicate these outputs?" Often, by leveraging the capabilities of the ODS Graph Template Language and the SGRENDER procedure, the answer is "Yes." Graph Template Language offers SAS users a more direct approach to customize the output and to overlay graphs in different levels. This paper uses cases drawn from a real work situation to demonstrate how to get the seemingly unattainable results with the power of Graph Template Language: utilizing bubble plots as your distribution density bars; creating refreshing-looking linear regression graphics with the slope information in the legend; and overlaying different plots together to create sophisticated analytical bottleneck test output.
Wen Song, ICF International
Ge Wu, Johns Hopkins University
How do you compare group responses when the data are unbalanced or when covariates come into play? Simple averages will not do, but LS-means are just the ticket. Central to postfitting analysis in SAS/STAT® linear modeling procedures, LS-means generalize the simple average for unbalanced data and complicated models. They play a key role both in standard treatment comparisons and Type III tests and in newer techniques such as sliced interaction effects and diffograms. This paper reviews the definition of LS-means, focusing on their interpretation as predicted population marginal means, and it illustrates their broad range of use with numerous examples.
Weijie Cai, SAS
One of the most common questions about logistic regression is "How do I know if my model fits the data?" There are many approaches to answering this question, but they generally fall into two categories: measures of predictive power (like R-squared) and goodness of fit tests (like the Pearson chi-square). This presentation looks first at R-squared measures, arguing that the optional R-squares reported by PROC LOGISTIC might not be optimal. Measures proposed by McFadden and Tjur appear to be more attractive. As for goodness of fit, the popular Hosmer and Lemeshow test is shown to have some serious problems. Several alternatives are considered.
Paul Allison, University of Pennsylvania
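The two R-squared measures named in the abstract above have simple closed forms, sketched here in Python rather than SAS. This is not the presenter's code, and the outcome and predicted-probability values are made up for illustration. McFadden's measure is one minus the ratio of the fitted model's log-likelihood to that of an intercept-only model; Tjur's measure is the mean predicted probability among events minus the mean among non-events.

```python
import math

def log_likelihood(y, p):
    """Bernoulli log-likelihood of 0/1 outcomes y given probabilities p."""
    return sum(math.log(pi) if yi == 1 else math.log(1.0 - pi)
               for yi, pi in zip(y, p))

def mcfadden_r2(y, p):
    """1 - lnL(model) / lnL(null), where the null model predicts the
    overall event rate for every observation."""
    base_rate = sum(y) / len(y)
    ll_null = log_likelihood(y, [base_rate] * len(y))
    return 1.0 - log_likelihood(y, p) / ll_null

def tjur_r2(y, p):
    """Mean predicted probability for events minus that for non-events."""
    events = [pi for yi, pi in zip(y, p) if yi == 1]
    nonevents = [pi for yi, pi in zip(y, p) if yi == 0]
    return sum(events) / len(events) - sum(nonevents) / len(nonevents)

# Hypothetical observed outcomes and fitted probabilities.
outcomes = [1, 1, 0, 0]
predicted = [0.8, 0.6, 0.4, 0.2]

mcfadden = mcfadden_r2(outcomes, predicted)
tjur = tjur_r2(outcomes, predicted)
```

Both functions take fitted probabilities as input, so they can be applied to the output of any logistic regression routine.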
In applied statistical practice, incomplete measurement sequences are the rule rather than the exception. Fortunately, in a large variety of settings, the stochastic mechanism governing the incompleteness can be ignored without hampering inferences about the measurement process. While ignorability only requires the relatively general missing at random assumption for likelihood and Bayesian inferences, this result cannot be invoked when non-likelihood methods are used. We will first sketch the framework used for contemporary missing-data analysis. Apart from revisiting some of the simpler but problematic methods, attention will be paid to direct likelihood and multiple imputation. Because popular non-likelihood-based methods do not enjoy the ignorability property in the same circumstances as likelihood and Bayesian inferences, weighted versions have been proposed. This holds true in particular for generalized estimating equations (GEE). Even so-called doubly-robust versions have been derived. Apart from GEE, pseudo-likelihood-based strategies can also be adapted appropriately. We describe a suite of corrections to the standard form of pseudo-likelihood, to ensure its validity under missingness at random. Our corrections follow both single and double robustness ideas, and are relatively simple to apply.
Geert Molenberghs, Universiteit Hasselt & KU Leuven
Bootstrapped Decision Tree is a variable selection method used to identify and eliminate unintelligent variables from a large number of initial candidate variables. Candidates for subsequent modeling are identified by selecting variables consistently appearing at the top of decision trees created using a random sample of all possible modeling variables. The technique is best used to reduce hundreds of potential fields to a short list of 30-50 fields to be used in developing a model. This method for variable selection has recently become available in JMP® under the name BootstrapForest; this paper presents an implementation in Base SAS® 9. The method does accept but does not require a specific outcome to be modeled and will therefore work for nearly any type of model, including segmentation, MCMC, and multiple discrete choice, in addition to standard logistic regression. Keywords: Bootstrapped Decision Tree, Variable Selection
David Corliss, Magnify Analytic Solutions
Graphs in oncology studies are essential for getting more insight about the clinical data. This presentation demonstrates how ODS Graphics can be effectively and easily used to create graphs used in oncology studies. We discuss some examples and illustrate how to create plots like drug concentration versus time plots, waterfall charts, comparative survival plots, and other graphs using Graph Template Language and ODS Graphics procedures. These can be easily incorporated into a clinical report.
Debpriya Sarkar, SAS
The SQL procedure contains many powerful and elegant language features for intermediate and advanced SQL users. This presentation discusses topics that will help SAS® users unlock the many powerful features, options, and other gems found in the SQL universe. Topics include CASE logic; a sampling of summary (statistical) functions; dictionary tables; PROC SQL and the SAS macro language interface; joins and join algorithms; PROC SQL statement options _METHOD, MAGIC=101, MAGIC=102, and MAGIC=103; and key performance (optimization) issues.
Kirk Paul Lafler, Software Intelligence Corporation
Over the years, there has been a growing concern about consumption of tobacco among youth. But no concrete studies have been done to find what exactly leads children to start consuming tobacco. This study is an attempt to figure out the potential reasons for the same. Through our analysis, we have also tried to build a model to predict whether a child would smoke next year or not. This study is based on the 2011 National Youth Tobacco Survey data of 18,867 observations. In order to prepare data for insightful analysis, imputation operations were performed on the data using tree-based imputation methods. From a pool of 197 variables, 48 key variables were selected using variable selection methods, partial least squares, and decision tree models. Logistic Regression and Decision Tree models were built to predict whether a child would smoke in the next year or not. Comparing the models using Misclassification rate as the selection criterion, we found that the Stepwise Logistic Regression Model outperformed other models with a Validation Misclassification of 0.028497, 47.19% Sensitivity and 95.80% Specificity. Factors such as company of friends, cigarette brand ads, accessibility to the tobacco products, and passive smoking turned out to be the most important predictors in determining a child smoker. After this study, we could outline some important findings, such as that the odds of a child taking up smoking are 2.17 times as high when his close friends are also smoking.
Goutam Chakraborty, Oklahoma State University
Jin Ho Jung, Oklahoma State University
Gaurav Pathak, Oklahoma State University
The steady expansion of electronic health records (EHR) over the past decade has increased the use of observational healthcare data for analysis. One of the challenges with EHR data is to combine information from different domains (diagnosis, procedures, drugs, adverse events, labs, quality of life scores, and so on) onto a single timeline to get a longitudinal view of the patient. This enables the physician or researcher to visualize a patient's health profile, thereby revealing anomalies, trends, and responses graphically, thus empowering them to treat more effectively. This paper attempts to provide a composite view of a patient by using SAS® Graph Template Language to create a profile graph using the following data elements: key event dates, drugs, adverse events, and Quality of Life (QoL) scores. For visualization, the GTL graph uses X and X2 axes for dates, vertical reference lines to represent key dates (for example, when the disease is first diagnosed), horizontal bar plot for duration of drugs taken and adverse events reported, and a series plot at the bottom to show the QoL score.
Eric Brinsfield, SAS
Radhikha Myneni, SAS
Many SAS® procedures use classification variables when they are processing the data. These variables control how the procedure forms groupings, summarizations, and analysis elements. For statistics procedures, they are often used in the formation of the statistical model that is being analyzed. Classification variables can be explicitly specified with a CLASS statement, or they can be specified implicitly from their usage in the procedure. Because classification variables have such a heavy influence on the outcome of so many procedures, it is essential that the analyst have a good understanding of how classification variables are applied. Certainly there are a number of options (system and procedural) that affect how classification variables behave. While you may be aware of some of these options, a great many are new, and some of these new options and techniques are especially powerful. You really need to be open to learning how to program with CLASS.
Art Carpenter, California Occidental Consultants
Can you juggle? Maybe. Can you shuffle a deck of cards? Probably. Can you do both at the same time? Welcome to the world of SAS® and LSF! Very few SAS Administrators start out learning LSF at the same time they learn SAS; most already know SAS, possibly starting out as a programmer or analyst, but now have to step up to an enterprise platform with shared resources. The biggest challenge on an enterprise platform? How to share! How to maximize the utilization of a SAS platform, yet still ensure everyone gets their fair share? This presentation will boil down the 2000+ pages of LSF documentation to provide an introduction into various LSF concepts (Host, Clusters, Nodes, Queues, First-Come-First-Serve, Fairshare) and various configuration settings (UJOB_LIMIT, PJOB_LIMIT, etc.), plus some insight on where to configure all these settings: which are set up by the installation process, and which can be configured by the SAS or LSF administrator. This session is definitely NOT for experts. It is for those who are about to step into an enterprise deployment of SAS and want to understand how the SAS server sessions they know so well can run on a shared platform.
Andrew Howell, ANJ Solutions
Duration and severity data arise in several fields including biostatistics, demography, economics, engineering, and sociology. SAS® procedures LIFETEST, LIFEREG, and PHREG are the workhorses for analysis of time to event data in applications in biostatistics. Similar methods apply to the magnitude or severity of a random event, where the outcome might be right, left, or interval censored and/or right or left truncated. All combinations of types of censoring and truncation could be present in the data set. Regression models such as the accelerated failure time model, the Cox model, and the non-homogeneous Poisson model have extensions to address time-varying covariates in the analysis of clustered outcomes, multivariate outcomes of mixed types, and recurrent events. We present an overview of new capabilities that are available in the procedures QLIM, QUANTLIFE, RELIABILITY, and SEVERITY with examples illustrating their application using empirical data sets drawn from easily accessible sources.
Joseph Gardiner, Michigan State University
In healthcare, we often express our analytics results as being "adjusted." For example, you might have read a study in which the authors reported the data as age-adjusted or risk-adjusted. The concept of adjustment is widely used in program evaluation, comparing quality indicators across providers and systems, forecasting incidence rates, and in cost-effectiveness research. In order to make reasonable comparisons across time, place, or population, we need to account for small sample sizes and case-mix variation; in other words, we need to level the playing field and account for differences in health status and for uniqueness in a given population. If you are new to healthcare, what it really means to adjust the data in order to make comparisons might not be obvious. In this paper, we explore the methods by which we control for potentially confounding variables in our data. We do so through a series of examples from the healthcare literature in both primary care and health insurance. In this survey of methods, we discuss the concepts of rates and how they can be adjusted for demographic strata (such as age, gender, and race), as well as health risk factors such as case mix.
Greg Nelson, ThotWave
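One common way to age-adjust a rate, as discussed in the abstract above, is direct standardization, sketched here in Python rather than SAS. This is not the author's code and the abstract does not say which adjustment methods the paper covers in detail, so the choice of method, the two hypothetical providers, the age strata, and the standard population are all assumptions. Each stratum-specific rate is weighted by that stratum's share of a common standard population.

```python
def crude_rate(events, population):
    """Overall rate ignoring strata: total events / total population."""
    return sum(events) / sum(population)

def directly_adjusted_rate(events, population, standard_population):
    """Direct standardization: weight each stratum-specific rate by the
    stratum's share of a shared standard population."""
    total_standard = sum(standard_population)
    return sum((e / n) * (s / total_standard)
               for e, n, s in zip(events, population, standard_population))

# Hypothetical data for two age strata (younger, older).
# Both providers have identical stratum rates (1% and 9%) but a different age mix.
provider_a_events, provider_a_population = [10, 90], [1000, 1000]
provider_b_events, provider_b_population = [30, 90], [3000, 1000]
standard_population = [2000, 2000]

crude_a = crude_rate(provider_a_events, provider_a_population)
crude_b = crude_rate(provider_b_events, provider_b_population)
adjusted_a = directly_adjusted_rate(
    provider_a_events, provider_a_population, standard_population)
adjusted_b = directly_adjusted_rate(
    provider_b_events, provider_b_population, standard_population)
```

In this made-up example the crude rates differ only because provider B serves a younger population; once both are weighted to the same standard population, the adjusted rates coincide.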
Spinal epidural abscess (SEA) is a serious complication in hemodialysis (HD) patients, yet there is little medical literature that discusses it. This analysis identified risk factors and co-morbidities associated with SEA, as well as risk factors for mortality following the diagnosis. All incident HD cases from the United States Renal Data System for calendar years 2005-2008 were queried for a diagnosis of SEA. Potential clinical covariates, survival, and risk factors were recovered using ICD-9 diagnosis codes. Log-binomial regressions were performed using PROC GENMOD to assess the relative risks, and Cox regression models were run using PROC PHREG to estimate hazard ratios for mortality. For the 4-year study period, 660/355084 (0.19%) HD patients were identified with SEA, the largest cohort to date. Older age (RR=1.625), infectious comorbidities including bacteremia (RR=7.7976), methicillin-resistant Staphylococcus aureus infection (RR=2.6507), hepatitis C (RR=1.545), and non-infectious factors including diabetes (RR=1.514) and presence of vascular catheters (RR=1.348) were identified as significant risk factors for SEA. SEA in HD patients was associated with an increased risk of death (HR=1.20). Older age (HR=2.269), the presence of dialysis catheters (HR=1.884), cirrhosis (HR=1.715), decubitus ulcers (HR=1.669), bacteremia (HR=1.407), and total parenteral nutrition (HR=1.376) constitute the greatest risk factors for death after SEA diagnosis and thus necessitate a comprehensive approach to management.
Stephanie Baer, Georgia Regents University and Augusta VAMC
Rhonda Colombo, Georgia Regents University
Lu Huber, Georgia Regents University
Chan Jin, Georgia Regents University
N. Stanley Nahman Jr., Georgia Regents University and Augusta VAMC
Jennifer White, Georgia Regents University
Guidelines from the International Conference on Harmonisation (ICH) suggest that clinical trial data should be actively monitored to ensure data quality. Traditional interpretation of this guidance has often led to 100 percent source data verification (SDV) of respective case report forms through on-site monitoring. Such monitoring activities can also identify deficiencies in site training and uncover fraudulent behavior. However, such extensive on-site review is time-consuming, expensive and, as is true for any manual effort, limited in scope and prone to error. In contrast, risk-based monitoring makes use of central computerized review of clinical trial data and site metrics to determine whether sites should receive more extensive quality review through on-site monitoring visits. We demonstrate a risk-based monitoring solution within JMP® Clinical to assess clinical trial data quality. Further, we describe a suite of tools used for identifying potentially fraudulent data at clinical sites. Data from a clinical trial of patients who experienced an aneurysmal subarachnoid hemorrhage provide illustration.
Richard Zink, SAS
This discussion uses SAS® Office Analytics as an example to demonstrate the importance of preparing for the SAS® installation. There are many nuances as well as requirements that need to be addressed before you do an installation. These requirements are basically similar, yet they differ according to the target installation operating system. In other words, there are some differences in preparation routines for Windows and *Nix flavors. Our discussion focuses on these three topics: 1. Pre-installation considerations such as sizing, storage, proper credentials, and third-party requirements; 2. Installation steps and requirements; and 3. Post-installation configuration. In addition to preparation, this paper also discusses potential issues and pitfalls to watch out for, as well as best practices.
Rafi Sheikh, Analytiks International, Inc.
Due to XML's growing role in data interchange, it is increasingly important for SAS® programmers to become proficient with SAS technologies and techniques for creating and consuming XML. The current work expands on a SAS® Global Forum 2013 presentation that dealt with these topics, providing additional examples of using XML maps to read and write XML files and using the Output Delivery System (ODS) to create custom tagsets for generating XML.
Chris Schacherer, Clinical Data Management Systems, LLC
Statistical mediation analysis is common in business, social sciences, epidemiology, and related fields because it explains how and why two variables are related. For example, mediation analysis is used to investigate how product presentation affects liking the product, which then affects the purchase of the product. Mediation analysis also evaluates the mechanism by which a health intervention changes norms that then change health behavior. Methods for mediation analysis are an active area of research. Some recent research in statistical mediation analysis focuses on extracting accurate information from small samples by using Bayesian methods. The Bayesian framework offers an intuitive solution to mediation analysis with small samples; namely, incorporating prior information into the analysis when there is existing knowledge about the expected magnitude of mediation effects. Using diffuse prior distributions with no prior knowledge allows researchers to reason in terms of probability rather than in terms of (or in addition to) statistical power. Using SAS® PROC MCMC, researchers can choose one of two simple and effective methods to incorporate their prior knowledge into the statistical analysis, and can obtain the posterior probabilities for quantities of interest such as the mediated effect. This project presents four examples of using PROC MCMC to analyze a single mediator model with real data using: (1) diffuse prior information for each regression coefficient in the model, (2) informative prior distributions for each regression coefficient, (3) diffuse prior distribution for the covariance matrix of variables in the model, and (4) informative prior distribution for the covariance matrix.
David MacKinnon, Arizona State University
Milica Miocevic, Arizona State University
Linear regression has been a widely used approach in social and medical sciences to model the association between a continuous outcome and the explanatory variables. Assessing the model assumptions, such as linearity, normality, and equal variance, is a critical step for choosing the best regression model. If any of the assumptions are violated, one can apply different strategies to improve the regression model, such as performing transformation of the variables or using a spline model. SAS® has been commonly used to assess and validate the postulated model and SAS® 9.3 provides many new features that increase the efficiency and flexibility in developing and analyzing the regression model, such as ODS Statistical Graphics. This paper aims to demonstrate necessary steps to find the best linear regression model in SAS 9.3 in different scenarios where variable transformation and the implementation of a spline model are both applicable. A simulated data set is used to demonstrate the model developing steps. Moreover, the critical parameters to consider when evaluating the model performance are also discussed to achieve accuracy and efficiency.
Ning Huang, University of Southern California
SAS® OLAP technology is used to organize and present summarized data for business intelligence applications. It features flexible options for creating and storing aggregations to improve performance and brings a powerful multi-dimensional approach to querying data. This paper focuses on managing security features available to OLAP cubes through the combination of SAS metadata and MDX logic.
Stephen Overton, Overton Technologies, LLC
The Purchasing Department is considering contracting with your team for a new SAS® Enterprise BI application. The department manager has already met with SAS® and seen the sales pitch, and he is very interested. But the manager is a tightwad and not sure about spending the money. Also, he wants his team to be the primary developers for this new application. Before investing his money on training, programming, and support, he would like a proof-of-concept. This paper will walk you through the seven steps to create a SAS Enterprise BI POC project: (1) Develop a kick-off meeting including a full demo of the SAS Enterprise BI tools. (2) Set up your UNIX file systems and security. (3) Set up your SAS metadata ACTs, users, groups, folders, and libraries. (4) Make sure the necessary SAS client tools are installed on the developers' machines. (5) Hold a SAS Enterprise BI workshop to introduce the developers to the basics, including SAS® Enterprise Guide®, SAS® Stored Processes, SAS® Information Maps, SAS® Web Report Studio, SAS® Information Delivery Portal, and SAS® Add-In for Microsoft Office, along with supporting documentation. (6) Work with them to develop a simple project, one that highlights the benefits of SAS Enterprise BI and shows several methods for achieving the desired results. (7) Last but not least, follow up! Remember, your goal is not to launch a full-blown application. Instead, we'll strive toward helping them see the potential in your organization for applying this methodology.
Sheryl Weise, Wells Fargo
Big data is all the rage these days, with the proliferation of data-accumulating electronic gadgets and instrumentation. At the heart of big data analytics is the MapReduce programming model. As a framework for distributed computing, MapReduce uses a divide-and-conquer approach to allow large-scale parallel processing of massive data. As the name suggests, the model consists of a Map function, which first splits data into key-value pairs, and a Reduce function, which then carries out the final processing of the mapper outputs. It is not hard to see how these functions can be simulated with the SAS® hash objects technique, and in reality, implemented in the new SAS® DS2 language. This paper demonstrates how hash object programming can handle data in a MapReduce fashion and shows some potential applications in physics, chemistry, biology, and finance.
Joseph Hinson, Accenture Life Sciences
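As a rough illustration of the Map and Reduce roles described in the abstract above (not the paper's code, and written in Python rather than SAS), the sketch below uses an ordinary dictionary as a stand-in for a hash object. The function names and the word-count task are assumptions chosen only for illustration.

```python
from collections import defaultdict


def map_phase(records):
    """Map: split each record into (key, value) pairs, one pair per word."""
    for line in records:
        for word in line.lower().split():
            yield word, 1


def reduce_phase(pairs):
    """Reduce: combine all values that share a key, using a dict as the hash table."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Hypothetical input records.
lines = ["big data big plans", "data in motion"]
word_counts = reduce_phase(map_phase(lines))
print(word_counts)
```

The dictionary plays the part of the keyed lookup table: each mapper output either creates a new key or updates the running value for an existing one.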
Have you ever needed to use dates as values to loop through a table? For example, how many events occurred by 1, 2, 3, or n months ahead? Maybe you just changed the dates manually and re-ran the query n times? This is a common need in economic and behavioral sciences. This presentation demonstrates how to create a table of dates that can be used with SAS® macro variables to loop through a table. Using this dates table in combination with the SAS DO loop ensures accuracy and saves time.
Scott Fawver, Arch Mortgage Insurance Company
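The idea in the abstract above, building a table of cutoff dates once and looping over it instead of editing dates by hand, might look roughly like the following. This is a Python sketch rather than SAS macro code, and the helper names, the month-start cutoffs, and the sample event dates are all assumptions made for illustration.

```python
from datetime import date


def add_months(start, months):
    """Return the first day of the month that is `months` after start's month."""
    index = start.year * 12 + (start.month - 1) + months
    return date(index // 12, index % 12 + 1, 1)


def build_dates_table(start, n_months):
    """One row per horizon: (months_ahead, cutoff_date)."""
    return [(k, add_months(start, k)) for k in range(1, n_months + 1)]


def events_by_horizon(event_dates, start, n_months):
    """Count events falling on or after `start` and before each cutoff."""
    counts = {}
    for months_ahead, cutoff in build_dates_table(start, n_months):
        counts[months_ahead] = sum(start <= d < cutoff for d in event_dates)
    return counts


# Hypothetical event dates.
events = [date(2014, 1, 15), date(2014, 2, 3), date(2014, 2, 20), date(2014, 4, 1)]
horizon_counts = events_by_horizon(events, date(2014, 1, 1), 3)
print(horizon_counts)
```

Because the cutoffs are generated rather than typed, changing the start date or the number of horizons requires no edits to the counting logic.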
Systematic reviews have become increasingly important in healthcare, particularly when there is a need to compare new treatment options and to justify clinical effectiveness versus cost. This paper describes a method in SAS/STAT® 9.2 for computing weighted averages and weighted standard deviations of clinical variables across treatment options while correctly using these summary measures to make accurate statistical inference. The analyses of data from systematic reviews typically involve computations of weighted averages and comparisons across treatment groups. However, the application of the TTEST procedure does not currently take into account weighted standard deviations when computing p-values. The use of a default non-weighted standard deviation can lead to incorrect statistical inference. This paper introduces a method for computing correct p-values using weighted averages and weighted standard deviations. Given a data set containing variables for three treatment options, we want to make pairwise comparisons of three independent treatments. This is done by creating two temporary data sets using PROC MEANS, which yields the weighted means and weighted standard deviations. We then perform a t-test on each temporary data set. The resultant data sets, containing all comparisons of the treatment options, are merged and then transposed to obtain the necessary statistics. The resulting output provides pairwise comparisons of each treatment option and uses the weighted standard deviations to yield the correct p-values in a desired format. This method allows the use of correct weighted standard deviations using PROC MEANS and PROC TTEST in summarizing data from a systematic review while providing correct p-values.
Ravi Tejeshwar Reddy Gaddameedi, California State University, East Bay
Usha Kreaden, Intuitive Surgical
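To make the distinction between weighted and non-weighted standard deviations in the abstract above concrete, here is a small Python sketch (not the paper's SAS code). The divisor n - 1 applied to the weighted sum of squares is one common convention and is an assumption here, as are the function names and the study values.

```python
import math
import statistics


def weighted_mean(values, weights):
    """Weighted average: sum(w * x) / sum(w)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def weighted_sd(values, weights):
    """Weighted SD: sqrt(sum(w * (x - mean)^2) / (n - 1)); the divisor is an assumed convention."""
    mean = weighted_mean(values, weights)
    sum_squares = sum(w * (x - mean) ** 2 for x, w in zip(values, weights))
    return math.sqrt(sum_squares / (len(values) - 1))


# Hypothetical study-level means for one treatment, weighted by study size.
study_means = [10.0, 12.0, 14.0]
study_sizes = [1, 1, 2]

wm = weighted_mean(study_means, study_sizes)
wsd = weighted_sd(study_means, study_sizes)
unweighted_sd = statistics.stdev(study_means)
print(wm, wsd, unweighted_sd)
```

In this toy input the weighted and unweighted standard deviations differ, which is why a test statistic built from the unweighted value would not reflect the weighting.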
One in every four people dies of heart disease in the United States, and stress is an important factor which contributes towards a cardiac event. As the condition of the heart gradually worsens with age, the factors that lead to a myocardial infarction when the patients are subjected to stress are analyzed. The data used for this project was obtained from a survey conducted through the Department of Biostatistics at Vanderbilt University. The objective of this poster is to predict the chance of survival of a patient after a cardiac event. Then by using decision trees, neural networks, regression models, bootstrap decision trees, and ensemble models, we predict the target which is modeled as a binary variable, indicating whether a person is likely to survive or die. The top 15 models, each with an accuracy of over 70%, were considered. The model will give important survival characteristics of a patient which include his history with diabetes, smoking, hypertension, and angioplasty.
Yogananda Domlur Seetharama, Oklahoma State University
Sai Vijay Kishore Movva, Oklahoma State University
'Can I have that in Excel?' This is a request that makes many of us shudder. Now your boss has discovered Microsoft Excel pivot tables. Unfortunately, he has not discovered how to make them. So you get to extract the data, massage the data, put the data into Excel, and then spend hours rebuilding pivot tables every time the corporate data is refreshed. In this workshop, you learn to be the armchair quarterback and build pivot tables without leaving the comfort of your SAS® environment. In this workshop, you learn the basics of Excel pivot tables and, through a series of exercises, you learn how to augment basic pivot tables first in Excel, and then using SAS. No prior knowledge of Excel pivot tables is required.
Peter Eberhardt, Fernwood Consulting Group Inc.
For decades, SAS® has been the cornerstone of many organizations for business reporting. In more recent times, the ability to quickly determine the performance of an organization through the use of dashboards has become a requirement. Different ways of providing dashboard capabilities are discussed in this paper: using out-of-the-box solutions such as SAS® Visual Analytics and SAS® BI Dashboard, through to alternative solutions using SAS® Stored Processes, batch processes, and SAS® Integration Technologies. Extending the available indicators is also discussed, using Graph Template Language and KPI indicators provided with Base SAS®, as well as alternatives such as Google Charts and Flash objects. Real-world field experience, problem areas, solutions, and tips are shared, along with live examples of some of the different methods.
Mark Bodt, The Knowledge Warehouse (Knoware)
The SAS® Enterprise Guide® Query Builder is one of the most powerful components of the software. It enables a user to bring in data, join, drop and add columns, compute new columns, sort, filter data, leverage the advanced expression builder, change column attributes, and more! This presentation provides an overview of the major features of this powerful tool and how to leverage it every day.
Steven First, Systems Seminar Consultants
Jennifer First-Kluge, Systems Seminar Consultants
'This is the way I have always done it and it works fine for me.' Have you heard yourself or others say this when someone suggests a new technique to help solve a problem? Most of us have a set of tricks and techniques from which we draw when starting a new project. Over time we might overlook newer techniques because our old toolkit works just fine. Sometimes we actively avoid new techniques because our initial foray leaves us daunted by the steep learning curve to mastery. For me, the PRX functions and the SAS® hash object fell into this category. In this workshop, we address possible objections to learning to use the SAS hash object. We start with the fundamentals of setting up the hash object and work through a variety of practical examples to help you master this powerful technique.
Peter Eberhardt, Fernwood Consulting Group Inc.
As a longtime Base SAS® programmer, whether to use a different application for programming is a constant question when powerful applications such as SAS® Enterprise Guide® are available. This paper provides some important tips for a programmer, such as the best way to use the code window and how to take advantage of system-generated code in SAS Enterprise Guide 5.1. This paper also explains the differences between some of the functions and procedures in Base SAS and SAS Enterprise Guide. It highlights features in SAS Enterprise Guide such as process flow, data access management, and report automation, including formatting using XML tag sets.
Anjan Matlapudi, AmeriHealth Caritas
This paper gives you a better idea of how and where to use the record lookup functions to locate observations where a variable has some characteristic. Various related functions are illustrated to search numeric and character values in this process. Code is shown with time comparisons. I will discuss three possible ways to retrieve records: the SAS® DATA step, PROC SQL, and Perl regular expressions. Real and CPU processing times will be highlighted when comparing record retrieval across these methods. Although the program is written for the PC using SAS® 9.2 in a Windows XP 32-bit environment, all the functions are applicable to any system. All the tools discussed are in Base SAS®. The typical attendee or reader will have some experience in SAS, but not a lot of experience dealing with large amounts of data.
Anjan Matlapudi, AmeriHealth Caritas
SAS® Add-In for Microsoft Office remains a popular tool for people who are not SAS® programmers due to its easy interface with the SAS servers. In this session, you'll learn some of the many tricks that other organizations use for getting more value out of the tool.
Tricia Aanderud, And Data Inc
The DOW-loop is not official terminology that one can find in SAS® documentation, but it has been well known and widely used among experienced SAS programmers. The DOW-loop was developed over a decade ago by a few SAS gurus, including Don Henderson, Paul Dorfman, and Ian Whitlock. A common construction of the DOW-loop consists of a DO-UNTIL loop with a SET and a BY statement within the loop. This construction isolates actions that are performed before and after the loop from the action within the loop, which results in eliminating the need for retaining or resetting the newly created variables to missing in the DATA step. In this talk, in addition to explaining the DOW-loop construction, we review how to apply the DOW-loop to various applications.
Arthur Li, City of Hope
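The control flow described in the abstract above, an inner loop that reads rows until the last row of a BY group with per-group work placed before and after it, can be mimicked outside SAS. The following Python sketch is only an analogy to the DO-UNTIL/SET/BY construction; the function name and sample rows are assumptions.

```python
def dow_loop(rows):
    """rows: (key, value) pairs already sorted by key, like a BY-group data set."""
    reader = iter(rows)
    pending = next(reader, None)
    results = []
    while pending is not None:
        # Before the inner loop: per-group variables start fresh on each pass,
        # so nothing has to be retained or reset by hand.
        key, total, count = pending[0], 0, 0
        # Inner loop: keep reading until the last row of the current group.
        while pending is not None and pending[0] == key:
            total += pending[1]
            count += 1
            pending = next(reader, None)
        # After the inner loop: emit one summary row per group.
        results.append((key, total, count))
    return results


# Hypothetical sorted input.
visits = [("A", 5), ("A", 7), ("B", 1), ("C", 2), ("C", 3), ("C", 4)]
group_totals = dow_loop(visits)
print(group_totals)
```

Each pass of the outer loop corresponds to one group, which is what separates the before-group and after-group actions from the row-level action inside the loop.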
You have built the simple bar chart and mastered the art of layering multiple plot statements to create complex graphs like the Survival Plot using the SGPLOT procedure. You know all about how to use plot statements creatively to get what you need and how to customize the axes to achieve the look and feel you want. Now it's time to up your game and step into the realm of the Graphics Wizard. Behold the magical powers of Graph Template Language Layouts! Here you will learn the esoteric art of creating complex multi-cell graphs using LAYOUT LATTICE. This is the incantation that gives you the power to build complex, multi-cell graphs like the Forest plot, Stock plots with multiple indicators like MACD and Stochastics, Adverse Events by Relative Risk graphs, and more. If you ever wondered how the Diagnostics panel in the REG procedure was built, this paper is for you. Be warned, this is not the realm for the faint of heart!
Sanjay Matange, SAS
When reading data files or writing SAS® programs, we are often hunting for the right format or informat. There are so many to choose from! Does it seem like too many to search the manual? Let SAS help find the right one! We use the SAS dictionary table VFORMAT and a very small SAS program. This presentation demonstrates how two simple functions unlock the potential of this great resource: SASHELP.VFORMAT.
Peter Crawford, Crawford Software Consultancy Limited
Researchers often rely on self-report for survey-based studies. The accuracy of this self-reported data is often unknown, particularly in a medical setting that serves an under-insured patient population with varying levels of health literacy. We recruited participants from the waiting room of a St. Louis primary care safety net clinic to participate in a survey investigating the relationship between health environments and health outcomes. The survey included questions regarding personal and family history of chronic disease (diabetes, heart disease, and cancer) as well as BMI and self-perceived weight. We subsequently accessed the participant's electronic medical record (EMR) and collected physician-reported data on the same variables. We calculated concordance rates between participant answers and information gathered from EMRs using McNemar's chi-squared test. Logistic regression was then performed to determine the demographic predictors of concordance. Three hundred thirty-two patients completed surveys as part of the pilot phase of the study; 64% female, 58% African American, 4% Hispanic, 15% with less than high school level education, 76% annual household income less than $20,000, and 29% uninsured. Preliminary findings suggest an 82-94% concordance rate between self-reported and medical record data across outcomes, with the exception of family history of cancer (75%) and heart disease (42%). Our conclusion is that determining the validity of the self-reported data in the pilot phase influences whether self-reported personal and family history of disease and BMI are appropriate for use in this patient population.
Melody Goodman, Washington University in St. Louis School of Medicine
Kimberly Kaphingst, Washington University School of Medicine
Sarah Lyons, Washington University School of Medicine
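For readers unfamiliar with the two quantities named in the abstract above, a concordance rate is the share of participants whose two answers agree, and McNemar's chi-squared statistic compares the two kinds of disagreement. The Python sketch below (not the authors' SAS code) uses the uncorrected statistic (b - c)^2 / (b + c) with 1 degree of freedom; the function name and the paired counts are made up for illustration.

```python
import math


def concordance_and_mcnemar(pairs):
    """pairs: (self_report, medical_record) booleans, one tuple per participant."""
    both_yes = sum(1 for s, m in pairs if s and m)
    both_no = sum(1 for s, m in pairs if not s and not m)
    b = sum(1 for s, m in pairs if s and not m)   # self-report yes, record no
    c = sum(1 for s, m in pairs if not s and m)   # self-report no, record yes
    concordance = (both_yes + both_no) / len(pairs)
    if b + c == 0:
        return concordance, 0.0, 1.0
    statistic = (b - c) ** 2 / (b + c)
    # Upper-tail probability of a chi-squared variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return concordance, statistic, p_value


# Hypothetical paired responses for one survey item.
pairs = ([(True, True)] * 40 + [(False, False)] * 45
         + [(True, False)] * 12 + [(False, True)] * 3)
concordance, statistic, p_value = concordance_and_mcnemar(pairs)
print(concordance, statistic, p_value)
```

Only the discordant pairs enter the test statistic, so a high concordance rate and a significant McNemar result can occur together when the disagreements lean mostly in one direction.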
The world's first wind resource assessment buoy, residing in Lake Michigan, uses a pulsing laser wind sensor to accurately measure wind speed, direction, and turbulence offshore up to wind turbine hub-height and across the blade span every second. Understanding wind behavior would be tedious and fatiguing with such large data sets. However, SAS/GRAPH® 9.4 helps the user grasp wind characteristics over time and at different altitudes by exploring the data visually. This paper covers graphical approaches to evaluate wind speed validity, seasonal wind speed variation, and storm systems to inform engineers on the candidacy of Lake Michigan offshore wind farms.
Aaron Clark, Grand Valley State University
Missing observations caused by dropouts or skipped visits present a problem in studies of longitudinal data. When the analysis is restricted to complete cases and the missing data depend on previous responses, the generalized estimating equation (GEE) approach, which is commonly used when the population-average effect is of primary interest, can lead to biased parameter estimates. The new GEE procedure in SAS/STAT® 13.2 implements a weighted GEE method, which provides consistent parameter estimates when the dropout mechanism is correctly specified. When none of the data are missing, the method is identical to the usual GEE approach, which is available in the GENMOD procedure. This paper reviews the concepts and statistical methods. Examples illustrate how you can apply the GEE procedure to incomplete longitudinal data.
Guixian Lin, SAS
Bob Rodriguez, SAS