Advanced Papers A-Z

A
Session 8000-2016:
A SAS® Macro for Geographically Weighted Negative Binomial Regression
Geographically Weighted Negative Binomial Regression (GWNBR) was developed by Silva and Rodrigues (2014). It is a generalization of the Geographically Weighted Poisson Regression (GWPR) proposed by Nakaya and others (2005) and of the Poisson and negative binomial regressions. This paper presents a SAS® macro, written in SAS/IML® software, that estimates the GWNBR model and shows how to use the GMAP procedure to draw the maps.
Read the paper (PDF) | Download the data file (ZIP)
Alan Silva, University of Brasilia
Thais Rodrigues, University of Brasilia
Session 10728-2016:
A Silver Lining in the Cloud: Deployment of SAS® Visual Analytics 7.2 on AWS
Amazon Web Services (AWS) as a platform for analytics and data warehousing has gained significant adoption over the years. With SAS® Visual Analytics being one of the preferred tools for data visualization and analytics, it is imperative to be able to deploy SAS Visual Analytics on AWS. This ensures swift analysis and reporting on large amounts of data with SAS Visual Analytics by minimizing the movement of data across environments. This paper focuses on installing SAS Visual Analytics 7.2 in an Amazon Web Services environment, migration of metadata objects and content from previous versions to the deployment on the cloud, and ensuring data security.
Read the paper (PDF)
Vimal Raj Arockiasamy, Kavi Associates
Rajesh Inbasekaran, Kavi Associates
Session 9280-2016:
An Application of the DEA Optimization Methodology to Optimize the Collection Direction Capacity of a Bank
In the Collection Direction of a well-recognized Colombian financial institution, there was no methodology that provided an optimal number of collection agents to improve the collection task and make it possible for more customers to commit to the minimum monthly payment on their debt. The objective of this paper is to apply the Data Envelopment Analysis (DEA) optimization methodology to determine the optimal number of agents to maximize the monthly collection in the bank. We show that the results can have a positive impact on credit portfolio behavior and reduce the collection management cost. The DEA optimization methodology has been successfully used in various fields to solve multi-criteria optimization problems, but it is not commonly used in the financial sector, mostly because this methodology requires specialized software, such as SAS® Enterprise Guide®. In this paper, we present the OPTMODEL procedure and show how to formulate the optimization problem, program the SAS® code, and adequately process the available data.
Read the paper (PDF) | Download the data file (ZIP)
Miguel Díaz, Scotiabank - Colpatria
Oscar Javier Cortés Arrigui, Scotiabank - Colpatria
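As an illustration of the kind of formulation this paper describes, here is a minimal single-input, single-output DEA (CCR multiplier form) model in PROC OPTMODEL. The data set WORK.AGENTS and its variables are hypothetical placeholders, not the authors' code.

proc optmodel;
   set <num> AGENTS;                          /* decision-making units (agents) */
   num hours{AGENTS};                         /* input: collection hours        */
   num collected{AGENTS};                     /* output: amount collected       */
   read data work.agents into AGENTS=[agent_id] hours collected;

   num target = min{a in AGENTS} a;           /* score one agent; loop in practice */
   var u >= 1e-6;                             /* output weight                  */
   var v >= 1e-6;                             /* input weight                   */

   max Efficiency = u*collected[target];
   con Normalize: v*hours[target] = 1;
   con Ratio{a in AGENTS}: u*collected[a] - v*hours[a] <= 0;

   solve;
   print u v Efficiency;
quit;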
Session 11701-2016:
An End-to-End, Automation Framework for Implementing Identity-Driven Row-Level Security Using SAS® Visual Analytics
At Royal Bank of Scotland, business intelligence users require sophisticated security permissions both at object level and data (row) level in order to comply with data security, audit, and regulatory requirements. When we rolled out SAS® Visual Analytics to our two main stakeholder groups, this was identified as a key requirement as data is no longer restricted to the desktop but is increasingly available on mobile devices such as tablets and smart phones. Implementing row-level security (RLS) controls, in addition to standard security measures such as authentication, is a most effective final layer in your data authorization process. RLS procedures in leading relational database management systems (RDBMSs) and business intelligence (BI) software are fairly commonplace, but with the emergence of big data and in-memory visualization tools such as SAS® Visual Analytics, those RLS procedures now need to be extended to the memory interface. Identity-driven row-level security is a specific RLS technique that enables the same report query to retrieve different sets of data in accordance with the varying security privileges afforded to the respective users. This paper discusses an automated framework approach for applying identity-driven RLS controls on SAS® Visual Analytics and our plans to implement a generic end-to-end RLS framework extended to the Teradata data warehouse.
Read the paper (PDF)
Paul Johnson, Sopra Steria
Ekaitz Goienola, SAS
Dileep Pournami, RBS
Session 10480-2016:
Architecture and Deployment of SAS® Visual Analytics 7.2 with a Distributed SAS® LASR™ Analytic Server for Showcasing Reports to the Public
This paper demonstrates the deployment of SAS® Visual Analytics 7.2 with a distributed SAS® LASR™ Analytic Server and an internet-facing web tier. The key factor in this deployment is to establish the secure web server connection using a third-party certificate to perform the client and server authentication. The deployment process involves the following steps: 1) Establish the analytics cluster, which consists of SAS® High-Performance Deployment of Hadoop and the deployment of high-performance analytics environment master and data nodes. 2) Get the third-party signed certificate and the key files. 3) Deploy the SAS Visual Analytics server tier and middle tier. 4) Deploy the standalone web tier with HTTP protocol configured using secure sockets. 5) Deploy the SAS® Web Infrastructure Platform. 6) Perform post-installation validation and configuration to handle the certificate between the servers.
Read the paper (PDF)
Vimal Raj Arockiasamy, Kavi Associates
Ratul Saha, Kavi Associates
B
Session SAS6761-2016:
Best Practices for Configuring Your I/O Subsystem for SAS®9 Applications
The power of SAS®9 applications allows information and knowledge creation from very large amounts of data. Analysis that used to consist of 10s to 100s of gigabytes (GBs) of supporting data has rapidly grown into the 10s to 100s of terabytes (TBs). This data expansion has resulted in more and larger SAS® data stores. Setting up file systems to support these large volumes of data with adequate performance, as well as ensuring adequate storage space for the SAS temporary files, can be very challenging. Technology advancements in storage and system virtualization, flash storage, and hybrid storage management require continual updating of best practices to configure I/O subsystems. This paper presents updated best practices for configuring the I/O subsystem for your SAS®9 applications, ensuring adequate capacity, bandwidth, and performance for your SAS®9 workloads. We have found that very few storage systems work ideally with SAS with their out-of-the-box settings, so it is important to convey these general guidelines.
Read the paper (PDF)
Tony Brown, SAS
Session 11562-2016:
Beyond Best Practice: Grid Computing in the Modern World
Architects do not see a single architectural solution, such as SAS® Grid Manager, satisfying the varied needs of users across the enterprise. Multi-tenant environments need to support data movement (such as ETL), analytic processing (such as forecasting and predictive modeling), and reporting, which can include everything from visual data discovery to standardized reporting. SAS® users have a myriad of choices that might seem at odds with one another, such as in-database versus in-memory, or data warehouses (tightly structured schemes) versus data lakes and event stream processing. Whether fit for purpose or for potential, these choices force us as architects to modernize our thinking about the appropriateness of architecture, configuration, monitoring, and management. This paper discusses how SAS® Grid Manager can accommodate the myriad use cases and the best practices used in large-scale, multi-tenant SAS environments.
Read the paper (PDF) | Watch the recording
Jan Bigalke, Allianz Managed Operations & Services SE
Greg Nelson, ThotWave
Session 9740-2016:
Big Data in Higher Education: From Concept to Results
Higher education is taking on big data initiatives to help improve student outcomes and operational efficiencies. This paper shows how one institution went from white paper to action. Challenges such as developing a common definition of big data, access to certain types of data, and bringing together institutional groups who had little previous contact are highlighted, as is the stepwise, collaborative approach taken. How the data was eventually brought together and analyzed to generate actionable results is also covered. Learn how to make big data a part of your analytics toolbox.
Read the paper (PDF)
Stephanie Thompson, Datamum
C
Session 11845-2016:
Credit Card Interest and Transactional Income Prediction with an Account-Level Transition Panel Data Model Using SAS/STAT® and SAS/ETS® Software
Credit card profitability prediction is a complex problem because of the variety of card holders' behavior patterns and the different sources of interest and transactional income. Each consumer account can move to a number of states such as inactive, transactor, revolver, delinquent, or defaulted. This paper i) describes an approach to credit card account-level profitability estimation based on multistate and multistage conditional probability models and different types of income estimation, and ii) compares methods for the most efficient and accurate estimation. We use application, behavioral, card state dynamics, and macroeconomic characteristics, and their combinations, as predictors. We use different types of logistic regression, such as multinomial logistic regression, ordered logistic regression, and multistage conditional binary logistic regression, with the LOGISTIC procedure for state transition probability estimation. The state transition probabilities are used as weights for the interest rate and non-interest income models (which one is applied depends on the account state). Thus, the scoring model is split according to the customer behavior segment and the source of generated income. The total income consists of interest and non-interest income. Interest income is estimated with credit limit utilization rate models. We test and compare five proportion models with the NLMIXED, LOGISTIC, and REG procedures in SAS/STAT® software. Non-interest income depends on the probability of being in a particular state, the two-stage model of conditional probability to make a point-of-sale (POS) transaction or cash withdrawal (ATM), and the amount of income generated by this transaction. We use the LOGISTIC procedure for conditional probability prediction and the GLIMMIX and PANEL procedures for direct amount estimation with pooled and random-effect panel data. The validation results confirm that traditional techniques can be effectively applied to complex tasks with many parameters and multilevel business logic. The model is used in credit limit management, risk prediction, and client behavior analytics.
Read the paper (PDF)
Denys Osipenko, the University of Edinburgh Business School
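A minimal sketch of the kind of state-transition model the abstract describes, using PROC LOGISTIC with the generalized logit link; the data set and variable names are assumptions, not the authors' specification.

proc logistic data=work.account_months;
   class current_state gender / param=glm;
   model next_state(ref='INACTIVE') = current_state utilization months_on_book gender
         / link=glogit;                              /* multinomial (generalized) logit */
   output out=work.transition_probs predprobs=individual;
run;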
Session 7500-2016:
Customer Lifetime Value Modeling
Customer lifetime value (LTV) estimation involves two parts: the survival probabilities and the profit margins. This article describes the estimation of those probabilities using discrete-time logistic hazard models. The estimation of profit margins is based on linear regression. In scenarios where outliers are present among the margins, we suggest applying robust regression with PROC ROBUSTREG.
Read the paper (PDF) | View the e-poster or slides (PDF)
Vadim Pliner, Verizon Wireless
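For the outlier-robust margin model mentioned above, a minimal PROC ROBUSTREG call might look like the following; the data set and variable names are hypothetical.

proc robustreg data=work.customer_margins method=m;
   model profit_margin = tenure_months avg_monthly_revenue num_lines
         / diagnostics leverage;                     /* flag outliers and leverage points */
run;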
D
Session 10160-2016:
Design for Success: An Approach to Metadata Architecture for Distributed Visual Analytics
Metadata is an integral and critical part of any environment. Metadata facilitates resource discovery and provides unique identification of every single digital component of a system, simple to complex. SAS® Visual Analytics, one of the most powerful analytics visualization platforms, leverages the power of metadata to provide a plethora of functionalities for all types of users. The possibilities range from real-time advanced analytics and power-user reporting to advanced deployment features for a robust and scalable distributed platform to internal and external users. This paper explains the best practices and advanced approaches for designing and managing metadata for a distributed global SAS Visual Analytics environment. Designing and building the architecture of such an environment requires attention to important factors like user groups and roles, access management, data protection, data volume control, performance requirements, and so on. This paper covers how to build a sustainable and scalable metadata architecture through a top-down hierarchical approach. It helps SAS Visual Analytics Data Administrators to improve the platform benchmark through memory mapping, perform administrative data load (AUTOLOAD, Unload, Reload-on-Start, and so on), monitor artifacts of distributed SAS® LASR™ Analytic Servers on co-located Hadoop Distributed File System (HDFS), optimize high-volume access via FullCopies, build customized FLEX themes, and so on. It showcases practical approaches to managing distributed SAS LASR Analytic Servers, offering guest access for global users, managing host accounts, enabling Mobile BI, using power-user reporting features, customizing formats, enabling home page customization, using best practices for environment migration, and much more.
Read the paper (PDF)
Ratul Saha, Kavi Associates
Vimal Raj Arockiasamy, Kavi Associates
Vignesh Balasubramanian, Kavi Global
F
Session SAS5601-2016:
Fitting Your Favorite Mixed Models with PROC MCMC
The popular MIXED, GLIMMIX, and NLMIXED procedures in SAS/STAT® software fit linear, generalized linear, and nonlinear mixed models, respectively. These procedures take the classical approach of maximizing the likelihood function to estimate the model parameters. The flexible MCMC procedure in SAS/STAT can fit these same models by taking a Bayesian approach. Instead of maximizing the likelihood function, PROC MCMC draws samples (using a variety of sampling algorithms) to approximate the posterior distributions of the model parameters. Similar to the mixed modeling procedures, PROC MCMC provides estimation, inference, and prediction. This paper describes how to use the MCMC procedure to fit Bayesian mixed models and compares the Bayesian approach to how the classical models would be fit with the familiar mixed modeling procedures. Several examples illustrate the approach in practice.
Read the paper (PDF)
Gordon Brown, SAS
Fang K. Chen, SAS
Maura Stokes, SAS
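A minimal Bayesian random-intercept model in PROC MCMC, analogous to what PROC MIXED would fit classically; the priors, data set, and variable names here are illustrative assumptions rather than the paper's examples.

proc mcmc data=work.growth nmc=20000 seed=27513 outpost=work.posterior;
   parms beta0 0 beta1 0;                         /* fixed effects            */
   parms s2_subj 1 s2_resid 1;                    /* variance components      */
   prior beta:  ~ normal(0, var=1e6);             /* diffuse priors           */
   prior s2_:   ~ igamma(shape=0.01, scale=0.01);
   random b0 ~ normal(0, var=s2_subj) subject=subject_id;   /* random intercept */
   mu = beta0 + b0 + beta1*time;
   model y ~ normal(mu, var=s2_resid);
run;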
H
Session 4342-2016:
High-Performance Data Access with FedSQL and DS2
SAS® Federated Query Language (FedSQL) is a SAS proprietary implementation of the ANSI SQL:1999 core standard capable of providing a scalable, threaded, high-performance way to access relational data in multiple data sources. This paper explores the FedSQL language in a three-step approach. First, we introduce the key features of the language and discuss the value each feature provides. Next, we present the FEDSQL procedure (the Base SAS® implementation of the language) and compare it to PROC SQL in terms of both syntax and performance. Finally, we examine how FedSQL queries can be embedded in DS2 programs to merge the power of these two high-performance languages.
Read the paper (PDF)
Shaun Kaufmann, Farm Credit Canada
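A minimal PROC FEDSQL query to contrast with PROC SQL; the table and column names are hypothetical. In DS2, a FedSQL query can likewise be embedded to feed a thread or data program.

proc fedsql;
   create table work.high_value as
   select customer_id, sum(amount) as total_amount
   from work.transactions
   group by customer_id
   having sum(amount) > 10000;                    /* ANSI SQL:1999 core syntax */
quit;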
Session SAS1800-2016:
Highly Customized Graphs Using ODS Graphics
You can use annotation, modify templates, and change dynamic variables to customize graphs in SAS®. Standard graph customization methods include template modification (which most people use to modify graphs that analytical procedures produce) and SG annotation (which most people use to modify graphs that procedures such as PROC SGPLOT produce). However, you can also use SG annotation to modify graphs that analytical procedures produce. You begin by using an analytical procedure, ODS Graphics, and the ODS OUTPUT statement to capture the data that go into the graph. You use the ODS DOCUMENT destination to capture the values that the procedure sets for the dynamic variables, which control many of the details of how the graph is created. You can modify the values of the dynamic variables, and you can modify graph and style templates. Then you can use PROC SGRENDER along with the ODS OUTPUT data set, the captured or modified dynamic variables, the modified templates, and SG annotation to create highly customized graphs. This paper shows you how and provides examples.
Read the paper (PDF) | Download the data file (ZIP)
Warren Kuhfeld, SAS
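A compressed sketch of the capture-and-rerender workflow described above. The graph object name, template path, and annotation values below are placeholders to verify against ODS TRACE output for your procedure; they are not exact names from the paper.

ods graphics on;
ods trace on;
ods output FitPlot=work.fitdata;        /* capture the data behind the graph */
proc reg data=sashelp.class;
   model weight = height;
run; quit;
ods trace off;

data work.anno;                         /* one SG annotation text item */
   function='text'; x1space='graphpercent'; y1space='graphpercent';
   x1=50; y1=96; label='Customized with SG annotation'; textsize=9; output;
run;

proc sgrender data=work.fitdata template=Stat.REG.Graphics.Fit sganno=work.anno;
   /* dynamic variables captured via ODS DOCUMENT would be supplied here
      with a DYNAMIC statement */
run;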
Session 9800-2016:
How to Visualize SAS® Data with JavaScript Libraries like HighCharts and D3
Have you ever wondered how to get the most from Web 2.0 technologies in order to visualize SAS® data? How to make those graphs dynamic, so that users can explore the data in a controlled way, without needing prior knowledge of SAS products or data science? Wonder no more! In this session, you learn how to turn basic sashelp.stocks data into a snazzy HighCharts stock chart in which a user can review any time period, zoom in and out, and export the graph as an image. All of these features are delivered with only two DATA steps and one SORT procedure, in just 57 lines of SAS code.
Download the data file (ZIP) | View the e-poster or slides (PDF)
Vasilij Nevlev, Analytium Ltd
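The general pattern is a DATA _NULL_ step that writes the chart's data series as JavaScript. The sketch below is not the author's 57-line program; the output path is hypothetical and the HighCharts page boilerplate is omitted.

proc sort data=sashelp.stocks out=work.stocks;
   by stock date;
run;

data _null_;
   set work.stocks(where=(stock='IBM')) end=last;
   file "/tmp/ibm_series.js";                        /* hypothetical output path */
   length point $ 64;
   if _n_ = 1 then put 'var ibmSeries = [';
   epoch_ms = (date - '01JAN1970'd) * 86400 * 1000;  /* SAS date to JS timestamp */
   point = cats('[', put(epoch_ms, 15.), ',', put(close, best12.), ']');
   if not last then point = cats(point, ',');
   put point;
   if last then put '];';
run;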
M
Session SAS6364-2016:
Macroeconomic Simulation Analysis for Multi-asset Class Portfolio Returns
Macroeconomic simulation analysis provides in-depth insights into a portfolio's performance spectrum. Conventionally, portfolio and risk managers obtain macroeconomic scenarios from third parties such as the Federal Reserve and determine portfolio performance under the provided scenarios. In this paper, we propose a technique to extend scenario analysis to an unconditional simulation capturing the distribution of possible macroeconomic climates and hence the true multivariate distribution of returns. We propose a methodology that adds to the existing scenario analysis tools and can be used to determine which types of macroeconomic climates have the most adverse outcomes for the portfolio. This provides a broader perspective on value-at-risk measures, thereby allowing more robust investment decisions. We explain the use of the VARMAX and COPULA procedures, together with SAS/IML®, in this analysis.
Read the paper (PDF)
Srikant Jayaraman, SAS
Joe Burdis, SAS Research and Development
Lokesh Nagar, SAS Research and Development
Session 10761-2016:
Medicare Fraud Analytics Using Cluster Analysis: How PROC FASTCLUS Can Refine the Identification of Peer Comparison Groups
Although limited to a small fraction of health care providers, the existence and magnitude of fraud in health insurance programs requires the use of fraud prevention and detection procedures. Data mining methods are used to uncover odd billing patterns in large databases of health claims history. Efficient fraud discovery can involve the preliminary step of deploying automated outlier detection techniques in order to classify identified outliers as potential fraud before an in-depth investigation. An essential component of the outlier detection procedure is the identification of proper peer comparison groups to classify providers as within-the-norm or outliers. This study refines the concept of a peer comparison group within the provider category and considers the possibility of distinct billing patterns associated with medical or surgical procedure codes identifiable by the Berenson-Eggers Type of Service (BETOS) system. The BETOS system covers all HCPCS (Healthcare Common Procedure Coding System) codes, assigns a HCPCS code to only one BETOS code, consists of readily understood clinical categories, and consists of categories that permit objective assignment (Centers for Medicare and Medicaid Services, CMS). The study focuses on the specialty General Practice and involves two steps: first, the identification of clusters of similar BETOS-based billing patterns; and second, the assessment of the effectiveness of these peer comparison groups in identifying outliers. The working data set is a sample of the 2012 summary data on physicians active in government health care programs, made publicly available by CMS through its website. The analysis uses PROC FASTCLUS and the SAS® cubic clustering criterion approach to find the optimal number of clusters in the data. It also uses PROC ROBUSTREG to implement a multivariate adaptive threshold outlier detection method.
Read the paper (PDF) | Download the data file (ZIP)
Paulo Macedo, Integrity Management Services
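An illustrative PROC FASTCLUS call for grouping providers by their BETOS-based billing mix; the data set, variable names, and number of clusters are hypothetical.

proc fastclus data=work.provider_betos_pct maxclusters=8 maxiter=100
              out=work.provider_clusters outstat=work.cluster_stats;
   var pct_eval_mgmt pct_imaging pct_tests pct_procedures;   /* billing-mix shares */
   id npi;                                                   /* provider identifier */
run;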
Session 11684-2016:
Multi-Tenancy in SAS®--Is It Worth the Fight?
At Royal Bank of Scotland, one of our key organizational design principles is to 'share everything we can share.' In essence, this promotes the cross-departmental sharing of platform services. Historically, this was never enforced on our Business Intelligence platforms like SAS®, resulting in a diverse technology estate, which presents challenges to our platform team for maintaining software currency, software versions, and overall quality of service. Currently, we have SAS® 8.2 and SAS® 9.1.3 on the mainframe, SAS® 9.2, SAS® 9.3, and SAS® 9.4 across our Windows and Linux servers, and SAS® 9.1.3 and SAS® 9.4 on PC across the bank. One of the benefits to running a multi-tenant SAS environment is removing the need to procure, install, and configure a new environment when a new department wants to use SAS. However, the process of configuring a secure multi-tenant environment, using the default tools and procedures, can still be very labor intensive. This paper explains how we analyzed the benefits of creating a shared Enterprise Business Intelligence platform in SAS alongside the risks and organizational barriers to the approach. Several considerations are presented as well as some insight into how we managed to convince our key stakeholders with the approach. We also look at the 'custom' processes and tools that RBS has implemented. Through this paper, we encourage other organizations to think about the various considerations we present to decide if sharing is right for their context to maximize the return on investment in SAS.
Read the paper (PDF)
Dileep Pournami, RBS
Christopher Blake, RBS
Ekaitz Goienola, SAS
Sergey Iglov, RBS
P
Session 1820-2016:
Portfolio Optimization with Discontinuous Constraint
Optimization models require continuous constraints to converge. However, some real-life problems are better described by models that incorporate discontinuous constraints. A common type of such discontinuous constraint becomes apparent when a regulation-mandated diversification requirement is implemented in an investment portfolio model. Generally stated, the requirement postulates that the aggregate of investments with individual weights exceeding a certain threshold in the portfolio should not exceed some predefined total within the portfolio. This format of the diversification requirement can be defined by the rules of any specific portfolio construction methodology and is commonly imposed by regulators. The paper discusses the impact of this type of discontinuous portfolio diversification constraint on the portfolio optimization model solution process and develops a convergent approach. The latter consists of a finite sequence of convergent non-linear optimization problems and is presented in the framework of the OPTMODEL procedure modeling environment. The approach discussed has been used in constructing investable equity indexes.
Read the paper (PDF) | Download the data file (ZIP) | View the e-poster or slides (PDF)
Taras Zlupko, University of Chicago
Robert Spatz, University of Chicago
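To make the discontinuous diversification rule concrete, here is a sketch in PROC OPTMODEL that linearizes it with binary indicators (threshold 5%, aggregate cap 40%, individual cap 10%); the data set, names, and numbers are illustrative and this is not the paper's convergent NLP approach.

proc optmodel;
   set <str> ASSETS;
   num expret{ASSETS};
   read data work.assets into ASSETS=[asset] expret;

   var w{ASSETS} >= 0 <= 0.10;           /* weights, individual cap 10%       */
   var z{ASSETS} binary;                 /* 1 if a weight exceeds 5%          */
   var s{ASSETS} >= 0;                   /* weight counted toward aggregate   */

   con Budget: sum{a in ASSETS} w[a] = 1;
   con Flag{a in ASSETS}:  w[a] - 0.05 <= 0.05*z[a];
   con Count{a in ASSETS}: s[a] >= w[a] - 0.10*(1 - z[a]);
   con Aggregate: sum{a in ASSETS} s[a] <= 0.40;

   max ExpReturn = sum{a in ASSETS} expret[a]*w[a];
   solve with milp;
   print w z;
quit;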
Session 11846-2016:
Predicting Human Activity Sensor Data Using an Auto Neural Model with Stepwise Logistic Regression Inputs
Due to advances in medical care and rising living standards, life expectancy in the US has increased to an average of 79 years. This has resulted in a growing aging population and increased demand for technologies that aid elderly people in living independently and safely. This challenge can be met through ambient-assisted living (AAL) technologies that help elderly people live more independently. Much research has been done on human activity recognition (HAR) in the last decade. This research work can be used in the development of assistive technologies, and HAR is expected to be the future technology for e-health systems. In this research, I discuss the need to predict human activity accurately by building various models in SAS® Enterprise Miner™ 14.1 on a free public data set that contains 165,633 observations and 19 attributes. The variables used in this research represent the metrics of accelerometers mounted on the waist, left thigh, right arm, and right ankle of four individuals performing five different activities, recorded over a period of eight hours. The target variable is the human activity performed: sitting, sitting down, standing, standing up, or walking. Upon comparing different models, the winner is an auto neural model whose inputs are taken from stepwise logistic regression, which is used for variable selection. The model has an accuracy of 98.73% and sensitivity of 98.42%.
Read the paper (PDF) | View the e-poster or slides (PDF)
Venkata Vennelakanti, Oklahoma State University
Session 7560-2016:
Processing CDC and SCD Type 2 for Sources without CDC: A Hybrid Approach
In a data warehousing system, change data capture (CDC) plays an important part not just in making the data warehouse (DWH) aware of changes but also in providing a means of flowing those changes to the DWH marts and reporting tables, so that we see the current and latest version of the truth. Together with slowly changing dimensions (SCD), this creates a cycle that runs the DWH, provides valuable insights into the history, and supports future decision-making. What if the source has no CDC? It would be an ETL nightmare to identify the exact change and report the absolute truth. If these two processes can be combined into a single process, where one transform does both jobs of identifying the change and applying it to the DWH, then we can save significant processing time and valuable system resources. Hence, I came up with a hybrid SCD-with-CDC approach. My paper focuses on sources that DO NOT have CDC and that need SCD Type 2 applied to such records without worrying about data duplication and increased processing times.
Read the paper (PDF) | Watch the recording
Vishant Bhat, University of Newcastle
Tony Blanch, SAS Consultant
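One common way to detect change when the source provides no CDC is to compare a digest of the tracked attributes against the current dimension rows. The sketch below only classifies rows; the paper's single transform also applies the SCD Type 2 close-out and insert. Table, key, and attribute names are hypothetical, and dim_customer is assumed to store a row_md5 column for its current rows.

proc sql;
   create table work.delta as
   select s.*,
          case when d.cust_key is null      then 'INSERT  '
               when s.row_md5 ne d.row_md5  then 'UPDATE  '
               else                              'NOCHANGE'
          end as change_type
   from (select se.*,
                put(md5(catx('|', name, address, segment)), $hex32.) as row_md5
         from work.source_extract as se) as s
   left join work.dim_customer(where=(current_flag='Y')) as d
        on s.cust_key = d.cust_key;
quit;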
R
Session 8240-2016:
Real-Time Predictive Modeling of Key Quality Characteristics Using Regularized Regression: SAS® Procedures GLMSELECT and LASSO
This paper describes the merging of designed experiments and regularized regression to find the significant factors to use in a real-time predictive model. The real-time predictive model is used in a Weyerhaeuser modified fiber mill to estimate the value of several key quality characteristics. The modified fiber product is accepted or rejected based on the model prediction or the predicted values' deviation from lab tests. To develop the model, a designed experiment was needed at the mill. The experiment was planned by a team of engineers, managers, operators, lab techs, and a statistician. The data analysis used the actual values of the process variables manipulated in the designed experiment, because there were instances of the set point not being achieved or maintained for the duration of the run. The lab tests of the key quality characteristics were used as the responses. The experiment was designed with JMP® 64-bit Edition 11.2.0 and estimated the two-way interactions of the process variables. One of the process variables was thought to be curvilinear, and this effect was also made estimable. The LASSO method, as implemented in the SAS® GLMSELECT procedure, was used to analyze the data after cleaning and validating the results. Cross validation was used to choose the final model. The resulting prediction model passed all the predetermined success criteria and is currently being used in real time in the mill for product quality acceptance.
Read the paper (PDF)
Jon Lindenauer, Weyerhaeuser
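A minimal PROC GLMSELECT sketch of LASSO selection over main effects, two-way interactions, and a quadratic term, with cross validation choosing the final model; the response and factor names stand in for the mill's process variables.

proc glmselect data=work.doe_runs seed=20160418 plots=coefficients;
   model strength = temp pressure flow
                    temp*pressure temp*flow pressure*flow
                    temp*temp
         / selection=lasso(choose=cv stop=none) cvmethod=random(5);
run;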
Session 11660-2016:
Redesigning Control Using the GMATCH Algorithm to Isolate Impact of a Specific Marketing Intervention from Overlapping Solicitations
The success of any marketing promotion is measured by the incremental response and revenue generated by the targeted population known as Test in comparison with the holdout sample known as Control. An unbiased random Test and Control sampling ensures that the incremental revenue is in fact driven by the marketing intervention. However, isolating the true incremental effect of any particular marketing intervention becomes increasingly challenging in the face of overlapping marketing solicitations. This paper demonstrates how a look-alike model can be applied using the GMATCH algorithm on a SAS® platform to design a truly comparable control group to accurately measure and isolate the impact of a specific marketing intervention.
Read the paper (PDF) | Download the data file (ZIP)
Mou Dutta, Genpact LLC
Arjun Natarajan, Genpact LLC
T
Session 11340-2016:
The SAS® Test from H*ll: How Well do You REALLY know SAS?
This SAS® test begins where the SAS® Advanced Certification test leaves off. It has 25 questions to challenge even the most experienced SAS programmers. Most questions are about the DATA step, but there are a few macro and SAS/ACCESS® software questions thrown in. The session presents each question on a slide, then the answer on the next. You WILL be challenged.
View the e-poster or slides (PDF)
Glen Becker, USAA
Session SAS6403-2016:
Tips and Strategies for Mixed Modeling with SAS/STAT® Procedures
Inherently, mixed modeling with SAS/STAT® procedures (such as GLIMMIX, MIXED, and NLMIXED) is computationally intensive. Therefore, considerable memory and CPU time can be required. The default algorithms in these procedures might fail to converge for some data sets and models. This encore presentation of a paper from SAS Global Forum 2012 provides recommendations for circumventing memory problems and reducing execution times for your mixed-modeling analyses. This paper also shows how the new HPMIXED procedure can be beneficial for certain situations, as with large sparse mixed models. Lastly, the discussion focuses on the best way to interpret and address common notes, warnings, and error messages that can occur with the estimation of mixed models in SAS® software.
Read the paper (PDF) | Watch the recording
Phil Gibbs, SAS
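For the large, sparse case the paper mentions, PROC HPMIXED can be dropped in where PROC MIXED struggles for memory; a minimal sketch with hypothetical data set and variable names follows.

proc hpmixed data=work.field_trials;
   class variety location block;
   model yield = variety;
   random location location*variety block(location);   /* sparse random effects */
   test variety;
run;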
U
Session 1680-2016:
Using GENMOD to Analyze Correlated Data on Military Health System Beneficiaries Receiving Behavioral Health Care in South Carolina Health Care System
Many SAS® procedures can be used to analyze large amounts of correlated data. This study was a secondary analysis of data obtained from the South Carolina Revenue and Fiscal Affairs Office (RFA). The data includes medical claims from all health care systems in South Carolina (SC). This study used the GENMOD procedure to analyze a large amount of correlated data about Military Health System (MHS) beneficiaries who received behavioral health care in South Carolina health care systems from 2005 to 2014. Behavioral health (BH) was defined by Major Diagnostic Category (MDC) 19 (mental disorders and diseases) and 20 (alcohol/drug use). MDCs are formed by dividing all possible principal diagnoses from the International Classification of Diseases (ICD-9) codes into 25 mutually exclusive diagnostic categories. The sample included a total of 6,783 BH visits and 4,827 unique adult and child patients, including military service members, veterans, and their adult and child dependents who have MHS insurance coverage. PROC GENMOD fit a multivariate GEE model with type of BH visit (mental health or substance abuse) as the dependent variable and gender, race group, age group, and discharge year as predictors. Hospital ID was used in the REPEATED statement with different correlation structures. Gender was significant with both the independent correlation (p = .0001) and the exchangeable structure (p = .0003). However, age group was significant using the independent correlation (p = .0160) but non-significant using the exchangeable correlation structure (p = .0584). SAS is a powerful statistical program for analyzing large amounts of correlated data with categorical outcomes.
Read the paper (PDF) | View the e-poster or slides (PDF)
Abbas Tavakoli, University of South Carolina
Jordan Brittingham, USC/Arnold School of Public Health
Nikki R. Wooten, USC/College of Social Work
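A sketch of the GEE specification described above; the data set and variable names paraphrase the abstract and are not the authors' exact code.

proc genmod data=work.bh_visits descending;
   class hospital_id gender race_group age_group discharge_year;
   model visit_type = gender race_group age_group discharge_year
         / dist=binomial link=logit type3;
   repeated subject=hospital_id / type=exch corrw;   /* exchangeable working correlation */
run;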
Session SAS6660-2016:
Using Metadata Queries To Build Row-Level Audit Reports in SAS® Visual Analytics
Sensitive data requires elevated security requirements and the flexibility to apply logic that subsets data based on user privileges. Following the instructions in SAS® Visual Analytics: Administration Guide gives you the ability to apply row-level permission conditions. After you have set the permissions, you have to prove through audits who has access and what row-level security is applied. This paper gives you the ability to easily apply, validate, report on, and audit all tables that have row-level permissions, along with the groups, users, and conditions that are applied. Take the hours of maintenance and lack of visibility out of row-level secure data, and build confidence in the data and analytics that are provided to the enterprise.
Read the paper (PDF) | Download the data file (ZIP)
Brandon Kirk, SAS
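A starting-point sketch for enumerating tables with the metadata DATA step functions. The query string follows the documented PhysicalTable pattern, but the associations you would traverse to reach permission conditions depend on your metadata model, so treat this as an assumption-laden skeleton rather than the paper's method. It assumes metadata connection options (METASERVER=, METAUSER=, and so on) are already set.

data work.va_tables;
   length uri name $ 256;
   keep name uri;
   /* count PhysicalTable objects, then walk them one by one */
   nobj = metadata_getnobj("omsobj:PhysicalTable?@Id contains '.'", 1, uri);
   do i = 1 to nobj;
      rc = metadata_getnobj("omsobj:PhysicalTable?@Id contains '.'", i, uri);
      rc = metadata_getattr(uri, "Name", name);
      output;
   end;
run;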
Session 10741-2016:
Using SAS® Comments to Run SAS Code in Parallel
Our daily work in SAS® involves the manipulation of many independent data sets, and a lot of time can be saved if independent data sets can be manipulated simultaneously. This paper presents our interface RunParallel, which opens multiple SAS sessions and controls which SAS procedures and DATA steps run on which sessions by parsing comments such as /*EP.SINGLE*/ and /*EP.END*/. The user can easily parallelize any code by simply wrapping procedure steps and DATA steps in such comments and executing in RunParallel. The original structure of the SAS code is preserved so that it can be developed and run in serial regardless of the RunParallel comments. When applied in SAS programs with many external data sources and heavy computations, RunParallel can give major performance boosts. Among our examples we include a simulation that demonstrates how to run DATA steps in parallel, where the performance gain greatly outweighs the single minute it takes to add RunParallel comments to the code. In a world full of big data, a lot of time can be saved by running in parallel in a comprehensive way.
Read the paper (PDF)
Jingyu She, Danica Pension
Tomislav Kajinic, Danica Pension
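RunParallel itself is the authors' tool, but the underlying idea of driving several SAS sessions from one controlling session can be illustrated generically with SYSTASK; the program paths are hypothetical, and XCMD must be allowed.

/* launch two independent programs in their own SAS sessions */
systask command "sas /project/etl_part1.sas -log /project/etl_part1.log" taskname=t1;
systask command "sas /project/etl_part2.sas -log /project/etl_part2.log" taskname=t2;

/* block the controlling session until both finish */
waitfor _all_ t1 t2;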
V
Session 8140-2016:
Virtual Accessing of a SAS® Data Set Using OPEN, FETCH, and CLOSE Functions with %SYSFUNC and %DO Loops
One of the truths about SAS® is that there are, at a minimum, three approaches to achieve any intended task, and each approach has its own pros and cons. Identifying and using efficient SAS programming techniques is recommended, and efficient programming becomes mandatory when larger data sets are accessed. This paper describes the efficiency of virtual access to data sets and various situations in which to use it, based on the OPEN, FETCH, and CLOSE functions with %SYSFUNC and %DO loops. There are several ways to get data set attributes such as the number of observations, the variables, the types of variables, and so on; it becomes more efficient to access larger data sets using the OPEN function with %SYSFUNC. The FETCH function, used with OPEN, gives the ability to access the values of the variables from a SAS data set for conditional and iterative execution with %DO loops. In the following situations, virtual access of the SAS data set becomes more efficient on larger data sets. Situation 1: It is often required to split the master data set into several pieces depending upon a certain criterion for batch submission, as it is required that the data sets be independent. Situation 2: For many reports and dashboards, and for array processing, it is required to keep the relevant variables together instead of maintaining the original data set order. Situation 3: For most statistical analyses, particularly correlation and regression, a widely and frequently used SAS procedure requires a dynamic variable list to be passed. Creating a single macro variable list might run into an issue with the macro variable length limit on larger transactional data sets. It is recommended that you prepare the list of variables as a data set and access them using the proposed approach in many places, instead of creating several macro variables.
Read the paper (PDF) | View the e-poster or slides (PDF)
Amarnath Vijayarangan, Emmes Services Pvt Ltd, India
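The core pattern the paper describes looks roughly like the macro below, which builds a variable list from a data set of names using OPEN, FETCH, GETVARC, and CLOSE with %SYSFUNC; the macro, data set, and variable names are illustrative, not code from the paper.

%macro var_list(ds, namevar);
   /* Return a space-separated list of the values of &namevar in data set &ds. */
   %local dsid nobs varnum i rc list;
   %let dsid = %sysfunc(open(&ds, i));
   %if &dsid %then %do;
      %let nobs   = %sysfunc(attrn(&dsid, nlobs));
      %let varnum = %sysfunc(varnum(&dsid, &namevar));
      %do i = 1 %to &nobs;
         %let rc   = %sysfunc(fetch(&dsid));            /* read the next observation */
         %let list = &list %sysfunc(getvarc(&dsid, &varnum));
      %end;
      %let rc = %sysfunc(close(&dsid));
   %end;
   %else %put %sysfunc(sysmsg());
   &list
%mend var_list;

/* Example: pass a variable list stored in a data set to a procedure. */
%put Variables: %var_list(work.model_vars, varname);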