SAS Global Forum 2016 Proceedings

Thousands of Jamaican citizens currently lack access to basic health care. To help improve the health-care system, I collect and analyze data from two clinics in remote locations of the island. This report analyzes data collected from Clarendon Parish, Jamaica. To create a descriptive analysis, I use SAS® Studio 9.4. The procedures I use include PROC IMPORT, PROC MEANS, PROC FREQ, and PROC GCHART. With these procedures, I am able to produce a descriptive analysis of the health issues plaguing the island.


In health sciences and biomedical research, we often search for scientific journal abstracts and publications within MEDLINE, which is a suite of indexed databases developed and maintained by the National Center for Biotechnology Information (NCBI) at the United States National Library of Medicine (NLM). PubMed is a free search engine within MEDLINE and has become one of the standard databases to search for scientific abstracts. Entrez is the information retrieval system that gives you direct access to the 40 databases, with over 1.3 billion records, within the NCBI. You can access these records by using the eight e-utilities (einfo, esearch, esummary, efetch, elink, espell, epost, and egquery), which are the NCBI application programming interfaces (APIs). In this paper, I focus on using three of the e-utilities to retrieve data from PubMed, as opposed to the standard method of manually searching and exporting the information from the PubMed website. I demonstrate how to use the SAS HTTP procedure along with the XML mapper to develop a SAS® macro that generates all the required data from PubMed based on search term parameters. Using SAS to extract information from PubMed searches and save it directly into a data set allows the process to be automated (which is useful for running routine updates) and eliminates manual searching and exporting from PubMed.
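To make the retrieval step concrete, here is a minimal sketch in Python (not the SAS macro the paper describes) of how an esearch request URL can be built and how the PubMed IDs can be read from the XML that esearch returns. The function names and the sample response are hypothetical and for illustration only; the sketch makes no network call.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Base address of the NCBI e-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(term, retmax=20):
    """Build an esearch URL that searches PubMed for `term` (hypothetical helper)."""
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def parse_id_list(esearch_xml):
    """Return the PubMed IDs listed in an esearch XML response."""
    root = ET.fromstring(esearch_xml)
    return [id_node.text for id_node in root.findall("./IdList/Id")]


# Hand-written sample in the shape of an esearch response; the IDs are made up.
SAMPLE_RESPONSE = """<eSearchResult>
  <Count>2</Count>
  <IdList><Id>11111111</Id><Id>22222222</Id></IdList>
</eSearchResult>"""

url = build_esearch_url("breast cancer AND fine needle aspiration", retmax=5)
pmids = parse_id_list(SAMPLE_RESPONSE)
print(url)
print(pmids)
```

In an automated workflow, the returned IDs would typically be passed to a second e-utility such as efetch to retrieve the abstracts themselves.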


In an effort to increase transparency and accountability in the US health care system, the Obama administration mandated that the Centers for Medicare & Medicaid Services (CMS) make data available for use by researchers and interested parties from the general public. Among the more well-known uses of this data are analyses published by the Wall Street Journal showing a large and, in some cases, shocking discrepancy between what hospitals potentially charge the uninsured and what they are paid by Medicare for the same procedure. Analyses such as these highlight both potential inequities in the US health care system and, more importantly, potential opportunities for its reform. However, while they capture the public imagination, such analyses are but one means to capitalize on the remarkable wealth of information this data provides. Specifically, the publicly distributed CMS data can help both researchers and the public better understand the burden that specific conditions and medical treatments place on the US health care system. It was this simple but important objective that motivated the present study. Our analyses focus on two questions that we believe to be important. First, using the total number of hospital discharges as a proxy for the incidence of a condition or treatment, which conditions or treatments have the highest incidence rates nationally? Does their incidence remain stable, or is it increasing or decreasing? And is there variability in these incidence rates across states? Second, as psychologists, we are necessarily interested in understanding the state of mental health care. To the best of our knowledge, no study to date has used the public inpatient Medicare provider utilization and payment data set to explore the utilization of mental illness services funded by Medicare.


Hospital Episode Statistics (HES) is a data warehouse that contains records of all admissions, outpatient appointments, and accident and emergency (A&E) attendances at National Health Service (NHS) hospitals in England. Each year it processes over 125 million admitted patient, outpatient, and A&E records. Such a large data set offers extensive research opportunities for researchers and health-care professionals. However, patient care data is complex and can be difficult to manage. This paper demonstrates the flexibility and power of SAS® programming tools such as the DATA step, the SQL procedure, and macros in analyzing HES data.


Breast cancer is the second leading cause of cancer deaths among women in the United States. Although mortality rates have been decreasing over the past decade, it is important to continue to make advances in diagnostic procedures, as early detection vastly improves chances for survival. The goal of this study is to accurately predict the presence of a malignant tumor using data from fine needle aspiration (FNA) with visual interpretation. Compared with other methods of diagnosis, FNA displays the highest likelihood for improvement in sensitivity. Furthermore, this study aims to identify the variables most closely associated with accurate outcome prediction. The study uses the Wisconsin Breast Cancer data available within the UCI Machine Learning Repository. The data set contains 699 clinical case samples (65.52% benign and 34.48% malignant) assessing the nuclear features of fine needle aspirates taken from patients' breasts. The study analyzes a variety of traditional and modern models, including logistic regression, decision tree, neural network, support vector machine, gradient boosting, and random forest. Prior to model building, the weights of evidence (WOE) approach was used to account for the high dimensionality of the categorical variables, after which variable selection methods were employed. Ultimately, the gradient boosting model using a principal component variable reduction method was selected as the best prediction model, with a 2.4% misclassification rate, 100% sensitivity, 0.963 Kolmogorov-Smirnov statistic, 0.985 Gini coefficient, and 0.992 ROC index for the validation data. Additionally, the uniformity of cell shape and size, bare nuclei, and bland chromatin were consistently identified as the most important FNA characteristics across variable selection methods. These results suggest that future research should attempt to refine the techniques used to determine these specific model inputs. Greater accuracy in characterizing the FNA attributes will allow researchers to develop more promising models for early detection.


While cardiac revascularization procedures such as cardiac catheterization, percutaneous transluminal angioplasty, and coronary artery bypass surgery have become standard practices in restorative cardiology, these procedures are not evenly prescribed or subscribed to. We analyzed Florida hospital discharge records for the period 1992 to 2010 to determine the odds of receipt of any of these procedures by Hispanics and non-Hispanic Whites. Covariates (potential confounders) were age, insurance type, gender, and year of discharge. Additional covariates considered included comorbidities such as hypertension, diabetes, obesity, and depression. The results indicated that, even after adjusting for covariates, Hispanics in Florida during the period 1992 to 2010 were consistently less likely to receive these procedures than their White counterparts. Reasons for this phenomenon are discussed.


There is a growing recognition that mental health is a vital public health and development issue worldwide. Numerous studies have reported close interactions between poverty and mental illness. The association between mental illness and poverty is cyclic and negative. Impoverished people are generally more prone to mental illness and are less able to afford treatment. Likewise, people with mental health problems are more likely to be in poverty. The availability of free mental health treatment to people in poverty is critical to breaking this vicious cycle. Based on this hypothesis, a model was developed from the responses provided by mental health facilities to a federally supported survey. We examined whether we could predict which mental health facilities in the United States offer free treatment to patients who cannot afford treatment costs. About a third of the 9,076 mental health facilities that responded to the survey stated that they offer free treatment to patients incapable of paying. Different machine learning algorithms and regression models were assessed to predict correctly which facilities would offer treatment at no cost. Using a neural network model in conjunction with a decision tree for input variable selection, we found that the best performing model can predict the mental health facilities with an overall accuracy of 71.49%. The sensitivity and specificity of the selected neural network model were 82.96% and 56.57%, respectively. The five most important covariates that explained the model's predictive power are ownership of mental health facilities, whether facilities provide a sliding fee scale, the type of facility, whether facilities provide mental health treatment service in Spanish, and whether facilities offer smoking cessation services. More detailed spatial and descriptive analysis of those key variables will be conducted.
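For readers less familiar with the three performance measures quoted above, the following minimal Python sketch shows how accuracy, sensitivity, and specificity are derived from a confusion matrix. The counts are hypothetical round numbers chosen for easy arithmetic; they are not the study's data, and the function name is an assumption made for illustration.

```python
def classification_metrics(tp, fn, tn, fp):
    """Compute standard metrics from confusion-matrix counts.

    tp/fn: facilities that offer free treatment, predicted correctly/incorrectly.
    tn/fp: facilities that do not, predicted correctly/incorrectly.
    """
    sensitivity = tp / (tp + fn)   # share of true positives that were found
    specificity = tn / (tn + fp)   # share of true negatives that were found
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return {"accuracy": accuracy,
            "sensitivity": sensitivity,
            "specificity": specificity}


# Hypothetical counts, NOT taken from the paper.
metrics = classification_metrics(tp=80, fn=20, tn=60, fp=40)
print(metrics)
```

A model with high sensitivity but lower specificity, as reported in the abstract, finds most facilities that offer free treatment at the cost of more false alarms among those that do not.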


Disease prevalence is one of the most basic measures of the burden of disease in the field of epidemiology. As an estimate of the total number of cases of disease in a given population, prevalence is a standard in public health analysis. The prevalence of diseases in a given area is also frequently at the core of governmental policy decisions, charitable organization funding initiatives, and countless other aspects of everyday life. However, all too often, prevalence estimates are restricted to descriptive estimates of population characteristics when they could have a much wider application through the use of inferential statistics. As an estimate based on a sample from a population, disease prevalence can vary because of random fluctuations in that sample rather than true differences in the population characteristic. Statistical inference uses a known distribution of this sampling variation to perform hypothesis tests, calculate confidence intervals, and perform other advanced statistical methods. However, there is no agreed-upon sampling distribution of the prevalence estimate. In cases where the sampling distribution of an estimate is unknown, statisticians frequently rely on the bootstrap resampling procedure first described by Efron in 1979. This procedure relies on the computational power of software to generate repeated pseudo-samples similar in structure to an original, real data set. These multiple samples allow for the construction of confidence intervals and statistical tests to make statistical determinations and comparisons using the estimated prevalence. In this paper, we use the bootstrapping capabilities of SAS® 9.4 to test statistically the difference between two given prevalence rates. We create a bootstrap analog to the two-sample t test to compare prevalence rates from two states, even though the sampling distribution of these estimates is unknown.
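The resampling idea can be sketched in a few lines of Python. This is not the paper's SAS implementation or its exact test: it is a simple percentile-bootstrap confidence interval for the difference between two prevalence estimates, and the function name, the case counts, and the sample sizes are all hypothetical.

```python
import random


def bootstrap_prevalence_diff(cases_a, n_a, cases_b, n_b, n_boot=1000, seed=0):
    """Percentile-bootstrap 95% interval for prevalence_a - prevalence_b."""
    rng = random.Random(seed)
    # Rebuild each sample as a list of 1s (cases) and 0s (non-cases).
    sample_a = [1] * cases_a + [0] * (n_a - cases_a)
    sample_b = [1] * cases_b + [0] * (n_b - cases_b)

    diffs = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping the original size.
        prev_a = sum(rng.choices(sample_a, k=n_a)) / n_a
        prev_b = sum(rng.choices(sample_b, k=n_b)) / n_b
        diffs.append(prev_a - prev_b)

    diffs.sort()
    lower = diffs[int(0.025 * n_boot)]
    upper = diffs[int(0.975 * n_boot) - 1]
    observed = cases_a / n_a - cases_b / n_b
    return observed, (lower, upper)


# Hypothetical survey counts for two states, NOT taken from the paper.
observed, (lower, upper) = bootstrap_prevalence_diff(120, 1000, 80, 1000)
print(observed, lower, upper)
```

Under this approach, an interval that excludes zero suggests that the two prevalence rates differ by more than sampling variation alone would explain.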


The increasing popularity and affordability of wearable devices, together with their ability to provide granular physical activity data down to the minute, have enabled researchers to conduct advanced studies on the effects of physical activity on health and disease. This presents statistical programmers with the challenge of processing the data and translating it into analyzable measures. One such measure is the number of time-specific bouts of moderate to vigorous physical activity (MVPA), which is similar to exercise and is needed to determine whether a participant meets current physical activity guidelines (for example, 150 minutes of MVPA per week performed in bouts of at least 20 minutes). In this paper, we illustrate how we used SAS® arrays to calculate the number of 20-minute bouts of MVPA per day. We provide working code showing how we processed Fitbit Flex data from 63 healthy volunteers whose physical activities were monitored daily for a period of 12 months.
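The bout calculation can be illustrated with a short Python sketch. It is not the paper's SAS array code, and it rests on assumptions the abstract does not state: each day is represented as a list of per-minute flags, a bout is any uninterrupted run of at least 20 MVPA minutes, and a longer run still counts as a single bout. The function name and the sample day are hypothetical.

```python
def count_mvpa_bouts(minute_is_mvpa, min_bout_minutes=20):
    """Count uninterrupted MVPA runs lasting at least `min_bout_minutes`."""
    bouts = 0
    run_length = 0
    for active in minute_is_mvpa:
        if active:
            run_length += 1
            # Count the run once, at the moment it reaches the threshold.
            if run_length == min_bout_minutes:
                bouts += 1
        else:
            run_length = 0  # any non-MVPA minute ends the current run
    return bouts


# Hypothetical day: a 25-minute run, a 19-minute run, and a 20-minute run.
day = ([False] * 10 + [True] * 25 + [False] * 5 +
       [True] * 19 + [False] * 3 + [True] * 20)
bouts_today = count_mvpa_bouts(day)
print(bouts_today)  # prints 2: the 19-minute run falls short of the threshold
```

Real protocols sometimes tolerate a minute or two of interruption within a bout; handling that would require a small extension to the loop.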