Although the statistical foundations of predictive analytics largely overlap, the business objectives, data availability, and regulations differ across the property and casualty insurance, life insurance, banking, pharmaceutical, and genetics industries. A common process used in property and casualty insurance companies with large data sets is introduced, including data acquisition, data preparation, variable creation, variable selection, model building (also known as fitting), model validation, and model testing. The variable selection and model validation stages are described in more detail. Some successful models in insurance companies are introduced. Base SAS®, SAS® Enterprise Guide®, and SAS® Enterprise Miner™ are presented as the main tools for this process.
Mei Najim, Sedgwick
Government funding is a highly sought-after financing medium for entrepreneurs and researchers aspiring to make a breakthrough in the market and compete with larger organizations. The funding is channeled via federal agencies that seek out promising research and products that improve on existing work. This study uses a text analytics approach to analyze the project abstracts that earned government funding over the past three and a half decades in the fields of Defense and Health and Human Services. This helps us understand the factors that might be responsible for awards made to a project. We collected 100,279 records from the Small Business Innovation Research (SBIR) website (https://www.sbir.gov). Initially, we analyze the trends in government funding of research by small businesses. Then we perform text mining on the 55,791 abstracts of the projects funded by the Department of Defense and the Department of Health and Human Services. From the text mining, we identify the key topics and patterns related to past research and determine the changes in their trends over time. Through our study, we highlight the research trends to enable organizations to shape their business strategies and make better investments in future research.
Sairam Tanguturi, Oklahoma State University
Sagar Rudrake, Oklahoma State University
Nithish Reddy Yeduguri, Oklahoma State University
This paper presents supervised and unsupervised pattern recognition techniques that use Base SAS® and SAS® Enterprise Miner™ software. A simple preprocessing technique creates many small image patches from larger images. These patches encourage the learned patterns to have local scale, which follows well-known statistical properties of natural images. In addition, these patches reduce the number of features that are required to represent an image and can decrease the training time that algorithms need in order to learn from the images. If a training label is available, a classifier is trained to identify patches of interest. In the unsupervised case, a stacked autoencoder network is used to generate a dictionary of representative patches, which can be used to locate areas of interest in new images. This technique can be applied to pattern recognition problems in general, and this paper presents examples from the oil and gas industry and from a solar power forecasting application.
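The patch-extraction preprocessing described above is easy to sketch outside of SAS. The following Python snippet is an illustrative sketch only (the paper's implementation uses Base SAS and SAS Enterprise Miner; the toy 6x6 image and the patch size are invented for the example). It slides a window over a larger image and collects small flattened patches:

```python
import numpy as np

def extract_patches(image, patch_size, stride):
    """Slide a window over a 2-D image and collect flattened patches."""
    h, w = image.shape
    patches = []
    for i in range(0, h - patch_size + 1, stride):
        for j in range(0, w - patch_size + 1, stride):
            patches.append(image[i:i + patch_size, j:j + patch_size].ravel())
    return np.array(patches)

img = np.arange(36, dtype=float).reshape(6, 6)   # toy 6x6 "image"
patches = extract_patches(img, patch_size=3, stride=3)
print(patches.shape)   # 4 non-overlapping 3x3 patches, 9 features each
```

Each image is thus represented by many 9-feature rows instead of one 36-feature row, which is the dimensionality reduction the abstract describes.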
Patrick Hall, SAS
Alex Chien, SAS Institute
Ilknur Kabul, SAS Institute
Jorge Silva, SAS Institute
Breast cancer is the second leading cause of cancer deaths among women in the United States. Although mortality rates have been decreasing over the past decade, it is important to continue to make advances in diagnostic procedures as early detection vastly improves chances for survival. The goal of this study is to accurately predict the presence of a malignant tumor using data from fine needle aspiration (FNA) with visual interpretation. Compared with other methods of diagnosis, FNA displays the highest likelihood for improvement in sensitivity. Furthermore, this study aims to identify the variables most closely associated with accurate outcome prediction. The study utilizes the Wisconsin Breast Cancer data available within the UCI Machine Learning Repository. The data set contains 699 clinical case samples (65.52% benign and 34.48% malignant) assessing the nuclear features of fine needle aspirates taken from patients' breasts. The study analyzes a variety of traditional and modern models, including: logistic regression, decision tree, neural network, support vector machine, gradient boosting, and random forest. Prior to model building, the weights of evidence (WOE) approach was used to account for the high dimensionality of the categorical variables after which variable selection methods were employed. Ultimately, the gradient boosting model utilizing a principal component variable reduction method was selected as the best prediction model with a 2.4% misclassification rate, 100% sensitivity, 0.963 Kolmogorov-Smirnov statistic, 0.985 Gini coefficient, and 0.992 ROC index for the validation data. Additionally, the uniformity of cell shape and size, bare nuclei, and bland chromatin were consistently identified as the most important FNA characteristics across variable selection methods. These results suggest that future research should attempt to refine the techniques used to determine these specific model inputs. 
Greater accuracy in characterizing the FNA attributes will allow researchers to develop more promising models for early detection.
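The weights of evidence (WOE) approach mentioned above replaces each level of a categorical variable with the log ratio of that level's share of events to its share of non-events. A minimal Python sketch (illustrative only; the study used SAS® Enterprise Miner™, and the tiny levels/targets data here are invented):

```python
import math

def woe_table(levels, targets):
    """Weight of evidence per level: ln(share of events / share of non-events)."""
    tot_event = sum(targets)
    tot_non = len(targets) - tot_event
    counts = {}
    for lv, y in zip(levels, targets):
        e, n = counts.get(lv, (0, 0))
        counts[lv] = (e + y, n + (1 - y))
    return {lv: math.log((e / tot_event) / (n / tot_non))
            for lv, (e, n) in counts.items()}

levels  = ["A", "A", "A", "B", "B", "B"]   # hypothetical categorical inputs
targets = [1,   1,   0,   1,   0,   0]     # 1 = event (e.g., malignant)
tbl = woe_table(levels, targets)
print(tbl)
```

A level concentrated among events gets a positive WOE, one concentrated among non-events a negative WOE, turning the categorical variable into a single numeric input.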
Josephine Akosa, Oklahoma State University
Using smart clothing with integrated wearable medical sensors to keep track of human health is now attracting many researchers. However, body movement caused by daily human activities introduces artificial noise into physiological data signals, which affects the output of a health monitoring/alert system. To overcome this problem, recognizing human activities, determining the relationship between activities and physiological signals, and removing noise from the collected signals are essential steps. This paper focuses on the first step, human activity recognition. Our research shows that no other study has used SAS® for classifying human activities. For this study, two data sets were collected from an open repository. Both data sets have 561 input variables and one nominal target variable with four levels. Principal component analysis, along with other variable reduction and selection techniques, was applied to reduce dimensionality in the input space. Several modeling techniques with different optimization parameters were used to classify human activity. The gradient boosting model was selected as the best model based on a test misclassification rate of 0.1233; that is, 87.67% of total events were classified correctly.
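Principal component analysis of the kind described above projects many correlated inputs onto a few orthogonal components. The sketch below (illustrative only; the random data, the redundant column, and the 95% variance threshold are assumptions for the example, not the study's settings) shows the standard centered-SVD construction:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
X[:, 1] = X[:, 0] * 2.0          # redundant column: perfectly correlated
Xc = X - X.mean(axis=0)          # center before decomposition

# PCA via SVD of the centered matrix
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
var_ratio = s**2 / np.sum(s**2)
k = int(np.searchsorted(np.cumsum(var_ratio), 0.95) + 1)  # components for 95% variance
scores = Xc @ Vt[:k].T           # reduced inputs for downstream modeling
print(k, scores.shape)
```

Because one column is a linear copy of another, at most seven components are needed, mirroring how redundancy among the 561 sensor variables lets a much smaller input space carry the signal.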
Minh Pham, Oklahoma State University
Mary Ruppert-Stroescu, Oklahoma State University
Mostakim Tanjil, Oklahoma State University
In light of past indiscretions by police departments that led to riots in major U.S. cities, it is important to assess the factors leading to the arrest of a citizen. A police department should understand the arrest situation before it can make any decision and defuse any possibility of a riot or similar incident. Many such incidents in the past resulted from police departments failing to make the right decisions in emergencies. The primary objectives of this study are to understand the key factors that impact the arrest of people in New York City and to examine the various crimes that have taken place in the region in the last two years. The study explores the different regions of New York City where the crimes took place and the timing of the incidents leading to them. The data set consists of 111 variables and 273,430 observations from 2013 to 2014, with a binary target variable, arrest made. The percentages of Yes and No incidents of arrest made are 6% and 94%, respectively. This study analyzes the reasons for arrest, which include suspicion, timing of frisk activities, identification of threats, basis of search, whether the arrest required physical intervention, location of the arrest, and whether the frisk officer needed any support. Decision tree, regression, and neural network models were built, and the decision tree turned out to be the best model, with a validation misclassification rate of 6.47% and model sensitivity of 73.17%. The results from the decision tree model showed that suspicion count, search basis count, summons leading to violence, ethnicity, and use of force are some of the important factors influencing the arrest of a person.
Karan Rudra, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Maitreya Kadiyala, Oklahoma State University
Building representative machine learning models that generalize well on future data requires careful consideration both of the data at hand and of assumptions about the various available training algorithms. Data are rarely in an ideal form that enables algorithms to train effectively. Some algorithms are designed to account for important considerations such as variable selection and handling of missing values, whereas other algorithms require additional preprocessing of the data or appropriate tweaking of the algorithm options. Ultimate evaluation of a model's quality requires appropriate selection and interpretation of an assessment criterion that is meaningful for the given problem. This paper discusses the most common issues faced by machine learning practitioners and provides guidance for using these powerful algorithms to build effective models.
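One pitfall the discussion above points to is selecting an assessment criterion that is not meaningful for the problem. A small illustration (the 94/6 class split is hypothetical): on imbalanced data, raw accuracy rewards a model that never predicts the rare class, so a criterion such as sensitivity must be checked alongside it:

```python
# With 94% of cases in one class, a model that always predicts the majority
# class scores 94% accuracy yet finds none of the events of interest.
actual = [0] * 94 + [1] * 6          # imbalanced binary target
majority = [0] * 100                 # "always predict the majority class"

accuracy = sum(a == p for a, p in zip(actual, majority)) / len(actual)
sensitivity = (sum(a == 1 and p == 1 for a, p in zip(actual, majority))
               / sum(actual))
print(accuracy, sensitivity)  # 0.94 and 0.0
```

This is why a model that looks strong on one fit statistic can still be useless for the business question at hand.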
Brett Wujek, SAS
Funda Gunes, SAS Institute
Patrick Hall, SAS
Higher education is taking on big data initiatives to help improve student outcomes and operational efficiencies. This paper shows how one institution went from white paper to action. Challenges such as developing a common definition of big data, gaining access to certain types of data, and bringing together institutional groups that had little previous contact are discussed. The stepwise, collaborative approach taken is highlighted, and how the data was eventually brought together and analyzed to generate actionable results is also covered. Learn how to make big data a part of your analytics toolbox.
Stephanie Thompson, Datamum
We set out to model two of the leading causes of fatal automobile accidents in America--drunk driving and speeding. We created a decision tree for each of these causes that uses situational conditions, such as time of day and type of road, to classify historical accidents. Using this model, a law enforcement or town official can decide whether a speed trap or a DUI checkpoint would be most effective in a particular situation. This proof of concept can easily be applied to specific jurisdictions for customized solutions.
Catherine LaChapelle, NC State University
Daniel Brannock, North Carolina State University
Aric LaBarr, Institute for Advanced Analytics
Craig Shaver, North Carolina State University
One of the major diseases that records a high number of readmissions is bacterial pneumonia in Medicare patients. This study aims at comparing and contrasting the Northeast and South regions of the United States based on the factors causing 30-day readmissions. The study also identifies some of the ICD-9 medical procedures associated with those readmissions. Using SAS® Enterprise Guide® 7.1 and SAS® Enterprise Miner™ 14.1, this study analyzes patient and hospital demographics of the Northeast and South regions of the United States using the Cerner Health Facts Database. Further, the study suggests some preventive measures to reduce readmissions. The 30-day readmissions are computed based on admission and discharge dates from 2003 until 2013. Using clustering, various hospitals, along with discharge disposition levels (where a patient is sent after discharge), are grouped. In both regions, patients who are discharged to home have shown significantly lower chances of readmission. The odds ratio = 0.562 for both the Northeast and South regions; all other disposition levels have significantly higher odds (> 1) compared to discharged to home. For the South region, females have around 53 percent higher odds of readmission compared to males (odds ratio = 1.535). Also, some of the hospital groups have higher readmission cases. The association analysis using the Market Basket node finds catheterization procedures to be highly significant (lift = 1.25 for the Northeast and 3.84 for the South; confidence = 52.94% for the Northeast and 85.71% for the South) in bacterial pneumonia readmissions. Research has found that during these procedures, patients are highly susceptible to acquiring Methicillin-resistant Staphylococcus aureus (MRSA) bacteria, which causes Methicillin-resistant pneumonia. Providing timely follow-up for patients who underwent these procedures might possibly limit readmissions. These patients might also be discharged to home under medical supervision, as such patients had shown significantly lower chances of readmission.
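Confidence and lift figures like those reported above come directly from co-occurrence counts. As an illustration (the admission counts below are invented, not the study's), the association-rule metrics for "catheterization implies readmission" can be computed as:

```python
def confidence_and_lift(n_total, n_a, n_b, n_ab):
    """Association-rule metrics for the rule A -> B from co-occurrence counts."""
    confidence = n_ab / n_a                 # P(B | A)
    lift = confidence / (n_b / n_total)     # P(B | A) / P(B)
    return confidence, lift

# Hypothetical counts: 1,000 admissions, 140 with a catheterization
# procedure (A), 200 readmitted (B), 70 with both
conf, lift = confidence_and_lift(n_total=1000, n_a=140, n_b=200, n_ab=70)
print(round(conf, 2), round(lift, 2))
```

A lift above 1 means the procedure co-occurs with readmission more often than chance alone would predict, which is how the Market Basket node flags catheterization as significant.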
Heramb Joshi, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Dr. William Paiva, Center for Health Systems Innovation, OSU, Tulsa
Aditya Sharma, Oklahoma State University
Debt collection! The two words can trigger multiple images in one's mind--mostly harsh. However, let's try to think positively for a moment. In 2013, over $55 billion of debt was past due in the United States. What if all of this debt were left as is and the fate of credit issuers left in the hands of goodwill payments made by defaulters? Well, this is not the most sustainable model, to say the least. In this situation, debt collection comes in as a tool that is employed at multiple levels of recovery to keep the credit flowing. Ranging from in-house to third-party to individual collection efforts, this industry is huge and plays an important role in keeping the engine of commerce running. In the recent past, with financial markets recovering and banks selling fewer charged-off accounts and at higher prices, debt collection has increasingly become a game of efficient operations backed by solid analytics. This paper takes you into the back alleys of all the data that is in there and gives an overview of some ways modeling can be used to impact the collection strategy and outcome. SAS® tools such as SAS® Enterprise Miner™ and SAS® Enterprise Guide® are extensively used for both data manipulation and modeling. Decision trees are given more focus to understand which factors make the most impact. Along the way, this paper also gives an idea of how analytics teams today are slowly trying to get buy-in from other stakeholders in the company, which surprisingly is one of the most challenging aspects of our jobs.
Karush Jaggi, AFS Acceptance
Harold Dickerson, SquareTwo Financial
Thomas Waldschmidt, SquareTwo Financial
This paper presents an application based on predictive analytics and feature-extraction techniques to develop an alternative method for diagnosing obstructive sleep apnea (OSA). Our method reduces the time and cost associated with the gold standard, polysomnography (PSG), which is operated manually, by automatically determining the severity of a patient's OSA via classification models that use the time series from a one-lead electrocardiogram (ECG). The data is from Dr. Thomas Penzel of Philipps-University, Germany, and can be downloaded at www.physionet.org. The selected data consists of 10 recordings (7 OSA patients and 3 controls) of ECG collected overnight, and non-overlapping minute-by-minute OSA episode annotations (apnea and non-apnea states). This accounts for a total of 4,998 events (2,532 non-apnea and 2,466 apnea minutes). This paper highlights a nonlinear decomposition technique, wavelet analysis (WA), in SAS/IML® software to maximize the information about OSA symptoms in the ECG, resulting in useful predictor signals. Then, spectral and cross-spectral analyses via PROC SPECTRA are used to quantify important patterns of those signals as numbers (features), namely power spectral density (PSD), cross power spectral density (CPSD), and coherency, such that the machine learning techniques in SAS® Enterprise Miner™ can differentiate OSA states. To eliminate variations such as body build, age, gender, and health condition, we normalize each feature by the corresponding feature of the original signal (that is, the ratio of the PSD of the ECG's wavelet decomposition to the PSD of the ECG). Moreover, because different OSA symptoms occur at different times, we account for this by including features from adjacent minutes in the analysis and selecting only the important ones using a decision tree model. The best classification result on the validation data (70:30 split), obtained from the random forest model, is 96.83% accuracy, 96.39% sensitivity, and 97.26% specificity. The results suggest our method compares well to the gold standard.
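Power spectral density features of the kind described above can be estimated with a simple periodogram. The following sketch is illustrative only (the study used PROC SPECTRA; the synthetic two-tone signal, sampling rate, and duration are invented stand-ins for an ECG-derived signal):

```python
import numpy as np

fs = 100.0                                  # assumed sampling rate, Hz
t = np.arange(0, 10, 1 / fs)                # 10 s of synthetic signal
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 20 * t)

# Periodogram: squared magnitude of the FFT, a basic PSD estimate
freqs = np.fft.rfftfreq(len(t), d=1 / fs)
psd = np.abs(np.fft.rfft(signal)) ** 2 / len(t)

peak_hz = freqs[np.argmax(psd)]             # dominant frequency of the signal
print(peak_hz)
```

Summaries of this spectrum (total power in a band, location of the peak, and so on) are the kind of numeric features a classifier can then consume.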
Woranat Wongdhamma, Oklahoma State University
Ensemble models are a popular class of methods for combining the posterior probabilities of two or more predictive models in order to create a potentially more accurate model. This paper summarizes the theoretical background of recent ensemble techniques and presents examples of real-world applications. Examples of these novel ensemble techniques include weighted combinations (such as stacking or blending) of predicted probabilities in addition to averaging or voting approaches that combine the posterior probabilities by adding one model at a time. Fit statistics across several data sets are compared to highlight the advantages and disadvantages of each method, and process flow diagrams that can be used as ensemble templates for SAS® Enterprise Miner™ are presented.
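Averaging posterior probabilities, the simplest of the ensemble techniques summarized above, can be sketched as follows. The three models' probabilities below are invented for illustration; passing explicit weights corresponds to a stacked/blended combination with fixed weights:

```python
def average_ensemble(prob_lists, weights=None):
    """Combine posterior probabilities from several models by (weighted) averaging."""
    n_models = len(prob_lists)
    weights = weights or [1.0 / n_models] * n_models
    return [sum(w * p[i] for w, p in zip(weights, prob_lists))
            for i in range(len(prob_lists[0]))]

# Hypothetical posterior probabilities of the event from three models
tree  = [0.80, 0.10, 0.55]
nnet  = [0.70, 0.20, 0.45]
logit = [0.90, 0.30, 0.50]

avg = average_ensemble([tree, nnet, logit])
print(avg)
```

In stacking proper, the weights would themselves be learned by a second-level model on held-out predictions rather than fixed in advance.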
Wendy Czika, SAS
Ye Liu, SAS Institute
Every day, we are bombarded by pundits pushing big data as the cure for all research woes and heralding the death of traditional quantitative surveys. We are told how big data, social media, and text analytics will make the Likert scale go the way of the dinosaur. This presentation makes the case for evolving our surveys and data sets to embrace new technologies and modes while still welcoming new advances in big data and social listening. Examples from the global automotive industry are discussed to demonstrate the pros and cons of different types of data in an automotive environment.
Will Neafsey, Ford Motor Company
Analytics is now pervasive in our society, informing millions of decisions every year, and even making decisions without human intervention in automated systems. The implicit assumption is that all completed analytical projects are accurate and successful, but the reality is that some are not. Bad decisions that result from faulty analytics can be costly, embarrassing, and even dangerous. By understanding the root causes of analytical failures, we can minimize the chances of repeating the mistakes of the past.
Oil and gas wells encounter many types of adverse events, including unexpected shut-ins, scaling, rod pump failures, breakthrough, and fluid influx, to name just a few common ones. This paper introduces a real-time event detection system that provides visual reports and alerts petroleum engineers when an event is detected in a well. This system uses advanced time series analysis on both high- and low-frequency surface and downhole measurements. This system gives the petroleum engineer the ability to switch from passive surveillance of events, which then require remediation, to an active surveillance solution, which enables the engineer to optimize well intervention strategies. Events can occur simultaneously or rapidly, or can be masked over long periods of time. We tested the proposed method on both simulated and publicly available data to demonstrate how a multitude of well events can be detected by using multiple models deployed into the data stream. In our demonstration, we show how our real-time system can detect a mud motor pressure failure resulting from a latent fluid overpressure event. Giving the driller advance warning of the motor state and its potential to fail can save operators millions of dollars a year by reducing nonproductive time caused by these failures. Reducing nonproductive time is critical to reducing the operating costs of constructing a well and to improving cash flow by shortening the time to first oil.
Milad Falahi, SAS
Gilbert Hernandez, SAS Institute
Memory-based reasoning (MBR) is an empirical classification method that works by comparing the case in hand with similar examples from the past and then applying that information to the new case. MBR modeling is based on the assumptions that the input variables are numeric, orthogonal to each other, and standardized. The latter two assumptions are satisfied by a principal components transformation of the raw variables, using the components instead of the raw variables as inputs to MBR. To satisfy the first assumption, the categorical variables are often dummy coded. This raises issues such as increased dimensionality and overfitting of the training data by introducing discontinuity in the response surface relating inputs and target variables. The Weight of Evidence (WOE) method overcomes this challenge. This method measures the relative response of the target for each group level of a categorical variable; then the levels are replaced by the pattern of response of the target variable within that category. The SAS® Enterprise Miner™ Interactive Grouping node is used to achieve this. With this method, the categorical variables are converted into numeric variables. This paper demonstrates the improvement in performance of an MBR model when categorical variables are WOE coded. A credit screening data set obtained from SAS® Education that comprises 25 attributes for 3,000 applicants is used for this study. Three different types of MBR models were built using the SAS Enterprise Miner MBR node to check the improvement in performance. The results show the MBR model with WOE-coded categorical variables is the best, based on misclassification rate. Using this data, when WOE coding is adopted, the model misclassification rate decreases from 0.382 to 0.344 while the sensitivity of the model increases from 0.552 to 0.572.
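At its core, MBR scores a new case by voting among its k most similar stored cases. A minimal Python sketch (illustrative only; the SAS Enterprise Miner MBR node is the actual tool used, and the standardized component scores and labels below are invented):

```python
import math

def mbr_classify(train, labels, query, k=3):
    """Memory-based reasoning: majority vote among the k most similar stored cases."""
    dists = sorted((math.dist(x, query), y) for x, y in zip(train, labels))
    top = [y for _, y in dists[:k]]
    return max(set(top), key=top.count)

# Hypothetical standardized principal-component scores for past applicants
train  = [(0.1, 0.2), (0.0, 0.1), (2.0, 2.1), (2.2, 1.9), (0.2, 0.0)]
labels = ["good", "good", "bad", "bad", "good"]

print(mbr_classify(train, labels, query=(0.1, 0.1)))
```

Because the vote hinges on Euclidean distance, inputs must be numeric and comparable in scale, which is exactly why the WOE recoding and principal components transformation matter upstream.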
Vinoth Kumar Raja, West Corporation
Goutam Chakraborty, Oklahoma State University
Vignesh Dhanabal, Oklahoma State University
Does lyric complexity impact song popularity, and can analysis of the Billboard Top 100 from 1955 to 2015 be used to evaluate this hypothesis? The music industry has undergone a dramatic change. New technologies enable everyone's voice to be heard and have created new avenues for musicians to share their music. These technologies are destroying entry barriers, resulting in an exponential increase in competition in the music industry. For this reason, optimization that enables musicians and record labels to recognize opportunities to release songs with a likelihood of high popularity is critical. The purpose of this study is to determine whether the complexity of song lyrics impacts popularity and to provide guidance and useful information about business opportunities for musicians and record labels. One such opportunity is the optimization of advertisement budgets for songs that have the greatest chance of success, at the appropriate time and for the appropriate audience. Our data has been extracted from open-source hosts, loaded into a comprehensive, consolidated data set, and cleaned and transformed using SAS® as the primary tool for analysis. Currently our data set consists of 334,784 Billboard Top 100 observations, with 26,869 unique songs. The types of analyses used include: complexity-popularity relationship, popularity churn (how often the Billboard Top 100 resets), top-five complexity-popularity relationships, longevity-complexity relationship, popularity prediction, and sentiment analysis of trends over time. We define complexity as the number of words in a song, unique word inclusion (compared to other songs), and repetition of each word in a song. The Billboard Top 100 represents an optimal source of data because all songs on the list have some measure of objective popularity, which enables lyric comparison analysis across chart positions.
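The complexity measures defined above (word count, unique word count, and per-word repetition) are simple to compute. An illustrative Python sketch, with an invented sample lyric standing in for the study's SAS pipeline:

```python
def lyric_complexity(lyrics):
    """Three simple complexity measures: length, vocabulary size, and repetition."""
    words = lyrics.lower().split()
    unique = set(words)
    return {
        "word_count": len(words),
        "unique_words": len(unique),
        "repetition": len(words) / len(unique),  # average uses per distinct word
    }

stats = lyric_complexity("la la la love love me do")
print(stats)
```

A highly repetitive lyric drives the repetition ratio up while holding the vocabulary small, which is the contrast the complexity-popularity analysis exploits.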
John Harden, Oklahoma State University
Yang Gao, Oklahoma State University
Vojtech Hrdinka, Oklahoma State University
Chris Linn, Oklahoma State University
It is of paramount importance for brand managers to measure and understand consumer brand associations and the mindspace their brand captures. Brands are encoded in memory on a cognitive and emotional basis. Traditionally, brand tracking has been done by surveys and feedback, resulting in a direct communication that covers the cognitive segment and misses the emotional segment. Respondents generally behave differently under observation and in solitude. In this paper, a new brand-tracking technique is proposed that involves capturing public data from social media that focuses more on the emotional aspects. For conceptualizing and testing this approach, we downloaded nearly one million tweets for three major brands--Nike, Adidas, and Reebok--posted by users. We proposed a methodology and calculated metrics (benefits and attributes) using this data for each brand. We noticed that generally emoticons are not used in sentiment mining. To incorporate them, we created a macro that automatically cleans the tweets and replaces emoticons with an equivalent text. We then built supervised and unsupervised models on those texts. The results show that using emoticons improves the efficiency of predicting the polarity of sentiments as the misclassification rate was reduced from 0.31 to 0.24. Using this methodology, we tracked the reactions that are triggered in the minds of customers when they think about a brand and thus analyzed their mind share.
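The emoticon-to-text replacement described above (implemented in the paper as a SAS macro) can be sketched in Python. The small emoticon lexicon below is a hypothetical stand-in for a fuller mapping:

```python
import re

# A small hypothetical emoticon lexicon; a production version would cover many more
EMOTICONS = {
    ":)": " happy ",
    ":(": " sad ",
    ":D": " laughing ",
}

def replace_emoticons(tweet):
    """Swap each emoticon for an equivalent sentiment word before text mining."""
    pattern = re.compile("|".join(re.escape(e) for e in EMOTICONS))
    replaced = pattern.sub(lambda m: EMOTICONS[m.group()], tweet)
    return re.sub(r"\s+", " ", replaced).strip()   # collapse extra whitespace

print(replace_emoticons("Loved the new colorway :) but sold out :("))
```

After this substitution, the emotional signal carried by emoticons survives tokenization and can contribute to sentiment models, which is what drove the reported drop in misclassification.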
Sharat Dwibhasi, Oklahoma State University
Fast detection of forest fires is a great concern among environmental experts and national park managers because forest fires create economic and ecological damage and endanger human lives. For effective fire control and resource preparation, it is necessary to predict fire occurrences in advance and estimate the possible losses caused by fires. For this purpose, using real-time sensor data on weather conditions and fire occurrence is highly recommended to support the prediction mechanism. The objective of this presentation is to use SAS® 9.4 and SAS® Enterprise Miner™ 14.1 to predict the probability of fires and to identify the particular weather conditions that result in larger burned areas in Montesinho Park forest (Portugal). The data set was obtained from the Center for Machine Learning and Intelligent Systems at the University of California, Irvine and contains 517 observations and 13 variables from January 2000 to December 2003. Support vector machine analyses with variable selection were performed on this data set for fire occurrence prediction, with a validation accuracy of approximately 60%. The study also incorporates the incremental response technique and hypothesis testing to estimate the increased probability of fire, as well as the extra burned area, under various conditions. For example, when there is no rain, a 27% higher chance of fires and 4.8 hectares of extra burned area are recorded, compared to when there is rain.
Quyen Nguyen, Oklahoma State University
Due to advances in medical care and a rise in living standards, life expectancy in the US has increased to an average of 79 years. This has resulted in an aging population and increased demand for technologies that help elderly people live independently and safely. This challenge can be met through ambient assisted living (AAL) technologies. Much research has been done on human activity recognition (HAR) in the last decade. This research can be used in the development of assistive technologies, and HAR is expected to be the future technology for e-health systems. In this research, I discuss the need to predict human activity accurately by building various models in SAS® Enterprise Miner™ 14.1 on a free public data set that contains 165,633 observations and 19 attributes. Variables used in this research represent the metrics of accelerometers mounted on the waist, left thigh, right arm, and right ankle of four individuals performing five different activities, recorded over a period of eight hours. The target variable predicts human activity such as sitting, sitting down, standing, standing up, and walking. Upon comparing different models, the winner is an auto neural model whose input is taken from stepwise logistic regression, which is used for variable selection. The model has an accuracy of 98.73% and sensitivity of 98.42%.
Venkata Vennelakanti, Oklahoma State University
There is a growing recognition that mental health is a vital public health and development issue worldwide. Numerous studies have reported close interactions between poverty and mental illness. The association between mental illness and poverty is cyclic and negative: impoverished people are generally more prone to mental illness and less able to afford treatment, and likewise, people with mental health problems are more likely to be in poverty. The availability of free mental health treatment to people in poverty is critical to breaking this vicious cycle. Based on this hypothesis, a model was developed from the responses provided by mental health facilities to a federally supported survey. We examined whether we can predict if a mental health facility will offer free treatment to patients who cannot afford treatment costs in the United States. About a third of the 9,076 mental health facilities that responded to the survey stated that they offer free treatment to patients incapable of paying. Different machine learning algorithms and regression models were assessed to correctly predict which facilities would offer treatment at no cost. Using a neural network model in conjunction with a decision tree for input variable selection, we found that the best performing model can predict the mental health facilities with an overall accuracy of 71.49%. Sensitivity and specificity of the selected neural network model were 82.96% and 56.57%, respectively. The top five most important covariates that explain the model's predictive power are: ownership of the mental health facility, whether the facility provides a sliding fee scale, the type of facility, whether the facility provides mental health treatment services in Spanish, and whether the facility offers smoking cessation services. More detailed spatial and descriptive analysis of these key variables will be conducted.
Ram Poudel, Oklahoma State University
Lynn Xiang, Oklahoma State University
Years ago, doctors advised women with autoimmune diseases such as systemic lupus erythematosus (SLE) and rheumatoid arthritis (RA) not to become pregnant out of concern for maternal health. Now, it is known that healthy pregnancy is possible for women with lupus, but at the expense of a higher pregnancy complication rate. The main objective of this research is to identify key factors contributing to these diseases and to predict the occurrence rate of SLE and RA in pregnant women. Based on the approach used in this study, the prediction of adverse pregnancy outcomes for women with SLE, RA, and other diseases such as diabetes mellitus (DM) and antiphospholipid antibody syndrome (APS) can be carried out. These results will help pregnant women undergo a healthy pregnancy by receiving proper medication at an earlier stage. The data set was obtained from the Cerner Health Facts data warehouse. The raw data set contains 883,473 records and 85 variables such as diagnosis code, age, race, procedure code, admission date, discharge date, total charges, and so on. Analyses were carried out with two different data sets--one for SLE patients and the other for RA patients. The final data sets had 398,742 and 397,898 records for modeling SLE and RA patients, respectively. To provide an honest assessment of the models, the data was split into training and validation sets using the data partition node. Variable selection techniques such as LASSO, LARS, stepwise regression, and forward regression were used. Using a decision tree, prominent factors that determine the SLE and RA occurrence rates were identified separately. Of all the predictive models run using SAS® Enterprise Miner™ 12.3, the model comparison node identified the decision tree (Gini) as the best model, with the lowest misclassification rate: 0.308 for predicting SLE patients and 0.288 for predicting RA patients.
Ravikrishna Vijaya Kumar, Oklahoma State University
Shrie Raam Sathyanarayanan, Oklahoma State University - Center for Health Sciences
In early 2006, the United States experienced a housing bubble that affected over half of the American states. It was one of the leading causes of the 2007-2008 financial recession. Primarily, the overvaluation of housing units resulted in foreclosures and prolonged unemployment during and after the recession period. The main objective of this study is to predict the current market value of a housing unit with respect to fair market rent, census region, metropolitan statistical area, area median income, household income, poverty income, number of units in the building, number of bedrooms in the unit, utility costs, other costs of the unit, and so on, and to determine which factors affect the market value of the housing unit. For the purpose of this study, data was collected from the Housing Affordability Data System of the US Department of Housing and Urban Development. The data set contains 20 variables and 36,675 observations. To select the best possible input variables, several variable selection techniques were tested: LARS (least angle regression), LASSO (least absolute shrinkage and selection operator), adaptive LASSO, variable selection, variable clustering, stepwise regression, principal component analysis (PCA) with only numeric variables, and PCA with all variables. After selecting input variables, numerous modeling techniques were applied to predict the current market value of a housing unit. An in-depth analysis of the findings revealed that the current market value of a housing unit is significantly affected by the fair market value, insurance and other costs, structure type, household income, and more. Furthermore, a higher household income and a higher area median income are associated with a higher market value of a housing unit.
Mostakim Tanjil, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
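Among the selection techniques named in the abstract above, LASSO performs variable selection by shrinking some coefficients exactly to zero. A minimal coordinate-descent sketch in pure Python illustrates the mechanism; the toy data are invented for illustration, and real analyses would use the SAS procedures named in the abstract (or an equivalent library) rather than this sketch:

```python
def soft_threshold(z, gamma):
    """Soft-thresholding operator used in LASSO coordinate descent."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0

def lasso_cd(X, y, lam, n_iter=200):
    """Fit LASSO coefficients by cyclic coordinate descent.

    Minimizes (1/2n) * ||y - Xb||^2 + lam * ||b||_1.
    Assumes the columns of X are standardized (mean 0, unit variance)."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the partial residual
            # (response minus the fit from all other features)
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * b[k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            b[j] = soft_threshold(rho, lam)  # unit-variance columns assumed
    return b

# Toy data: y depends on the first feature only; columns are orthogonal
# with unit variance, so the update decouples per coordinate.
X = [[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]]
y = [2.0, -2.0, 2.0, -2.0]
coef = lasso_cd(X, y, lam=0.1)
print(coef)  # second coefficient shrunk to exactly 0
```

The key property, visible in the toy example, is that the soft-thresholding step drives the coefficient of the irrelevant feature to exactly zero, which is what makes LASSO a selection method rather than only a shrinkage method.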
DonorsChoose.org is a nonprofit organization that allows individuals to donate directly to public school classroom projects. Teachers from public schools post a request for funding a project along with a short essay describing it. Donors around the world can browse these projects when they log in to DonorsChoose.org and donate to the projects of their choice. The idea is to build a personalized recommendation page for each donor that shows the projects they are most likely to prefer and donate to. Implementing a recommender system for the DonorsChoose.org website will improve the user experience and help more projects meet their funding goals. It will also help us understand donors' preferences and deliver what they want or value. One type of recommendation system can be designed by predicting which projects are less likely to meet their funding goals, segmenting and profiling the donors, and using that information to recommend the right projects when donors log in to DonorsChoose.org.
Heramb Joshi, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Sandeep Chittoor, Student
Vignesh Dhanabal, Oklahoma State University
Sharat Dwibhasi, Oklahoma State University
Coronal mass ejections (CMEs) are massive explosions of magnetic field and plasma from the Sun. While responsible for the northern lights, these eruptions can cause geomagnetic storms and cataclysmic damage to Earth's telecommunications systems and power grid infrastructures. Hence, it is imperative to construct highly accurate predictive processes to determine whether an incoming CME will produce devastating effects on Earth. One such process, called stacked generalization, trains a variety of models, or base-learners, on a data set. Then, using the predictions from the base-learners, another model is trained to learn from the metadata. The goal of this meta-learner is to deduce information about the biases from the base-learners to make more accurate predictions. Studies have shown success in using linear methods, especially within regularization frameworks, at the meta-level to combine the base-level predictions. Here, SAS® Enterprise Miner™ 13.1 is used to reinforce the advantages of regularization via the Least Absolute Shrinkage and Selection Operator (LASSO) on this type of metadata. This work compares the LASSO model selection method to other regression approaches when predicting the occurrence of strong geomagnetic storms caused by CMEs.
Taylor Larkin, The University of Alabama
Denise McManus, University of Alabama
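The core of stacked generalization as described in the abstract above is that the meta-learner must be trained on predictions the base-learners made for observations they did not see during fitting. A minimal sketch of building that meta-level data set with out-of-fold predictions, in pure Python; the fold scheme and the toy base learners are illustrative assumptions, not the study's actual models:

```python
def kfold_indices(n, k):
    """Split range(n) into k contiguous folds."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def stacked_predictions(base_fits, X, y, k=3):
    """Build the meta-level data set: out-of-fold predictions from each
    base learner, so the meta-learner never sees in-sample predictions.

    base_fits: list of functions (X_train, y_train) -> predict_fn."""
    n = len(y)
    meta_X = [[0.0] * len(base_fits) for _ in range(n)]
    for fold in kfold_indices(n, k):
        held = set(fold)
        tr_idx = [i for i in range(n) if i not in held]
        X_tr = [X[i] for i in tr_idx]
        y_tr = [y[i] for i in tr_idx]
        for m, fit in enumerate(base_fits):
            predict = fit(X_tr, y_tr)     # train base learner without the fold
            for i in fold:
                meta_X[i][m] = predict(X[i])  # predict on held-out rows only
    return meta_X

# Two toy base learners: one predicts the training mean, one echoes a feature
fit_mean = lambda Xt, yt: (lambda x: sum(yt) / len(yt))
fit_feat = lambda Xt, yt: (lambda x: x[0])
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [0.1, 1.0, 2.2, 2.9, 4.1, 5.0]
meta_X = stacked_predictions([fit_mean, fit_feat], X, y, k=3)
```

The resulting `meta_X` columns are then fed to the meta-learner, which in the study's framework is a regularized regression such as the LASSO, allowing it to down-weight or zero out biased base-learners.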
With the ever-growing global tourism sector, the hotel industry is flourishing by leaps and bounds, and tourists' expectations of service quality are rising as well. In today's digital age, where so many consumers are part of online communities, electronic word of mouth has steadily grown in influence. According to a recent survey, approximately 46% of travelers look for online reviews before traveling. A one-star rating from users influences peer consumers more than the five-star brand image a hotel markets for itself. In this paper, we segment customers into clusters based on their reviews. We then create an online survey targeting the interests of the customers in those different clusters, which helps hoteliers identify guests' interests and their hospitality expectations of a hotel. For this paper, we use a data set of 4,096 different hotels, with a total of 60,239 user reviews.
Saurabh Nandy, Oklahoma State University
Neha Singh, Oklahoma State University
Cancer is the second-leading cause of death in the United States. About 85% of all lung cancers are non-small cell lung cancer, and lung cancer constitutes about 26.8% of cancer deaths. An efficient cancer treatment plan is therefore of prime importance for increasing a patient's chances of survival. Cancer treatments are generally given to a patient in multiple sittings, and doctors often make decisions based on improvement over time. Calculating a patient's survival chances over time and determining the various factors that influence survival time would help doctors make more informed decisions about the further course of treatment, and help patients take a proactive approach in making treatment choices. The objective of this paper is to analyze the survival time of patients suffering from non-small cell lung cancer, identify the time interval crucial for their survival, and assess the influence of factors such as age, gender, and treatment type. The performance of the models built is analyzed to understand the significance of the cubic splines used in these models. The data set is from the Cerner database, with 548 records and 12 variables spanning 2009 to 2013. Patient records lost to follow-up are censored. The survival analysis is performed using parametric and nonparametric methods in SAS® Enterprise Miner™ 13.1. The analysis revealed that the survival probability of a patient is high within two weeks of hospitalization and that the probability of survival drops by 60% between weeks 5 and 9 of admission. Age, gender, and treatment type play prominent roles in influencing survival time and risk. The probability of survival for female patients decreases by 70% and 80% during weeks 6 and 7, respectively, and for male patients it decreases by 70% and 50% during weeks 8 and 13, respectively.
Raja Rajeswari Veggalam, Oklahoma State University
Goutam Chakraborty, Oklahoma State University
Akansha Gupta, Oklahoma State University
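The standard nonparametric method for the kind of censored survival analysis described in the abstract above is the Kaplan-Meier estimator. A minimal pure-Python sketch; the follow-up times below are invented for illustration and are not the study's Cerner data:

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival curve for right-censored data.

    times:  observed follow-up time (e.g., weeks) per patient
    events: 1 if the event (death) was observed, 0 if censored
            (lost to follow-up)
    Returns a list of (time, survival probability) at each event time."""
    data = sorted(zip(times, events))  # ties are contiguous after sorting
    n_at_risk = len(data)
    surv, curve = 1.0, []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = sum(1 for tt, e in data if tt == t and e == 1)
        total_at_t = sum(1 for tt, _ in data if tt == t)
        if deaths:
            surv *= 1.0 - deaths / n_at_risk  # multiply by P(survive past t)
            curve.append((t, surv))
        n_at_risk -= total_at_t  # deaths and censored both leave the risk set
        i += total_at_t
    return curve

# Hypothetical follow-up times in weeks; 0 in events marks a censored record
times = [2, 3, 3, 5, 6, 8, 8, 9, 12]
events = [1, 1, 0, 1, 1, 1, 0, 1, 1]
for t, s in kaplan_meier(times, events):
    print(f"week {t}: S(t) = {s:.3f}")
```

Censored records still count in the risk set up to their last observed time, which is why the estimator can use patients lost to follow-up without discarding them.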
A survey is often the best way to obtain information and feedback to help in program improvement. At a minimum, survey research draws on economics, statistics, and psychology to develop a theory of how surveys measure and predict important aspects of the human condition. From health care research papers written by experts in the industry, I hypothesize a list of factors that are important and necessary for patients during hospital visits. Through text mining using SAS® Enterprise Miner™ 12.1, I measured tangible aspects of hospital quality and patient care by using online survey responses and comments found on the website of the National Health Service in England. I use these patient comments to determine the majority opinion and whether it correlates with expert research on hospital quality. The implications of this research are vital to comprehending our health care system and the factors that are important in satisfying patient needs. Analyzing survey responses can help us understand and mitigate disparities in the health care services provided to population subgroups in the United States. Starting with online survey responses can help us understand the overall methodology and motivation needed to analyze surveys developed for organizations like the Centers for Medicare and Medicaid Services (CMS).
Divya Sridhar, Deloitte
In this session, we examine the use of text analytics as a tool for strategic analysis of an organization, a leader, a brand, or a process. The software solutions SAS® Enterprise Miner™ and Base SAS® are used to extract topics from text and create visualizations that position those topics relative to various business entities. We review a number of case studies that identify and visualize brand-topic relationships in the context of branding, leadership, and service quality.
Nicholas Evangelopoulos, University of North Texas
All public schools in the United States require health and safety education for their students, and almost all states require driver education before minors can obtain a driver's license. Through extensive analysis of Fatality Analysis Reporting System data, we found that from 2011 to 2013 an average of 12.1% of all individuals killed in motor vehicle accidents in the United States, the District of Columbia, and Puerto Rico were minors (18 years or younger). Our goal is to offer insights from our analysis that can improve road safety education and prevent future premature deaths involving motor vehicles.
Molly Funk, Bryant University
Max Karsok, Bryant University
Michelle Williams, Bryant University
It has been aptly said that Las Vegas looks the way you would imagine heaven must look at night. What if you knew the secret to running any of a plethora of businesses in the entertainment capital of the world? Nothing better, right? Well, we have what you want: all the ingredients you need to know precisely which business to target in a particular locality of Las Vegas. Yelp, a community portal, wants to help people find great local businesses. It covers almost everything, from dentists and hair stylists to mechanics and restaurants. Yelp's users, called Yelpers, write reviews and give ratings for all types of businesses, and Yelp uses this data to recommend to Yelpers the establishments that best fit their individual needs. We use the Yelp academic data set, comprising 1.6 million reviews and 500,000 tips by 366,000 users for 61,000 businesses across several cities. We combine the current Yelp data from the various data sets for Las Vegas to create an interactive map that provides an overview of how businesses perform in each locality and how ratings and reviews affect a business. We answer the following questions: Where is the most appropriate neighborhood to start a new business (such as a cafe or a bar)? Which category of business has the greatest total count of reviews, that is, which is the most talked-about (trending) type of business in Las Vegas? How do a business's working hours affect customer reviews and the corresponding rating of the business? Our findings lay the groundwork for further understanding how users' perceptions, expressed through reviews and ratings, relate to the growth of a business, encompassing a variety of topics in data mining and data visualization.
Anirban Chakraborty, Oklahoma State University