Student Symposium Papers A-Z

D
Session 2024-2017:
Dataninjas: Modeling Life Insurance Risk
We modeled an eight-level ordinal life insurance risk response on a pre-cleansed and pre-normalized Prudential data set. The data set consists of 59,381 observations and 128 predictors, of which 13 are continuous, 5 are discrete, and the remainder are categorical. The overall objective of the project was to develop a scoring formula to simplify the life insurance application process in order to encourage more customers to apply for, and therefore purchase, life insurance. Comparison of average square errors (ASEs), misclassification rates, lift, and relative parsimony led us to choose a 13-predictor logistic regression model from a pool of nine candidates. Although the model, in which Body Mass Index (BMI) figures prominently, is globally better than chance at classifying applicants, its misclassification error rates exceed 50 percent for every response level below the highest level (which represents the lowest insurance risk). The high error rates call for additional data, subject-matter expertise, and further work to refine the model.
Read the paper (PDF)
David Allen, Kennesaw State University
Seung Lee, Kennesaw State University
F
Session 2021-2017:
Flow Riders: Driving Below Traffic Flow: A Risk Analysis
Objective: To assess the risk of dying in a severe traffic accident (where at least one death occurred) among slow drivers, defined as those driving at least 15% below traffic flow. Methods: Records of severe traffic accidents were acquired from the Fatality Analysis Reporting System (FARS); interstate and US-highway traffic flow speeds in California were acquired from the California Department of Transportation (Caltrans). Each accident involving at least two vehicles was matched to the nearest available speed monitoring station to assess how slow or fast the vehicles were relative to traffic flow. The outcome was whether the driver died in the accident. To control for external confounders such as weather and road conditions, a conditional logistic regression model was used, with vehicles stratified by accident. Covariates of interest included those describing the drivers, the vehicles, and the accidents. Results: In the final multivariate model, slow driving was a significant predictor of death in severe accidents when compared to driving at traffic flow (OR = 2.41, 95% CI: 1.39, 4.18), after adjusting for vehicle type, extent of vehicle damage, and alcohol use. Conclusion: Drivers traveling well below traffic flow are at higher risk of dying in a severe traffic accident than drivers traveling at traffic flow.
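As a rough reading aid for the reported odds ratio, the sketch below backs out the log-odds estimate and its standard error from the published figures. It assumes the interval is a Wald interval (symmetric on the log scale), which the abstract does not state; the variable names are illustrative only.

```python
import math

# Values reported in the abstract above.
odds_ratio = 2.41
ci_low, ci_high = 1.39, 4.18

Z_95 = 1.96  # standard normal quantile for a two-sided 95% interval

# ASSUMPTION: a Wald interval, i.e. log(OR) +/- Z * SE on the log-odds scale.
log_or = math.log(odds_ratio)
std_err = (math.log(ci_high) - math.log(ci_low)) / (2 * Z_95)

# Under that assumption, the geometric mean of the CI bounds should land
# close to the reported point estimate.
ci_midpoint_or = math.sqrt(ci_low * ci_high)

print(f"log(OR) = {log_or:.3f}, SE = {std_err:.3f}")
print(f"OR implied by CI midpoint = {ci_midpoint_or:.2f}")
```

Because the implied midpoint agrees with the reported odds ratio to two decimals, the published numbers are at least consistent with a symmetric log-scale interval.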
Read the paper (PDF)
Zhongjie Cai, University of Southern California
Dixin Shen, University of Southern California
Ken Chau, University of Southern California
J
Session 2027-2017:
J2SP: An Investigation in Social Factors That Might Influence National GDP
This paper details an analysis of the World Development Indicators data set, obtained from the World Bank Information Repository. The aim of the analysis was to provide useful insight into how countries, particularly developing countries such as those in South America and Asia, can use social investment programs to grow their GDP. The analysis showed that useful models for predicting per capita GDP can be built from social factors. Further study is recommended to explore the relationships discovered in this analysis.
Read the paper (PDF)
John Eacott, TXU Energy
Parselvan Aravazhi, Oklahoma State University
Sid Grover, Oklahoma State University
Jayant Sharma, Oklahoma State University
K
Session 2022-2017:
KA Team: Crime in the City of Philadelphia
In doing this project, we hoped to find patterns in crime occurrences that can help law enforcement officials decrease the number of crimes that the City of Philadelphia experiences. More specifically, we wanted to determine when and where most crimes occur, and what types of crimes were most prevalent. Our data came from the open data repository for Philadelphia, which has additional data from other sources in the region. After merging four data sets into one, doing a lot of data cleaning, and creating new variables, we found some interesting trends. First, we found that crimes generally occurred in more isolated areas where there was less traffic, and that certain locations had higher crime counts than others. We also discovered there was less crime in the morning and more in the afternoon, as well as less crime on Sunday and more on Tuesday. During the summer months, total crime occurrences as well as the most prevalent types of crime occurrences (thefts, vandalism/criminal mischief, miscellaneous crimes, and other assault) peaked. Theft occurrences, the most prevalent crime occurrence, showed many of the same trends as overall crime occurrences. We found that thefts and vehicle thefts as well as overall crime occurrences were most prevalent in ZIP code 19102. The models we built to try to classify a crime as violent or nonviolent were not very fruitful, but the tree model was the best in terms of validation misclassification error rate.
Read the paper (PDF)
Edwin Baidoo, Kennesaw State University
Christina Jones, Kennesaw State University
Muniza Naqvi, Kennesaw State University
S
Session 2028-2017:
SAS® Masters: Exploratory Analysis of the Factors Related to Gun Mortality
Every year, a tragically high number of Americans are killed in a gun-related accident, suicide, or homicide. With the idea that many of these deaths could have easily been prevented or are the result of complex social issues, the topic of gun mortality has recently become more prevalent in our society. Through our analysis, we focus on key demographic variables such as race, age, marital status, education, and sex to see how gun mortality trends vary among different groups of people. Statistical procedures used include logistic regression, random forests, and chi-square tests, along with multiple graphs that present the primarily categorical data in a meaningful way. This analysis can provide useful foundational knowledge for gun owners and public policy leaders, so that gun and firearm reform can be approached in the most efficient, impactful way. We hope to inspire others to look deeper into the issue of gun mortality that plagues our nation today.
Read the paper (PDF)
Stephanie Mendoza, California Polytechnic State University, San Luis Obispo
Gabrielle Ilenstine, California Polytechnic State University, San Luis Obispo
T
Session 2023-2017:
The Flamingos: NFL Data Analytics For A New Era
As statistics students striving to discover new impacts that can be made in a data-driven world, we applied our trade to a modern topic. Studying a sport that owns a day of the week and learning how variables can influence any given series or result in a game can lead to a much larger impact. With Base SAS®, we applied predictive analysis methods to estimate the chance that any given team would win a game against a given opponent. To take it a step further, we worked out which decision a coach should really make on fourth down and how that compared with what coaches actually did. With information like this, the football world might soon see an impact on how people play the game.
Read the paper (PDF)
Jonah Muresan, California Polytechnic State University
Daniel Savage, California Polytechnic State University
Gus Moir, California Polytechnic State University
Session 2029-2017:
The Three Amigos: Factors Determining Term Deposit Purchases: How a Bank Can Get Other People's Money
This paper has two goals: 1) Determine which client factors have the highest influence on whether a client purchases a term deposit; 2) Determine the levels of those influential client factors that produce the most term deposit purchases. Achieving these goals can help a bank gain operating capital by targeting clients who are more likely to purchase term deposits. Since the target response variable was binary, a logistic regression model and a binary decision tree model were used to analyze the marketing campaign data. The ROC curves and fit statistics of the two models were compared to see which fit the data best. The logistic regression model was the better model, with a higher area under the ROC curve and a lower misclassification rate. Per the logistic regression model, the three factors that had the largest impact on term deposit purchases were: the type of job the client had, whether the client had credit in default, and whether the client had a personal loan. It was concluded that banks should focus on selling term deposits to clients whose levels of these three factors make a purchase most probable.
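The model comparison described above can be illustrated with a minimal sketch that scores two sets of predicted probabilities by area under the ROC curve and by misclassification rate. Everything here is hypothetical: the labels and scores are made-up toy values, not the paper's marketing data, and the helper names `roc_auc` and `misclassification_rate` are not taken from the paper.

```python
def misclassification_rate(labels, scores, threshold=0.5):
    """Share of cases whose thresholded prediction disagrees with the label."""
    wrong = sum((s >= threshold) != bool(y) for y, s in zip(labels, scores))
    return wrong / len(labels)


def roc_auc(labels, scores):
    """AUC as the chance a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# HYPOTHETICAL toy data: 1 = client bought a term deposit, 0 = did not.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
# HYPOTHETICAL predicted probabilities from the two candidate models.
logit_scores = [0.9, 0.2, 0.7, 0.45, 0.4, 0.1, 0.8, 0.55]
tree_scores = [0.8, 0.2, 0.8, 0.2, 0.8, 0.8, 0.8, 0.2]

for name, scores in [("logistic", logit_scores), ("tree", tree_scores)]:
    auc = roc_auc(labels, scores)
    err = misclassification_rate(labels, scores)
    print(f"{name}: AUC = {auc:.4f}, misclassification = {err:.3f}")
```

On this toy data the logistic scores come out ahead on both measures, which mirrors the direction of the paper's conclusion but says nothing about its actual numbers. In practice both measures would be computed on held-out validation data rather than on the training sample.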
Read the paper (PDF)
Gina Colaianni, Kennesaw State University
Bogdan Gadidov, Kennesaw State University
Matthew Mitchell, Kennesaw State University