Modeling the Census Bureau Data to Predict the Health Insurance Enrollment Status of Individuals in the United States
In this project, we used machine learning to predict health insurance enrollment in three states following President Donald J. Trump’s executive order to eliminate the health insurance requirement for individuals. We used micro-level data from the American Community Survey (ACS) for California, New York, and New Jersey, since these states have the highest urbanization index. We trained three classifiers: Naive Bayes, Logistic Regression, and a Neural Network. Our results show that we can predict who will keep insurance coverage and who will not with an accuracy of over 90 percent. These predictions can help policy makers and institutions such as hospitals, insurance companies, and pharmacies make informed decisions about changes in the number of insurance holders in the absence of an insurance enrollment requirement starting in 2019.
For our analysis, we considered the following data fields: state, age, income level, food-stamp eligibility, marital status, race, education level, and insurance status. Insurance status served as the class label in this study. Our model was trained on the 2012 census data and tested on the 2019 census data. We made sure that both years have exactly the same features, so we refer to them collectively as the dataset in this manuscript. We collected data for three states, namely California, New York, and New Jersey, as these states have the highest urbanization index. Since the dataset contains both categorical and numerical values, the feature values were encoded as binary values for our model. We also adjusted income level for inflation using the Consumer Price Index (CPI) released by the Census Bureau.
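The preprocessing described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the category lists, field names, and the two index values are hypothetical placeholders (they are not official CPI figures or ACS codes), and income is kept as a single inflation-adjusted number here, whereas the study may have binned it before encoding.

```python
def adjust_for_inflation(income, cpi_from, cpi_to):
    """Express an income measured in the cpi_from year in cpi_to-year dollars."""
    return income * cpi_to / cpi_from


def one_hot(value, categories):
    """Return a list of 0/1 indicators with a single 1 at the position of value."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]


# Hypothetical category lists; the real ACS coding is not reproduced here.
STATES = ["CA", "NY", "NJ"]
MARITAL = ["married", "never_married", "other"]

# Placeholder index values for illustration only, NOT official CPI figures.
CPI_TRAIN_YEAR = 100.0
CPI_TEST_YEAR = 110.0


def encode_record(record):
    """Turn one survey record (a dict) into a flat numeric feature vector."""
    income = adjust_for_inflation(record["income"], CPI_TRAIN_YEAR, CPI_TEST_YEAR)
    return (
        one_hot(record["state"], STATES)
        + one_hot(record["marital"], MARITAL)
        + [1 if record["food_stamp"] else 0]
        + [income]
    )


example = {"state": "NY", "marital": "married", "food_stamp": False, "income": 50000.0}
features = encode_record(example)
```

Adjusting income before encoding keeps the training-year and test-year values on the same dollar scale, which matters when the two years are several years apart.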
Naive Bayes, Logistic Regression and Multi-layer Perceptron.
In this study, we developed three classifiers to predict the health insurance enrollment status of individuals: Naive Bayes, Logistic Regression, and a Neural Network. Overall, the Logistic Regression classifier performed better than the other two. The study is important for identifying who will continue to keep their health insurance and who will not. As the population ages and health care costs rise, our model provides information that will be helpful for policy makers and practitioners. The penalty for not having health insurance was removed effective 2019. Hospitals, insurance companies, pharmacies, urgent care facilities, and decision makers would all like to predict the resulting change in the number of insurance holders, which will have a large impact on the cost of health care as well as on insurance premiums. Our model predicts who will keep insurance coverage and who will not with an accuracy of over 90 percent. Specifically, 92 percent of the logistic regression predictions (insured or uninsured) were correct. The F1 score, which combines precision and recall, is 96 percent for logistic regression. These results are promising: they show that, in the absence of mandatory health insurance coverage, policies can be based on highly precise quantitative analysis.
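To make the reported metrics concrete, the sketch below computes accuracy, precision, recall, and F1 (the harmonic mean of precision and recall) from true and predicted labels. The label vectors are made-up toy data, not results from the study, and the function name `binary_metrics` is hypothetical. It assumes label 1 means "insured" and is treated as the positive class.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels for illustration only (1 = insured, 0 = uninsured).
y_true = [1, 1, 1, 1, 0, 0, 1, 1, 0, 1]
y_pred = [1, 1, 1, 0, 0, 1, 1, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
```

When most individuals belong to the positive class, F1 for that class can be higher than overall accuracy, as in this toy example. This is one plausible reason an F1 score can exceed accuracy, though the source does not state the class balance.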
We faced a few challenges during this project. First, we recognize that pre-existing health conditions affect insurance enrollment; however, because these data are confidential, we could not include this variable. Second, the full dataset was too large for us to process, so we decided to work with three states instead of all states in the US.