This project uses a random forest classifier to predict diabetes from the CDC National Health and Nutrition Examination Survey (NHANES) data. The data is preprocessed to handle missing values, and variables are given descriptive names and grouped to look for correlations. Several models, including Random Forest and XGBoost, are trained and tuned to achieve the best predictive performance.
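
The workflow described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual notebook code: it assumes scikit-learn, fills missing values with the column median (the project's real imputation strategy is not stated here), and runs on synthetic data with hypothetical column names (`age`, `bmi`, `glucose`, `diabetes`) instead of the real survey variables.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the survey data; all column names are hypothetical.
rng = np.random.default_rng(0)
n = 500
features = ["age", "bmi", "glucose"]
df = pd.DataFrame({
    "age": rng.integers(18, 80, size=n).astype(float),
    "bmi": rng.normal(28, 5, size=n),
    "glucose": rng.normal(100, 20, size=n),
})
# Label loosely tied to glucose so the model has a signal to learn.
df["diabetes"] = (df["glucose"] + rng.normal(0, 10, size=n) > 115).astype(int)

# Blank out roughly 10% of feature values to mimic unanswered survey items.
missing = rng.random((n, len(features))) < 0.10
df[features] = df[features].mask(missing)

# Hold out a stratified test set so both classes appear in each split.
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["diabetes"],
    test_size=0.25, random_state=0, stratify=df["diabetes"],
)

# Median imputation (an assumed strategy) followed by a random forest.
model = make_pipeline(
    SimpleImputer(strategy="median"),
    RandomForestClassifier(n_estimators=200, random_state=0),
)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"held-out accuracy: {accuracy:.2f}")
```

Putting the imputer inside the pipeline means the medians are computed from the training split only, which avoids leaking test-set information. An XGBoost classifier with a scikit-learn-compatible interface could be swapped in as the final pipeline step for comparison.
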
Note: This project was developed in Google Colab.
The dataset used: https://www.kaggle.com/datasets/cdc/national-health-and-nutrition-examination-survey
Credit to Toby Anderson for help with data preprocessing: https://www.kaggle.com/code/tobyanderson/health-survey-analysis