This repository contains a modular and scalable machine learning pipeline designed for classification tasks. The code is highly configurable, allowing it to be used with various datasets by simply updating file paths and column configurations. It supports preprocessing, training, evaluation, and visualization, making it a robust solution for data analysis and predictive modeling. Key Features
Dataset Imputation: Supports mean imputation for numerical columns. Implements LOCF (Last Observation Carried Forward) for categorical data. Data Preprocessing: Scales features using StandardScaler. Splits data into training and testing sets. Model Training and Evaluation: Includes Logistic Regression, SVM, and Random Forest models. Performs cross-validation and computes evaluation metrics (accuracy, ROC curves, etc.). Hyperparameter Tuning: Ready to integrate RandomizedSearchCV or GridSearchCV for optimization. Visualization: Generates ROC curves for model performance comparison. Plots feature importance for tree-based models. Flexible Configuration: Parameterized file paths, target column, and imputation strategies make the pipeline adaptable for any dataset. How to Use
Clone the Repository:
git clone
cd your-repository
Set Up Your Environment: Install the required Python libraries:
pip install -r requirements.txt
Update Configuration:
Edit the script to include your dataset's file paths and column names: input_filepath for the input data file. output_filepath for saving the processed dataset. mean_impute_cols for numerical columns requiring mean imputation. locf_impute_cols for categorical columns requiring LOCF imputation. target_col for the name of the target variable.
Run the Script: Execute the script to preprocess your data, train models, and generate visualizations:
python main.py
Requirements
Python 3.7+
Required Libraries:
pandas
numpy
scikit-learn
matplotlib
scipy
Install all dependencies via:
pip install -r requirements.txt
Folder Structure
├── data/
│ ├── input_data.csv # Sample input dataset
│ ├── output_data.csv # Processed dataset after imputation
├── main.py # Main script for running the pipeline
├── requirements.txt # Required libraries
└── README.md # Project documentation
Applications
This pipeline is ideal for:
Healthcare data analysis. General predictive modeling. Rapid prototyping of classification tasks. Feature selection and model comparison.