Add Flexible Missing Data Imputation Functionality #123
Description
This pull request adds an impute_missing_values function to the Koheesio framework for filling in missing data in a DataFrame. The function supports multiple imputation strategies (mean, median, mode, and constant), so it can be adapted to different use cases.
## Key Features:
Imputation Strategies:
Mean: Replaces missing values with the column's mean.
Median: Replaces missing values with the column's median.
Mode: Replaces missing values with the column's mode.
Constant: Allows users to replace missing values with a specified constant.
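
To make the four strategies concrete, here is a minimal sketch of how they could map onto pandas operations. It is not the code from this pull request: the name `impute_sketch`, its parameters, and the choice of pandas as the DataFrame library are all assumptions made for illustration.

```python
import pandas as pd


def impute_sketch(df, columns, strategy="mean", fill_value=None):
    """Hypothetical illustration: return a copy of df with NaNs in `columns` filled."""
    result = df.copy()
    for col in columns:
        if strategy == "mean":
            value = result[col].mean()      # NaNs are skipped by default
        elif strategy == "median":
            value = result[col].median()
        elif strategy == "mode":
            # mode() returns a Series (ties are sorted); take the first entry.
            # This raises IndexError if the column has no non-missing values.
            value = result[col].mode().iloc[0]
        elif strategy == "constant":
            value = fill_value
        else:
            raise ValueError(f"Unknown strategy: {strategy!r}")
        result[col] = result[col].fillna(value)
    return result


df = pd.DataFrame({"age": [20.0, None, 40.0], "city": ["NY", "NY", None]})
filled_age = impute_sketch(df, ["age"], strategy="mean")
filled_city = impute_sketch(df, ["city"], strategy="mode")
```

Note that mean and median only make sense for numeric columns, while mode and constant also work for categorical ones.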
Error Handling:
Raises a ValueError if a specified column does not exist in the DataFrame.
Validates the imputation strategy and ensures the fill_value is provided for the "constant" strategy.
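
The checks described above could look roughly like the following sketch. The helper name `validate_impute_args`, its signature, and the error messages are hypothetical. The description only confirms that a missing column raises a ValueError; using ValueError for the other two checks is an assumption.

```python
VALID_STRATEGIES = ("mean", "median", "mode", "constant")


def validate_impute_args(available_columns, columns, strategy, fill_value=None):
    """Hypothetical pre-flight checks run before any imputation happens."""
    # 1. Every requested column must exist in the DataFrame.
    missing = [c for c in columns if c not in available_columns]
    if missing:
        raise ValueError(f"Columns not found in DataFrame: {missing}")

    # 2. The strategy must be one of the supported names (assumed ValueError).
    if strategy not in VALID_STRATEGIES:
        raise ValueError(
            f"Invalid strategy {strategy!r}; expected one of {VALID_STRATEGIES}"
        )

    # 3. The "constant" strategy needs an explicit fill_value (assumed ValueError).
    if strategy == "constant" and fill_value is None:
        raise ValueError("fill_value is required when strategy is 'constant'")
```

Running all checks up front means a bad argument fails before any column has been modified.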
Scalable Design:
Designed to handle multiple columns in a single call, which avoids repeated per-column invocations on large datasets.
Flexible to extend with additional strategies in the future.
Tests Included:
Comprehensive unit tests have been added to validate the functionality and ensure robust error handling:
Mean Imputation: Validates the replacement of missing values with the mean.
Median Imputation: Checks correctness for median-based replacement.
Mode Imputation: Ensures mode-based imputation works as expected.
Constant Imputation: Tests custom value replacement for missing values.
Error Handling: Verifies that invalid columns raise appropriate errors.
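
For readers who want a feel for what such tests check, here is a small self-contained sketch using the standard-library unittest module. It does not reproduce the tests in this pull request: `impute_stand_in` is a hypothetical pandas-based stand-in for the real function, and the test data is made up for illustration.

```python
import unittest

import pandas as pd


def impute_stand_in(df, columns, strategy, fill_value=None):
    """Hypothetical stand-in for the function under test (not the PR's code)."""
    missing = [c for c in columns if c not in df.columns]
    if missing:
        raise ValueError(f"Columns not found: {missing}")
    pick = {
        "mean": lambda s: s.mean(),
        "median": lambda s: s.median(),
        "mode": lambda s: s.mode().iloc[0],
        "constant": lambda s: fill_value,
    }[strategy]
    out = df.copy()
    for c in columns:
        out[c] = out[c].fillna(pick(out[c]))
    return out


class ImputeTests(unittest.TestCase):
    def setUp(self):
        # One missing value at index 1; the observed values are 1, 3, 3.
        self.df = pd.DataFrame({"x": [1.0, None, 3.0, 3.0]})

    def test_mean(self):
        out = impute_stand_in(self.df, ["x"], "mean")
        self.assertAlmostEqual(out["x"].iloc[1], 7.0 / 3.0)

    def test_median(self):
        out = impute_stand_in(self.df, ["x"], "median")
        self.assertEqual(out["x"].iloc[1], 3.0)

    def test_mode(self):
        out = impute_stand_in(self.df, ["x"], "mode")
        self.assertEqual(out["x"].iloc[1], 3.0)

    def test_constant(self):
        out = impute_stand_in(self.df, ["x"], "constant", fill_value=-1.0)
        self.assertEqual(out["x"].iloc[1], -1.0)

    def test_unknown_column_raises(self):
        with self.assertRaises(ValueError):
            impute_stand_in(self.df, ["does_not_exist"], "mean")
```

Saved to a file, the cases can be run with `python -m unittest <file>`.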
Impact:
This feature adds missing-data imputation to the data preprocessing capabilities of the Koheesio framework.
Handling missing data is a critical step in building robust data pipelines, and the function gives users a flexible, built-in way to do it.