Add Flexible Missing Data Imputation Functionality #123

Open
wants to merge 4 commits into
base: main

shubhamnaikk

Description

This pull request adds an impute_missing_values function to the Koheesio framework for filling in missing data. The function supports several imputation strategies (mean, median, mode, and constant), so users can pick the one that suits their use case.

Key Features:

  • Imputation Strategies:
      • Mean: Replaces missing values with the column's mean.
      • Median: Replaces missing values with the column's median.
      • Mode: Replaces missing values with the column's mode.
      • Constant: Allows users to replace missing values with a specified constant.
  • Error Handling:
      • Raises a ValueError if a specified column does not exist in the DataFrame.
      • Validates the imputation strategy and ensures the fill_value is provided for the "constant" strategy.
  • Scalable Design:
      • Designed to handle multiple columns at once, making it efficient for large datasets.
      • Flexible to extend with additional strategies in the future.
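
For readers without the diff at hand, the behaviour listed above could look roughly like the sketch below. This is not the code from this PR: it assumes the input is a pandas DataFrame (the description does not say which DataFrame type is used), and the parameter names columns and strategy are hypothetical. Only impute_missing_values, fill_value, and the ValueError behaviour come from the description.

```python
from typing import Any, Iterable, Optional

import pandas as pd

# Strategy names taken from the PR description.
VALID_STRATEGIES = ("mean", "median", "mode", "constant")


def impute_missing_values(
    df: pd.DataFrame,
    columns: Iterable[str],  # hypothetical parameter name
    strategy: str = "mean",  # hypothetical parameter name and default
    fill_value: Optional[Any] = None,
) -> pd.DataFrame:
    """Return a copy of df with missing values in the given columns filled."""
    columns = list(columns)
    if strategy not in VALID_STRATEGIES:
        raise ValueError(f"Unknown strategy {strategy!r}; expected one of {VALID_STRATEGIES}")
    if strategy == "constant" and fill_value is None:
        raise ValueError("fill_value is required for the 'constant' strategy")
    unknown = [col for col in columns if col not in df.columns]
    if unknown:
        raise ValueError(f"Columns not found in DataFrame: {unknown}")

    result = df.copy()  # leave the caller's DataFrame untouched
    for col in columns:
        if strategy == "mean":
            value = result[col].mean()
        elif strategy == "median":
            value = result[col].median()
        elif strategy == "mode":
            modes = result[col].mode()  # ignores missing values by default
            if modes.empty:
                continue  # column is entirely missing, nothing to impute from
            value = modes.iloc[0]  # ties are resolved by taking the first mode
        else:
            value = fill_value
        result[col] = result[col].fillna(value)
    return result


df = pd.DataFrame({"age": [20.0, None, 40.0], "city": ["A", "A", None]})
filled = impute_missing_values(df, ["age"], strategy="mean")
```

Returning a copy rather than modifying the input in place is a choice made for this sketch; the PR may behave differently.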

Tests Included:

Comprehensive unit tests have been added to validate the functionality and ensure robust error handling:

Mean Imputation: Validates the replacement of missing values with the mean.
Median Imputation: Checks correctness for median-based replacement.
Mode Imputation: Ensures mode-based imputation works as expected.
Constant Imputation: Tests custom value replacement for missing values.
Error Handling: Verifies that invalid columns raise appropriate errors.

Impact:

This feature significantly enhances the data preprocessing capabilities of the Koheesio framework.
It provides a flexible and reliable method to handle missing data, a critical step in building robust data pipelines.

@shubhamnaikk shubhamnaikk requested a review from a team as a code owner November 25, 2024 03:18
Member

@dannymeijer dannymeijer left a comment

Thank you for your contribution, @shubhamnaikk, really appreciate it.

A few points:

  • The placement of your modules does not quite fit the layout we intend; it would be better to place them elsewhere.
  • As written, the code does not make use of the Koheesio framework (Step classes) or the typing that we intend to enforce.
  • Currently we have no module for pure Python implementations (Transformation in your case); we should add that as part of your PR if that is the intended use.

Also, can you explain the intended use for this a bit more? Would this be for ML use cases, with the input being something like pandas perhaps? Or did you have something else in mind?

I propose that we have a small meetup to discuss, as I would love to add your contribution to our library. Please reach out to me in a DM / email so we can discuss further.

@dannymeijer
Member

Please also see: #129

@dannymeijer
Member

There has not been any response for the last 2 weeks. Please respond or address the aforementioned concerns before the end of the week.

2 participants