Add Flexible Missing Data Imputation Functionality #123

shubhamnaikk · 2024-11-25T03:18:23Z

Description

This pull request adds an impute_missing_values function to the Koheesio framework for filling in missing data. The function supports four imputation strategies (mean, median, mode, and constant), so it can be applied to different use cases.

## Key Features:

- Imputation Strategies:
  - Mean: Replaces missing values with the column's mean.
  - Median: Replaces missing values with the column's median.
  - Mode: Replaces missing values with the column's mode.
  - Constant: Allows users to replace missing values with a specified constant.
- Error Handling:
  - Raises a ValueError if a specified column does not exist in the DataFrame.
  - Validates the imputation strategy and ensures the fill_value is provided for the "constant" strategy.
- Scalable Design:
  - Designed to handle multiple columns at once, making it efficient for large datasets.
  - Flexible to extend with additional strategies in the future.
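
For readers without the diff at hand, the behaviour listed above can be sketched roughly as follows. This is not the code from this pull request: the signature, the parameter names (`columns`, `strategy`) and the use of a plain dict of column lists in place of a DataFrame are all assumptions, chosen so that the sketch runs with only the Python standard library. Only the function name, the four strategies, the `fill_value` argument and the ValueError behaviour are taken from the description.

```python
from statistics import mean, median, mode
from typing import Any, Dict, List, Optional, Sequence

# Assumption: a table is a dict of column name -> values, with None marking
# a missing value. The actual PR operates on a DataFrame.
Table = Dict[str, List[Any]]


def impute_missing_values(
    data: Table,
    columns: Sequence[str],
    strategy: str = "mean",
    fill_value: Optional[Any] = None,
) -> Table:
    """Return a copy of `data` with None entries in `columns` filled in."""
    reducers = {"mean": mean, "median": median, "mode": mode}

    # Validate the strategy and the constant fill value up front.
    if strategy not in reducers and strategy != "constant":
        raise ValueError(f"Unknown strategy: {strategy!r}")
    if strategy == "constant" and fill_value is None:
        raise ValueError("fill_value is required for the 'constant' strategy")

    # Reject columns that do not exist in the table.
    unknown = [col for col in columns if col not in data]
    if unknown:
        raise ValueError(f"Columns not found: {unknown}")

    # Work on a shallow copy so the caller's table is left untouched.
    result = {name: list(values) for name, values in data.items()}
    for col in columns:
        observed = [v for v in result[col] if v is not None]
        if strategy == "constant":
            replacement = fill_value
        elif not observed:
            continue  # no observed values to compute from; leave column as-is
        else:
            replacement = reducers[strategy](observed)
        result[col] = [replacement if v is None else v for v in result[col]]
    return result


# Hypothetical usage: fill the gap in "age" with the mean of 20 and 40.
table = {"age": [20, None, 40], "city": ["A", "B", None]}
filled = impute_missing_values(table, ["age"], strategy="mean")
```

Keeping the strategies in a lookup table is one way to make the "extend with additional strategies" point concrete: under these assumptions, a new strategy would be one more entry in `reducers`.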

Tests Included:

Comprehensive unit tests have been added to validate the functionality and ensure robust error handling:

- Mean Imputation: Validates the replacement of missing values with the mean.
- Median Imputation: Checks correctness for median-based replacement.
- Mode Imputation: Ensures mode-based imputation works as expected.
- Constant Imputation: Tests custom value replacement for missing values.
- Error Handling: Verifies that invalid columns raise appropriate errors.

Impact:

This feature significantly enhances the data preprocessing capabilities of the Koheesio framework.
It provides a flexible and reliable method to handle missing data, a critical step in building robust data pipelines.

dannymeijer

Thank you for your contribution, @shubhamnaikk, I really appreciate it.

A few points:

The placement of your modules does not quite fit the layout we intend; it would be better to place them elsewhere.
As written, the code does not make use of the Koheesio framework (Step classes) or the typing that we intend to enforce.
Currently we have no module for pure Python implementations (a Transformation, in your case); we should add one as part of your PR if that is the intended use.

Also, can you explain the intended use for this a bit more? Would this be for ML use cases, with the input being something like pandas, perhaps? Or did you have something else in mind?

I propose that we have a small meetup to discuss, as I would love to add your contribution to our library. Please reach out to me in a DM / email so we can discuss further.

dannymeijer · 2024-11-26T08:47:56Z

Please also see: #129

dannymeijer · 2024-12-09T09:41:58Z

There has not been any response for the last 2 weeks. Please respond or address the aforementioned concerns before the end of the week.

Shubham Naik added 4 commits November 24, 2024 15:02

Add advanced data validation function

1839054

Add Rolling Window Aggregation Feature

6547e9d

Fix and validate Weighted Moving Average (WMA) implementation

825a602

Add data imputation function

a52ad54

shubhamnaikk requested a review from a team as a code owner November 25, 2024 03:18

dannymeijer requested changes Nov 25, 2024

dannymeijer added the blocked label Nov 25, 2024
