Added categorical encoding function #127
First of all, thank you for your contribution, it's very much appreciated.
Please see the comments I gave above.
A few things:
- Your code should be adjusted to use the Koheesio Step class, preferably a pandas-adapted version of the Transformation class we have in the spark module
- Your code should be moved to an appropriate module
- The extra dependency you're introducing is not part of the pyproject at the moment
- Also, I would like you to add extra documentation (module docstring) to explain your use case; add some examples as well once you've adjusted your code
I would love to discuss with you the intent of what you are trying to achieve. Feel free to reach out in a DM/email - my contact information is in my profile (LinkedIn, email).
Please also see: #129
This reverts commit 60c81dc.
…umpy based docstrings
Starting to look really good! :)
I left some detailed comments on what I'd like to see change.
"pandas>=1.5.0",
"scikit-learn>=1.2.0"
I don't want to make these top level dependencies + we already have pandas as an extra dependency.
Let's make an extra called "ml" and put scikit-learn in there. That way you can install the extra dependency as koheesio[pandas,ml]
import pandas as pd
from pydantic import BaseModel

class PandasCategoricalEncoding(BaseModel):
make this class a PandasStep: from koheesio.pandas import PandasStep
    """

    columns: List[str]
    encoding_type: str = "one-hot"
change the type to Literal["one-hot", "ordinal"], that way you don't need the extra check you put in the __init__ method
    drop_first: bool = True
    ordinal_mapping: Dict[str, Dict] = None

    def __init__(self, **kwargs):
This becomes obsolete if you make the type a Literal as stated above
    columns: List[str]
    encoding_type: str = "one-hot"
    drop_first: bool = True
    ordinal_mapping: Dict[str, Dict] = None
Make all these Fields, example:
from koheesio.models import Field
...
class PandasCategoricalEncoding(PandasStep):
    columns: List[str] = Field(..., description="...")
    encoding_type: Literal["one-hot", "ordinal"] = Field(default="one-hot", description="...")
    ...
(and of course add appropriate description to each)
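A minimal sketch of the two suggestions above (a `Literal` type plus `Field` declarations). It uses plain pydantic in place of `PandasStep` and `koheesio.models.Field`, on the assumption that koheesio's `Field` behaves like pydantic's; the class name `EncodingConfigSketch`, the helper `is_valid_encoding`, and the description strings are hypothetical.

```python
from typing import Dict, List, Literal, Optional

from pydantic import BaseModel, Field, ValidationError


class EncodingConfigSketch(BaseModel):
    """Hypothetical stand-in for the step's input fields (not koheesio code)."""

    columns: List[str] = Field(..., description="Columns to encode")
    encoding_type: Literal["one-hot", "ordinal"] = Field(
        default="one-hot", description="Encoding strategy to apply"
    )
    drop_first: bool = Field(default=True, description="Drop the first dummy column")
    ordinal_mapping: Optional[Dict[str, Dict[str, int]]] = Field(
        default=None, description="Per-column mapping from category to integer"
    )


def is_valid_encoding(value: str) -> bool:
    """Return True when pydantic accepts `value` as an encoding_type."""
    try:
        EncodingConfigSketch(columns=["color"], encoding_type=value)
    except ValidationError:
        return False
    return True
```

With the `Literal` annotation, an unsupported value is rejected during model validation, so a hand-written check in `__init__` is no longer needed.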
    def execute(self, data: pd.DataFrame) -> pd.DataFrame:
        """
        Executes the categorical encoding transformation on the provided dataset.

        Parameters
        ----------
        data : pd.DataFrame
            The input dataset to encode.

        Returns
        -------
        pd.DataFrame
            The dataset with the specified categorical columns encoded.
        """
        if self.encoding_type == 'one-hot':
            data = pd.get_dummies(data, columns=self.columns, drop_first=self.drop_first)
        elif self.encoding_type == 'ordinal':
            for column in self.columns:
                if column in data.columns and self.ordinal_mapping and column in self.ordinal_mapping:
                    data[column] = data[column].map(self.ordinal_mapping[column]).fillna(-1).astype(int)
        return data
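The two branches of `execute` above can be exercised in isolation with plain pandas. The sketch below is only an illustration: the function names `one_hot` and `ordinal` and the sample data are hypothetical, and, unlike the method above, the ordinal helper works on a copy so the caller's DataFrame is not mutated.

```python
import pandas as pd


def one_hot(df: pd.DataFrame, columns: list, drop_first: bool = True) -> pd.DataFrame:
    """One-hot encode `columns`; drop_first removes the first category of each column."""
    return pd.get_dummies(df, columns=columns, drop_first=drop_first)


def ordinal(df: pd.DataFrame, mapping: dict) -> pd.DataFrame:
    """Map categories to integers; values missing from the mapping become -1."""
    out = df.copy()  # avoid mutating the caller's DataFrame
    for column, levels in mapping.items():
        if column in out.columns:
            out[column] = out[column].map(levels).fillna(-1).astype(int)
    return out


sample = pd.DataFrame({"color": ["red", "blue", "green"], "size": ["S", "M", "XL"]})
encoded_color = one_hot(sample, ["color"], drop_first=False)
encoded_size = ordinal(sample, {"size": {"S": 0, "M": 1, "L": 2}})
```

Here `"XL"` has no entry in the mapping, so it is encoded as `-1`, which mirrors the `fillna(-1)` fallback in the code under review.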
A few things about execute (for when you change to Step):
- execute takes no arguments; instead, add the DataFrame as one of the input Fields. Add something like this:
df: Optional[pd.DataFrame] = Field(default=None, description="...")
- I will explain why you want this as Optional in a bit
- execute is expected to deal with input (your Fields) and generate Output
- this Output does not need to be returned explicitly; the Step parent class takes care of this
- instead, add a .transform method that can take a DataFrame as input
This means you can change your code like this:
- add an Output class
- add a .transform method
- update your execute method accordingly
Should look something like this (of course add docstrings and things like that):
class PandasCategoricalEncoding(PandasStep):
    ...

    class Output(PandasStep.Output):
        df: pd.DataFrame = Field(..., description="output pandas DataFrame")

    def transform(self, df: Optional[pd.DataFrame] = None) -> pd.DataFrame:
        # a pandas DataFrame has no unambiguous truth value, so compare against None
        self.df = df if df is not None else self.df
        if self.df is None:
            raise RuntimeError("No valid Dataframe was passed")
        self.execute()
        return self.output.df

    def execute(self) -> Output:
        if self.encoding_type == 'one-hot':
            self.output.df = pd.get_dummies(self.df, columns=self.columns, drop_first=self.drop_first)
        elif self.encoding_type == 'ordinal':
            data = self.df
            for column in self.columns:
                if column in data.columns and self.ordinal_mapping and column in self.ordinal_mapping:
                    data[column] = data[column].map(self.ordinal_mapping[column]).fillna(-1).astype(int)
            self.output.df = data
You can then interact with your class like this:
encoding_step = PandasCategoricalEncoding(
    columns=["color"],
    encoding_type="one-hot",
    drop_first=False  # Adjusted to match expected columns
)
encoded_data = encoding_step.transform(self.data)
This way, the interface matches what we do for Spark.
Note: I will work on making Transformation base classes for Pandas in a separate PR.
For reference:
koheesio/src/koheesio/spark/transformations/__init__.py
Lines 35 to 192 in 9bd29ec
class Transformation(SparkStep, ABC):
    """Base class for all transformations

    Concept
    -------
    A Transformation is a Step that takes a DataFrame as input and returns a DataFrame as output. The DataFrame is
    transformed based on the logic implemented in the `execute` method. Any additional parameters that are needed for
    the transformation can be passed to the constructor.

    Parameters
    ----------
    df : Optional[DataFrame]
        The DataFrame to apply the transformation to. If not provided, the DataFrame has to be passed to the
        transform-method.

    Example
    -------
    ### Implementing a transformation using the Transformation class:
    ```python
    from koheesio.steps.transformations import Transformation
    from pyspark.sql import functions as f

    class AddOne(Transformation):
        target_column: str = "new_column"

        def execute(self):
            self.output.df = self.df.withColumn(
                self.target_column, f.col("old_column") + 1
            )
    ```
    In the example above, the `execute` method is implemented to add 1 to the values of the `old_column` and store the
    result in a new column called `new_column`.

    ### Using the transformation:
    In order to use this transformation, we can call the `transform` method:
    ```python
    from pyspark.sql import SparkSession

    # create a DataFrame with 3 rows
    df = SparkSession.builder.getOrCreate().range(3)
    output_df = AddOne().transform(df)
    ```
    The `output_df` will now contain the original DataFrame with an additional column called `new_column` with the
    values of `old_column` + 1.

    __output_df:__
    |id|new_column|
    |--|----------|
    | 0| 1|
    | 1| 2|
    | 2| 3|
    ...

    ### Alternative ways to use the transformation:
    Alternatively, we can pass the DataFrame to the constructor and call the `execute` or `transform` method without
    any arguments:
    ```python
    output_df = AddOne(df).transform()
    # or
    output_df = AddOne(df).execute().output.df
    ```
    > Note: that the transform method was not implemented explicitly in the AddOne class. This is because the `transform`
    method is already implemented in the `Transformation` class. This means that all classes that inherit from the
    Transformation class will have the `transform` method available. Only the execute method needs to be implemented.

    ### Using the transformation as a function:
    The transformation can also be used as a function as part of a DataFrame's `transform` method:
    ```python
    input_df = spark.range(3)
    output_df = input_df.transform(AddOne(target_column="foo")).transform(
        AddOne(target_column="bar")
    )
    ```
    In the above example, the `AddOne` transformation is applied to the `input_df` DataFrame using the `transform`
    method. The `output_df` will now contain the original DataFrame with an additional columns called `foo` and
    `bar', each with the values of `id` + 1.
    """

    df: Optional[DataFrame] = Field(default=None, description="The Spark DataFrame")

    @abstractmethod
    def execute(self) -> SparkStep.Output:
        """Execute on a Transformation should handle self.df (input) and set self.output.df (output)

        This method should be implemented in the child class. The input DataFrame is available as `self.df` and the
        output DataFrame should be stored in `self.output.df`.

        For example:
        ```python
        def execute(self):
            self.output.df = self.df.withColumn(
                "new_column", f.col("old_column") + 1
            )
        ```
        The transform method will call this method and return the output DataFrame.
        """
        # self.df # input dataframe
        # self.output.df # output dataframe
        self.output.df = ...  # implement the transformation logic
        raise NotImplementedError

    def transform(self, df: Optional[DataFrame] = None) -> DataFrame:
        """Execute the transformation and return the output DataFrame

        Note: when creating a child from this, don't implement this transform method. Instead, implement execute!

        See Also
        --------
        `Transformation.execute`

        Parameters
        ----------
        df: Optional[DataFrame]
            The DataFrame to apply the transformation to. If not provided, the DataFrame passed to the constructor
            will be used.

        Returns
        -------
        DataFrame
            The transformed DataFrame
        """
        self.df = df or self.df
        if not self.df:
            raise RuntimeError("No valid Dataframe was passed")
        self.execute()
        return self.output.df

    def __call__(self, *args, **kwargs):
        """Allow the class to be called as a function.
        This is especially useful when using a DataFrame's transform method.

        Example
        -------
        ```python
        input_df = spark.range(3)
        output_df = input_df.transform(AddOne(target_column="foo")).transform(
            AddOne(target_column="bar")
        )
        ```
        In the above example, the `AddOne` transformation is applied to the `input_df` DataFrame using the `transform`
        method. The `output_df` will now contain the original DataFrame with an additional columns called `foo` and
        `bar', each with the values of `id` + 1.
        """
        return self.transform(*args, **kwargs)
Making df an Optional type allows us to either give the df as an argument when initializing the class, or pass it through transform. This is exactly how we do it inside the Spark module at the moment.
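To make the optional-input pattern concrete for pandas, here is a small sketch written without koheesio. The class `UppercaseColumnsSketch` and its `output_df` attribute are hypothetical stand-ins for a Step and its `output.df`. One pandas-specific point: an expression such as `df or self.df` raises `ValueError` for a pandas DataFrame, because its truth value is ambiguous, so the sketch compares against `None` instead.

```python
from typing import Optional

import pandas as pd


class UppercaseColumnsSketch:
    """Hypothetical transformation that upper-cases column names."""

    def __init__(self, df: Optional[pd.DataFrame] = None):
        self.df = df
        self.output_df: Optional[pd.DataFrame] = None

    def execute(self) -> None:
        # read the input from self.df and store the result (stand-in for self.output.df)
        self.output_df = self.df.rename(columns=str.upper)

    def transform(self, df: Optional[pd.DataFrame] = None) -> pd.DataFrame:
        # prefer the DataFrame passed here, otherwise fall back to the constructor's
        self.df = df if df is not None else self.df
        if self.df is None:
            raise RuntimeError("No valid DataFrame was passed")
        self.execute()
        return self.output_df


frame = pd.DataFrame({"color": ["red"]})
via_constructor = UppercaseColumnsSketch(frame).transform()
via_transform = UppercaseColumnsSketch().transform(frame)
```

Both call styles produce the same result; calling `transform()` with no DataFrame at all raises the `RuntimeError`.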
import unittest
import pandas as pd

class TestPandasCategoricalEncoding(unittest.TestCase):
Don't use unittest; use pytest.
- get rid of the unittest.TestCase (just let it be a regular class)
- change your self.assert... (from unittest) to regular Python assert (this is how pytest works)
- get rid of your setUp: just make the input dataframe a module-level variable, OR make it a fixture (a bit overkill for your purpose here)
- change your code to match the interface I proposed above
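As a rough illustration of the pytest style described above, the sketch below uses a regular class, plain `assert` statements, and a module-level input DataFrame. It calls `pd.get_dummies` directly as a stand-in for the step under review, and the class and test names are hypothetical.

```python
import pandas as pd

# module-level input instead of a unittest setUp method
INPUT_DF = pd.DataFrame({"color": ["red", "blue", "green"]})


class TestCategoricalEncodingSketch:
    """A regular class: pytest collects it without unittest.TestCase."""

    def test_one_hot_keeps_all_categories(self):
        result = pd.get_dummies(INPUT_DF, columns=["color"], drop_first=False)
        assert set(result.columns) == {"color_blue", "color_green", "color_red"}

    def test_drop_first_removes_one_column(self):
        result = pd.get_dummies(INPUT_DF, columns=["color"], drop_first=True)
        assert list(result.columns) == ["color_green", "color_red"]
```

Running `pytest` on a file containing this class collects both methods because the class name starts with `Test` and the method names start with `test_`.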
import pandas as pd
from src.koheesio.pandas.categorical_encoding import PandasCategoricalEncoding
import unittest
import pandas as pd
you're importing pandas twice. Also, import pandas through the koheesio module to avoid conflict:
from koheesio.pandas import pandas as pd
from koheesio.steps import Step

from typing import List, Dict
import pandas as pd
import pandas through the koheesio module (as stated above) to avoid conflict:
from koheesio.pandas import pandas as pd
Please run isort (through make fmt or hatch fmt, or run ruff)
Since no response has been provided for several weeks, I am closing this PR. Please open a new contribution request once you feel ready to do so and the concerns have been addressed.
Description
This pull request introduces a new categorical encoding feature to the project. The function supports One-Hot Encoding and Ordinal Encoding for categorical variables, allowing users to efficiently transform categorical data into numerical formats. This addition is designed to enhance data preprocessing capabilities within the framework.
The changes include:
- categorical_encoding.py under the src/koheesio/ directory.
- categorical_encoding function.
- EncodingConfig class for user-configurable options (e.g., encoding type).
- tests/test_categorical_encoding.py.
Related Issue
This pull request addresses the need for robust categorical data transformation functionality as identified in the enhancement discussions. (Link to any related issue or discussion if applicable, or mention "N/A" if there isn't one.)
Motivation and Context
This change is required to handle categorical data during data preprocessing for machine learning or analytics workflows. The new feature provides:
This functionality improves the versatility and usability of the Koheesio framework in real-world scenarios.
How Has This Been Tested?
The implementation was tested using unit tests created in tests/test_categorical_encoding.py. The tests include:
All tests were executed in the local environment using Python's unittest framework, and they passed successfully without affecting other parts of the codebase.
Screenshots (if appropriate):
N/A
Types of Changes
Checklist