[Split] random split method improvement #353

Closed
wants to merge 2 commits into from
5 changes: 3 additions & 2 deletions docs/source/modules/transforms.rst
@@ -18,16 +18,17 @@ Let's look an example, where we apply `CatToNumTransform <https://dl.acm.org/doi
from torch_frame.datasets import Yandex
from torch_frame.transforms import CatToNumTransform
from torch_frame import stype
from torch_frame.typing import TrainingStage

dataset = Yandex(root='/tmp/adult', name='adult')
dataset.materialize()
transform = CatToNumTransform()
train_dataset = dataset.get_split('train')
train_dataset = dataset.get_split(TrainingStage.TRAIN)

train_dataset.tensor_frame.col_names_dict[stype.categorical]
>>> ['C_feature_0', 'C_feature_1', 'C_feature_2', 'C_feature_3', 'C_feature_4', 'C_feature_5', 'C_feature_6', 'C_feature_7']

test_dataset = dataset.get_split('test')
test_dataset = dataset.get_split(TrainingStage.TEST)
transform.fit(train_dataset.tensor_frame, dataset.col_stats)

transformed_col_stats = transform.transformed_stats
3 changes: 2 additions & 1 deletion test/transforms/test_mutual_information_sort.py
@@ -5,6 +5,7 @@
from torch_frame.data import Dataset
from torch_frame.datasets.fake import FakeDataset
from torch_frame.transforms import MutualInformationSort
from torch_frame.typing import TrainingStage


@pytest.mark.parametrize('with_nan', [True, False])
@@ -19,7 +20,7 @@ def test_mutual_information_sort(with_nan):
dataset.materialize()

tensor_frame: TensorFrame = dataset.tensor_frame
train_dataset = dataset.get_split('train')
train_dataset = dataset.get_split(TrainingStage.TRAIN)
transform = MutualInformationSort(task_type)
transform.fit(train_dataset.tensor_frame, train_dataset.col_stats)
out = transform(tensor_frame)
27 changes: 21 additions & 6 deletions test/utils/test_split.py
@@ -1,5 +1,7 @@
import numpy as np

from torch_frame.datasets import FakeDataset
from torch_frame.typing import TrainingStage
from torch_frame.utils.split import SPLIT_TO_NUM, generate_random_split


@@ -9,13 +11,26 @@ def test_generate_random_split():
val_ratio = 0.1
test_ratio = 0.1

split = generate_random_split(num_data, seed=42, train_ratio=train_ratio,
val_ratio=val_ratio)
assert (split == SPLIT_TO_NUM['train']).sum() == int(num_data *
train_ratio)
assert (split == SPLIT_TO_NUM['val']).sum() == int(num_data * val_ratio)
assert (split == SPLIT_TO_NUM['test']).sum() == int(num_data * test_ratio)
split = generate_random_split(num_data, seed=42,
ratios=[train_ratio, val_ratio])
Comment on lines +14 to +15
Contributor

Why do we want this to be a list? I think the previous arguments are clearer.

Contributor Author

This format is more aligned with the `random_split` function in torch: https://pytorch.org/docs/stable/_modules/torch/utils/data/dataset.html#random_split. Also, to support a split with only train and validation data, the new format lets us do split(ratios=[0.8], ...), while with the old format we could only do split(train_ratio=0.8, val_ratio=None, ...). The latter is a bit weird because it indicates a train/test split instead of the more common train/val split.
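
To make the `ratios` semantics in this comment concrete, here is a minimal sketch of how the per-split row counts come out under the list format. `split_counts` is a hypothetical helper written for illustration, not a function in this PR; it mirrors the `math.floor` arithmetic visible in the `torch_frame/utils/split.py` diff, where the last split receives the remainder.

```python
import math
from typing import List


def split_counts(length: int, ratios: List[float]) -> List[int]:
    """Hypothetical helper: row counts per split for a ratios list.

    One ratio  -> [train, val]        (val takes the remainder)
    Two ratios -> [train, val, test]  (test takes the remainder)
    """
    if not 1 <= len(ratios) <= 2:
        raise ValueError("expected one or two ratios")
    # Each listed ratio is floored to a whole number of rows.
    counts = [math.floor(length * r) for r in ratios]
    # Whatever is left over goes to the final, unlisted split.
    counts.append(length - sum(counts))
    return counts


print(split_counts(20, [0.8]))       # train/val only
print(split_counts(20, [0.8, 0.1]))  # train/val/test
```

With a single ratio the result has two entries, so no empty third split is ever created.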

Contributor

Yeah but here we are actually assigning TRAIN to the first ratio, VAL to the second ratio, and so on. So I still think the previous one is better.

Contributor

torch's random split makes sense since it just splits the dataset. There is no train/val/test assignment happening under the hood.

Contributor Author

The previous format always assumes we want a test set. So

  1. it does not have the flexibility to do something like (train: 0.3, val: 0.1, test: 0.2);
  2. with train: 0.8, val: 0.0, the inferred test ratio is 0.2. That is a bit weird because in this case only a train/val (or train/test) split is well-defined, i.e. we only intend to split the data into 2 parts instead of 3, but we end up with an additional validation set that is completely empty.

If you have a strong opinion here, we could also keep the previous format but support a val ratio of None, which would indicate that there is no validation set rather than an empty one.
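
The first point can be checked with a small sketch of the previous arithmetic. `old_style_counts` is a hypothetical name; the body copies the removed lines and assertions shown in the `torch_frame/utils/split.py` diff. Because the test share is always the remainder, asking for train 0.3 and val 0.1 yields a test share of 0.6, not 0.2. Note also that the old `assert val_ratio > 0` rejects `val_ratio=0.0` outright.

```python
def old_style_counts(length: int, train_ratio: float = 0.8,
                     val_ratio: float = 0.1) -> dict:
    """Hypothetical wrapper around the pre-PR counting logic."""
    assert train_ratio + val_ratio < 1
    assert train_ratio > 0
    assert val_ratio > 0
    train_num = int(length * train_ratio)
    val_num = int(length * val_ratio)
    # The test split is never requested explicitly: it is what remains.
    test_num = length - train_num - val_num
    return {"train": train_num, "val": val_num, "test": test_num}


counts = old_style_counts(10, train_ratio=0.3, val_ratio=0.1)
print(counts)
```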

Contributor

I understand, but I still want features to be driven by needs.
I don't find this logic useful for now, and having a new interface would break other parts. For instance, this PR does not modify generate_random_split used in data_frame_benchmark.py and elsewhere.

Contributor Author

@XinweiHe, Feb 27, 2024

> For instance, this PR does not modify generate_random_split used in data_frame_benchmark.py and elsewhere

This is still a draft now. I kind of want to sync with you on this first before changing other parts of the code to make them compatible.

assert (split == SPLIT_TO_NUM.get(TrainingStage.TRAIN)).sum() == int(
num_data * train_ratio)
assert (split == SPLIT_TO_NUM.get(TrainingStage.VAL)).sum() == int(
num_data * val_ratio)
assert (split == SPLIT_TO_NUM.get(TrainingStage.TEST)).sum() == int(
num_data * test_ratio)
assert np.allclose(
split,
np.array([0, 1, 0, 0, 0, 0, 0, 0, 2, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0]),
)


def test_split_e2e_basic():
# TODO: Add several more test cases using @pytest.mark.parametrize
num_rows = 10
dataset = FakeDataset(num_rows=num_rows).materialize()
dataset.random_split([0.5, 0.2])
train_set, val_set, test_set = dataset.split()
train_set.num_rows, val_set.num_rows == (int(10 * 0.5), int(10 * 0.2))
if test_set is not None:
test_set.num_rows == 10 - int(10 * 0.5) - int(10 * 0.2)
30 changes: 18 additions & 12 deletions torch_frame/data/dataset.py
@@ -33,8 +33,9 @@
IndexSelectType,
TaskType,
TensorData,
TrainingStage,
)
from torch_frame.utils.split import SPLIT_TO_NUM
from torch_frame.utils.split import SPLIT_TO_NUM, generate_random_split

COL_TO_PATTERN_STYPE_MAPPING = {
"col_to_sep": torch_frame.multicategorical,
@@ -695,7 +696,7 @@ def col_select(self, cols: ColumnSelectType) -> Dataset:

return dataset

def get_split(self, split: str) -> Dataset:
def get_split(self, split: TrainingStage) -> Dataset:
Contributor

I think it's nicer to accept str as input here. It's sometimes troublesome to import TrainingStage.

Contributor Author

Maybe we can support both, i.e. `class TrainingStage(str, Enum)`? I feel like TrainingStage has the advantage that mistyped values, e.g. 'Train' vs 'train', cannot cause unexpected behavior. I think this is also the reason we support lots of Enum classes in the codebase.

Contributor

Yes, we can support both.

Contributor

Union[str, TrainingStage]
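
A minimal sketch of what "support both" could look like, assuming the `str`-mixin enum suggested in this thread. `split_to_num` is a hypothetical helper, not part of the PR. The input is normalized through the enum constructor before the dictionary lookup, so a mistyped value such as 'Train' fails loudly with a `ValueError`.

```python
from enum import Enum
from typing import Union


class TrainingStage(str, Enum):
    TRAIN = 'train'
    VAL = 'val'
    TEST = 'test'


SPLIT_TO_NUM = {
    TrainingStage.TRAIN: 0,
    TrainingStage.VAL: 1,
    TrainingStage.TEST: 2,
}


def split_to_num(split: Union[str, TrainingStage]) -> int:
    """Hypothetical helper: accept 'train' or TrainingStage.TRAIN."""
    # TrainingStage('train') returns the TRAIN member; passing a member
    # returns it unchanged; an unknown value raises ValueError.
    return SPLIT_TO_NUM[TrainingStage(split)]


print(split_to_num('train'), split_to_num(TrainingStage.VAL))
```

Normalizing first, rather than indexing `SPLIT_TO_NUM` with the raw string, keeps the lookup independent of how enum members are hashed.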

r"""Returns a subset of the dataset that belongs to a given training
split (as defined in :obj:`split_col`).

@@ -707,20 +708,25 @@ def get_split(self, split: str) -> Dataset:
raise ValueError(
f"'get_split' is not supported for '{self}' since 'split_col' "
f"is not specified.")
if split not in ["train", "val", "test"]:
raise ValueError(f"The split named '{split}' is not available. "
f"Needs to be either 'train', 'val', or 'test'.")
indices = self.df.index[self.df[self.split_col] ==
SPLIT_TO_NUM[split]].tolist()
return self[indices]

def split(self) -> tuple[Dataset, Dataset, Dataset]:
r"""Splits the dataset into training, validation and test splits."""
return (
self.get_split("train"),
self.get_split("val"),
self.get_split("test"),
)
def split(self) -> tuple[Dataset, Dataset, Dataset | None]:
r"""Splits the dataset into training, validation and optionally
test splits.
"""
train_set = self.get_split(TrainingStage.TRAIN)
val_set = self.get_split(TrainingStage.VAL)
test_set = self.get_split(TrainingStage.TEST)
if test_set.num_rows == 0:
test_set = None
return train_set, val_set, test_set

def random_split(self, ratios: list[float] | None = None):
split = generate_random_split(self.num_rows, ratios)
self.split_col = 'split'
self.df[self.split_col] = split
Comment on lines +726 to +729
Contributor

I think it'd be nice to supply an argument to control the random seed.

Contributor

Also, add doc-string so that people know how to use it.
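
The two requests above (a seed argument and a docstring) could be addressed roughly as follows. This is only a sketch on a hypothetical stand-in class, `TinyDataset`, using the standard library instead of the PR's `generate_random_split`; the default ratios of `[0.8, 0.1]` are an assumption borrowed from the old `train_ratio`/`val_ratio` defaults.

```python
import random
from typing import List, Optional


class TinyDataset:
    """Hypothetical stand-in that only stores a per-row split id."""

    def __init__(self, num_rows: int):
        self.num_rows = num_rows
        self.split: Optional[List[int]] = None

    def random_split(self, ratios: Optional[List[float]] = None,
                     seed: int = 0) -> None:
        r"""Randomly assigns every row to a split.

        Args:
            ratios: Fractions for all splits except the last one, which
                receives the remaining rows. Assumed default: [0.8, 0.1].
            seed: Seed for the shuffle, so the assignment is reproducible.
        """
        if ratios is None:
            ratios = [0.8, 0.1]  # assumption, not taken from the PR
        counts = [int(self.num_rows * r) for r in ratios]
        counts.append(self.num_rows - sum(counts))
        # Split id i is repeated counts[i] times, then shuffled.
        assignment = [i for i, n in enumerate(counts) for _ in range(n)]
        random.Random(seed).shuffle(assignment)
        self.split = assignment


ds = TinyDataset(20)
ds.random_split([0.5, 0.25], seed=42)
print(ds.split)
```

Calling `random_split` twice with the same seed gives the same assignment, which is the property the reviewer is asking for.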


@property
@requires_post_materialization
8 changes: 4 additions & 4 deletions torch_frame/datasets/fake.py
@@ -11,7 +11,7 @@
from torch_frame import stype
from torch_frame.config.text_embedder import TextEmbedderConfig
from torch_frame.config.text_tokenizer import TextTokenizerConfig
from torch_frame.typing import TaskType
from torch_frame.typing import TaskType, TrainingStage
from torch_frame.utils.split import SPLIT_TO_NUM

TIME_FORMATS = ['%Y-%m-%d %H:%M:%S', '%Y-%m-%d', '%Y/%m/%d']
@@ -189,9 +189,9 @@ def __init__(
if num_rows < 3:
raise ValueError("Dataframe needs at least 3 rows to include"
" each of train, val and test split.")
split = [SPLIT_TO_NUM['train']] * num_rows
split[1] = SPLIT_TO_NUM['val']
split[2] = SPLIT_TO_NUM['test']
split = [SPLIT_TO_NUM.get(TrainingStage.TRAIN)] * num_rows
split[1] = SPLIT_TO_NUM.get(TrainingStage.VAL)
split[2] = SPLIT_TO_NUM.get(TrainingStage.TEST)
df['split'] = split

super().__init__(
7 changes: 4 additions & 3 deletions torch_frame/datasets/huggingface_dataset.py
@@ -4,6 +4,7 @@

import torch_frame
from torch_frame import stype
from torch_frame.typing import TrainingStage
from torch_frame.utils.infer_stype import infer_df_stype
from torch_frame.utils.split import SPLIT_TO_NUM

@@ -105,13 +106,13 @@ def __init__(

# Transform HF dataset split to `SPLIT_TO_NUM` accepted one:
if "train" in split_name:
split_names.append("train")
split_names.append(TrainingStage.TRAIN)
elif "val" in split_name:
# Some datasets have val split name as `"validation"`,
# here we transform it to `"val"`:
split_names.append("val")
split_names.append(TrainingStage.VAL)
elif "test" in split_name:
split_names.append("test")
split_names.append(TrainingStage.TEST)
else:
raise ValueError(f"Invalid split name: '{split_name}'. "
f"Expected one of the following PyTorch "
5 changes: 3 additions & 2 deletions torch_frame/datasets/mercari.py
@@ -6,6 +6,7 @@

import torch_frame
from torch_frame.config.text_embedder import TextEmbedderConfig
from torch_frame.typing import TrainingStage
from torch_frame.utils.split import SPLIT_TO_NUM

SPLIT_COL = 'split_col'
@@ -64,8 +65,8 @@ def __init__(
test_path = osp.join(self.base_url, 'test_stg2.csv')
self.download_url(test_path, root)
df_test = pd.read_csv(test_path)
df_train[SPLIT_COL] = SPLIT_TO_NUM['train']
df_test[SPLIT_COL] = SPLIT_TO_NUM['test']
df_train[SPLIT_COL] = SPLIT_TO_NUM[TrainingStage.TRAIN]
df_test[SPLIT_COL] = SPLIT_TO_NUM[TrainingStage.TEST]
df = pd.concat([df_train, df_test], axis=0, ignore_index=True)
if num_rows is not None:
df = df.head(num_rows)
6 changes: 6 additions & 0 deletions torch_frame/typing.py
@@ -28,6 +28,12 @@ def supports_task_type(self, task_type: 'TaskType') -> bool:
return self in task_type.supported_metrics


class TrainingStage(Enum):
Contributor

Brief doc-string.

TRAIN = 'train'
VAL = 'val'
TEST = 'test'


class TaskType(Enum):
r"""The type of the task.

79 changes: 67 additions & 12 deletions torch_frame/utils/split.py
@@ -1,30 +1,85 @@
import math

import numpy as np

from typing import List
from torch_frame.typing import TrainingStage

# Mapping split name to integer.
SPLIT_TO_NUM = {'train': 0, 'val': 1, 'test': 2}
SPLIT_TO_NUM = {
TrainingStage.TRAIN: 0,
TrainingStage.VAL: 1,
TrainingStage.TEST: 2
}


def generate_random_split(length: int, seed: int, train_ratio: float = 0.8,
val_ratio: float = 0.1) -> np.ndarray:
def generate_random_split(
length: int,
ratios: List[float],
seed: int = 0,
) -> np.ndarray:
r"""Generate a list of random split assignments of the specified length.
The elements are either :obj:`0`, :obj:`1`, or :obj:`2`, representing
train, val, test, respectively. Note that this function relies on the fact
that numpy's shuffle is consistent across versions, which has been
historically the case.

Args:
length (int): The length of the dataset.
ratios (List[float]): Ratios for split assignment. When ratios
contains 2 variables, we will generate train/val/test set
respectively based on the split ratios (the 1st variable in
the list will be the ratio for train set, the 2nd will be
the ratio for val set and the remaining data will be used
for test set). When ratios contains 1 variable, we will only
generate train/val set. (the variable in)
seed (int, optional): The seed for the randomness generator.

Returns:
A np.ndarra object representing the split.
Contributor

np.ndarray

"""
assert train_ratio + val_ratio < 1
assert train_ratio > 0
assert val_ratio > 0
train_num = int(length * train_ratio)
val_num = int(length * val_ratio)
test_num = length - train_num - val_num
validate_split_ratios(ratios)
ratios_length = len(ratios)
if length < ratios_length + 1:
raise ValueError(
f"We want to split data into {ratios_length + 1} disjoint set. "
f"However data contains {length} data point. Consider "
f"increase your data size.")

# train_num = int(length * train_ratio)
# val_num = int(length * val_ratio)
# test_num = length - train_num - val_num
train_num = math.floor(length * ratios[0])
val_num = math.floor(
length * ratios[1]) if ratios_length == 2 else length - train_num
test_num = None
if ratios_length == 2:
test_num = length - train_num - val_num

arr = np.concatenate([
np.full(train_num, SPLIT_TO_NUM['train']),
np.full(val_num, SPLIT_TO_NUM['val']),
np.full(test_num, SPLIT_TO_NUM['test'])
np.full(train_num, SPLIT_TO_NUM.get(TrainingStage.TRAIN)),
np.full(val_num, SPLIT_TO_NUM.get(TrainingStage.VAL)),
])

if ratios_length == 2:
arr = np.concatenate(
[arr, np.full(test_num, SPLIT_TO_NUM.get(TrainingStage.TEST))])

np.random.seed(seed)
np.random.shuffle(arr)

return arr


def validate_split_ratios(ratio: List[float]):
if len(ratio) > 2:
raise ValueError("No more than three training splits is supported")
if len(ratio) < 1:
raise ValueError("At least two training splits are required")

for val in ratio:
if val < 0:
raise ValueError("'ratio' can not contain negative values")

if sum(ratio) - 1 > 1e-2:
raise ValueError("'ratio' exceeds more than 100% of the data")
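
One observation on the final check in `validate_split_ratios`: `sum(ratio) - 1 > 1e-2` lets ratio sums of up to about 1.01 through, in which case the remainder assigned to the last split can come out negative for a large enough `length`. The sketch below shows one possible tightening. It is an illustration under assumptions, not the PR's code: the name `validate_split_ratios_strict` and the `tol` parameter are hypothetical.

```python
import math
from typing import List


def validate_split_ratios_strict(ratios: List[float],
                                 tol: float = 1e-9) -> None:
    """Hypothetical stricter variant of the ratio validation."""
    if not 1 <= len(ratios) <= 2:
        raise ValueError("'ratios' must contain one or two values")
    if any(r < 0 for r in ratios):
        raise ValueError("'ratios' can not contain negative values")
    # fsum avoids accumulating rounding error; tol only absorbs
    # floating-point noise, not a genuinely oversized sum.
    if math.fsum(ratios) > 1 + tol:
        raise ValueError("'ratios' sum to more than 100% of the data")


validate_split_ratios_strict([0.8, 0.1])  # passes silently
```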