Generating Balanced Synthesized Data #79

Open
erland-ramadhan opened this issue Jun 1, 2024 · 1 comment

Comments

erland-ramadhan commented Jun 1, 2024

Hey, is it possible to generate balanced synthetic data even though the REaLTabFormer model was trained on imbalanced data (the class ratio is as skewed as 4 to 96)? How do I do that?

CTGAN, TVAE, and even be_great are able to do this simply by:
model.sample(n_samples, start_col=target_col, start_col_dist={'Yes':0.5, 'No':0.5})

Member

avsolatorio commented Jun 10, 2024

Hello @erland-ramadhan, can you check whether the seed_input parameter of REaLTabFormer's model.sample method satisfies your needs?

By the way, there is a prerequisite to using this: the target_col you want to condition on must be the first column of the table you are synthesizing.
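For the prerequisite above, a minimal sketch of moving the target column to the front of the training table before fitting, using plain pandas. The helper name move_column_first and the toy columns (age, income, churn) are hypothetical and only there for illustration; the reordered frame is what you would then train the model on.

```python
import pandas as pd


def move_column_first(df: pd.DataFrame, target_col: str) -> pd.DataFrame:
    """Return a copy of df with target_col moved to the first position."""
    if target_col not in df.columns:
        raise KeyError(f"{target_col!r} is not a column of the DataFrame")
    other_cols = [c for c in df.columns if c != target_col]
    return df[[target_col] + other_cols].copy()


# Toy data for illustration only (hypothetical columns).
train_df = pd.DataFrame(
    {"age": [31, 45, 27], "income": [40, 85, 32], "churn": ["No", "Yes", "No"]}
)

# After this call the target is the first column, as seed_input conditioning needs.
train_df = move_column_first(train_df, "churn")
print(list(train_df.columns))
```

The row order and values are untouched; only the column order changes.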

The sampling calls could look something like this:

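import pandas as pd

# Sample each class separately, seeding the first column with the desired label.
yes_samples = model.sample(n_samples // 2, seed_input={target_col: "Yes"})
no_samples = model.sample(n_samples - n_samples // 2, seed_input={target_col: "No"})

# Stack both parts into one balanced table.
samples = pd.concat([yes_samples, no_samples], ignore_index=True)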
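The same idea can be generalized to an arbitrary target distribution, similar in spirit to the start_col_dist argument mentioned in the question. The following is only a sketch under assumptions: class_counts and sample_with_dist are hypothetical helpers, not part of REaLTabFormer, and fake_sample is a stand-in so the snippet runs without a trained model. With a real model you would pass something like lambda n, seed: model.sample(n_samples=n, seed_input=seed) as sample_fn.

```python
import math
from typing import Any, Callable, Dict

import pandas as pd


def class_counts(n_samples: int, dist: Dict[str, float]) -> Dict[str, int]:
    """Split n_samples across labels in proportion to the weights in dist."""
    total = sum(dist.values())
    if total <= 0:
        raise ValueError("dist must contain positive weights")
    exact = {label: n_samples * w / total for label, w in dist.items()}
    counts = {label: math.floor(x) for label, x in exact.items()}
    # Hand out rows lost to rounding down, largest fractional part first.
    remainder = n_samples - sum(counts.values())
    by_fraction = sorted(exact, key=lambda k: exact[k] - counts[k], reverse=True)
    for label in by_fraction[:remainder]:
        counts[label] += 1
    return counts


def sample_with_dist(
    sample_fn: Callable[[int, Dict[str, Any]], pd.DataFrame],
    target_col: str,
    n_samples: int,
    dist: Dict[str, float],
) -> pd.DataFrame:
    """Call sample_fn once per label and concatenate the results."""
    parts = [
        sample_fn(count, {target_col: label})
        for label, count in class_counts(n_samples, dist).items()
        if count > 0
    ]
    return pd.concat(parts, ignore_index=True)


def fake_sample(n: int, seed: Dict[str, Any]) -> pd.DataFrame:
    """Stand-in for a trained model: echoes the seed value in every row."""
    data = {col: [value] * n for col, value in seed.items()}
    data["feature"] = list(range(n))
    return pd.DataFrame(data)


balanced = sample_with_dist(fake_sample, "churn", 101, {"Yes": 0.5, "No": 0.5})
print(balanced["churn"].value_counts())
```

The rows come out grouped by label, so shuffle the result (for example with balanced.sample(frac=1)) if the downstream code is order-sensitive.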
