Generating Balanced Synthesized Data #79

erland-ramadhan · 2024-06-01T17:27:12Z

Hey, is it possible to generate balanced synthesized data even though the REaLTabFormer model is trained on imbalanced data (the class proportion can be as skewed as 4 to 96)? How do I do that?

CTGAN, TVAE, and even be_great are able to do this simply by:
model.sample(n_samples, start_col=target_col, start_col_dist={'Yes':0.5, 'No':0.5})


avsolatorio · 2024-06-10T13:50:49Z

Hello @erland-ramadhan, can you check whether the seed_input parameter of the model.sample method in REaLTabFormer satisfies your need?

By the way, there is a prerequisite to using this: the target_col you want to condition on must be the first column of the table you are synthesizing.

It could look something like below:

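import pandas as pd

# Sample each class separately, then combine the halves into one table.
yes_samples = model.sample(n_samples // 2, seed_input={target_col: "Yes"})
no_samples = model.sample(n_samples // 2, seed_input={target_col: "No"})

samples = pd.concat([yes_samples, no_samples], ignore_index=True)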
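
For a target distribution other than 50/50, or for more than two classes, the same idea can be wrapped in a small helper. The following is only a sketch, assuming model.sample accepts n_samples and seed_input as in the snippet above. The names target_first, class_counts, and sample_with_distribution are hypothetical helpers, not part of REaLTabFormer.

```python
def target_first(columns, target_col):
    """Return the column names with target_col moved to the front."""
    return [target_col] + [c for c in columns if c != target_col]


def class_counts(n_samples, dist):
    """Split n_samples across labels in proportion to the weights in dist."""
    total = sum(dist.values())
    if total <= 0:
        raise ValueError("dist must contain positive weights")
    exact = {label: n_samples * w / total for label, w in dist.items()}
    counts = {label: int(x) for label, x in exact.items()}
    # Hand out the rows lost to rounding, largest fractional part first.
    leftover = n_samples - sum(counts.values())
    by_remainder = sorted(dist, key=lambda l: exact[l] - counts[l], reverse=True)
    for label in by_remainder[:leftover]:
        counts[label] += 1
    return counts


def sample_with_distribution(model, n_samples, target_col, dist):
    """Sample each class via seed_input and concatenate the results.

    Assumes model.sample(n, seed_input={...}) behaves as in the reply above.
    """
    import pandas as pd  # imported here so the helpers above stay dependency-free

    frames = [
        model.sample(count, seed_input={target_col: label})
        for label, count in class_counts(n_samples, dist).items()
        if count > 0
    ]
    return pd.concat(frames, ignore_index=True)
```

Before training, the prerequisite above could be met with something like df = df[target_first(df.columns, target_col)]. After that, sample_with_distribution(model, n_samples, target_col, {"Yes": 0.5, "No": 0.5}) would request the balanced split from the original question.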
