Authors: Huaye Zeng, Dongfu Jiang, HaoZhe Wang, Ping Nie, Xiaotong Chen, Wenhu Chen @ TIGER-Lab
- [2025/2/3] We release the AceCoder Paper, along with the 🤗 Models and Datasets on Hugging Face.
Abstract
- We introduce AceCoder, the first work to propose a fully automated pipeline for synthesizing large-scale reliable tests for reward model training and reinforcement learning in the coding scenario. To do this, we curate the dataset AceCode-89K: starting from a seed code dataset, we prompt powerful LLMs to "imagine" proper test cases for each coding question and then filter out the noisy ones (a minimal sketch of this filtering step follows the abstract).
- We train two reward models, AceCodeRM-7B and AceCodeRM-32B, on the constructed preference pairs. Best-of-N sampling results on HumanEval(+), MBPP(+), BigCodeBench, and LiveCodeBench (V4) show consistent improvements.
- We perform RL training from three policy models: Qwen2.5-7B-Instruct, Qwen2.5-Coder-7B-Base, and Qwen2.5-Coder-7B-Instruct. Two types of reward can be used: the trained reward model AceCodeRM-7B, or a rule-based reward, i.e. the binary pass rate over the test cases in the dataset (sketched after the model table below). Additionally, we experiment with RL directly from the base model, in the style of DeepSeek-R1. Results show that RL directly from the base Qwen2.5-Coder model yields a 25% improvement on HumanEval-plus and 6% on MBPP-plus within just 80 optimization steps.
- To our knowledge, this is the first work to propose a fully automated pipeline for synthesizing large-scale reliable tests for reward model training and reinforcement learning in the coding scenario. We believe AceCode-89K will unlock the potential of RL training for code generation models and help the community further push the boundaries of LLMs' coding abilities.
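
The test-synthesis and filtering step described above can be pictured with the minimal sketch below. The helpers `generate_tests` and `generate_solutions` are hypothetical stand-ins for LLM calls (e.g., to GPT-4o-mini); the exact prompts, number of sampled solutions, and filtering rules used to build AceCode-89K are described in the paper.

```python
from typing import Callable, List

def test_passes(program: str, test: str) -> bool:
    """Run one assert-style test against a candidate program.
    A real pipeline would sandbox this and enforce a per-test timeout."""
    env: dict = {}
    try:
        exec(program, env)   # define the candidate solution
        exec(test, env)      # e.g. "assert add(1, 2) == 3"
        return True
    except Exception:
        return False

def synthesize_and_filter(
    question: str,
    generate_tests: Callable[[str], List[str]],           # LLM "imagines" assert statements
    generate_solutions: Callable[[str, int], List[str]],  # LLM writes candidate programs
    n_solutions: int = 4,
    min_pass: int = 1,
) -> List[str]:
    """Keep only the synthesized tests that at least `min_pass` candidate
    solutions satisfy, discarding noisy or unsatisfiable tests."""
    tests = generate_tests(question)
    programs = generate_solutions(question, n_solutions)
    return [t for t in tests if sum(test_passes(p, t) for p in programs) >= min_pass]
```

The same pass/fail signal can then be reused downstream, both to build preference pairs for reward-model training and to compute rule-based rewards during RL.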
- AceCode-89K: The first large-scale coding dataset with an average of 16 test cases per prompt, synthesized by GPT-4o-mini
- AceCodePair-300K: Preference pairs constructed from AceCode-89K for training the reward models
- AceCode-89K-hard: A harder subset of AceCode-89K; you can sample the hardest 25% of examples via this script
- AceCodeRM-7B: A reward model trained on AceCodePair-300K from Qwen2.5-Coder-7B-Instruct
- AceCodeRM-32B: A reward model trained on AceCodePair-300K from Qwen2.5-Coder-32B-Instruct
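
The reward models above are evaluated with best-of-N sampling. The sketch below shows only the reranking logic; `generate_candidates` and `rm_score` are hypothetical stand-ins for sampling programs from a policy model and scoring a (question, program) pair with AceCodeRM-7B/32B (see the model cards for the exact loading and scoring code).

```python
from typing import Callable, List

def best_of_n(
    question: str,
    generate_candidates: Callable[[str, int], List[str]],  # sample N programs from the policy
    rm_score: Callable[[str, str], float],                  # reward model score for (question, program)
    n: int = 16,
) -> str:
    """Sample n candidate programs and return the one the reward model ranks highest."""
    candidates = generate_candidates(question, n)
    return max(candidates, key=lambda program: rm_score(question, program))
```

Ties are broken arbitrarily by `max`; in practice the reward-model forward passes would be batched for efficiency.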
| Initial Policy Model | Reward Type | Training Dataset | Final RL Model |
|---|---|---|---|
| Qwen2.5-7B-Instruct | AceCodeRM-7B | AceCode-89K-hard (22k) | TIGER-Lab/AceCoder-Qwen2.5-7B-Ins-RM |
| Qwen2.5-7B-Instruct | Rule | AceCode-89K-hard (22k) | TIGER-Lab/AceCoder-Qwen2.5-7B-Ins-Rule |
| Qwen2.5-Coder-7B-Instruct | AceCodeRM-7B | AceCode-89K-hard (22k) | TIGER-Lab/AceCoder-Qwen2.5-Coder-7B-Ins-RM |
| Qwen2.5-Coder-7B-Instruct | Rule | AceCode-89K-hard (22k) | TIGER-Lab/AceCoder-Qwen2.5-Coder-7B-Ins-Rule |
| Qwen2.5-Coder-7B | AceCodeRM-7B | AceCode-89K-hard (22k) | TIGER-Lab/AceCoder-Qwen2.5-Coder-7B-Base-RM |
| Qwen2.5-Coder-7B | Rule | AceCode-89K-hard (22k) | TIGER-Lab/AceCoder-Qwen2.5-Coder-7B-Base-Rule |
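
For the rows marked "Rule", the reward comes from executing the policy's program against the test cases stored with each prompt in AceCode-89K-hard. The sketch below illustrates the idea only: execution is unsandboxed here, and the exact rule (pass-all versus a thresholded pass rate) follows the paper.

```python
from typing import List

def test_passes(program: str, test: str) -> bool:
    """Run one assert-style test; a real RL loop would sandbox this with a timeout."""
    env: dict = {}
    try:
        exec(program, env)   # define the policy's solution
        exec(test, env)      # run one stored assert
        return True
    except Exception:
        return False

def rule_based_reward(program: str, tests: List[str], threshold: float = 1.0) -> float:
    """Binary reward: 1.0 if the pass rate over the prompt's tests reaches `threshold`, else 0.0."""
    if not tests:
        return 0.0
    pass_rate = sum(test_passes(program, t) for t in tests) / len(tests)
    return 1.0 if pass_rate >= threshold else 0.0
```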
See our website or paper for detailed performance report.
```bash
git submodule init
git submodule update
```
(TODO)
See train/train_rl/README.md for detailed instructions.
(TODO)
If you find this work helpful, please consider citing:
@article{AceCoder,
title={AceCoder: Acing Coder RL via Automated Test-Case Synthesis},
author={Zeng, Huaye and Jiang, Dongfu and Wang, Haozhe and Nie, Ping and Chen, Xiaotong and Chen, Wenhu},
journal={ArXiv},
year={2025},
volume={abs/2502.01718}
}