Mathematics has long been conveyed through natural language, primarily for human understanding. With the rise of mechanized mathematics and proof assistants, there's a growing need to translate informal mathematical text into formal languages. However, most existing benchmarks focus solely on English, overlooking other languages. This paper introduces RoMath, a Romanian mathematical reasoning benchmark suite comprising three datasets: RoMath-Synthetic, RoMath-Baccalaureate, and RoMath-Competitions. These datasets cover a range of mathematical domains and difficulty levels, aiming to improve non-English language models and promote multilingual AI development. By focusing on Romanian, a low-resource language with unique linguistic features, RoMath addresses the limitations of Anglo-centric models and emphasizes the need for dedicated resources beyond simple automatic translation. We benchmark several language models, highlighting the importance of creating resources for underrepresented languages.
Loading the data from 🤗 Huggingface Datasets:
import datasets
subset = 'bac' # could be comps or synthetic
train_dataset = datasets.load_dataset('cosmadrian/romath', subset, split = 'train')
test_dataset = datasets.load_dataset('cosmadrian/romath', subset, split = 'test')
# Do your thing ...
While a pre-generated split for RoMath-Synthetic is provided for convenience on 🤗 HuggingFace, you can generate your own problems using the original DeepMind code with key phrases translated.
See romath-synthetic/ directory for instructions.
Experiments for the paper are organized in the in the experiments/
directory, with separate scripts for each experiment in the paper. We used SLURM on a private cluster to train, make predictions and evaluate models. Use ./do_sbatch.sh <script.sh> <n_gpus>
to run a particular bash script. Modify the ./do_sbatch.sh
file to suit your needs.
To run a particular model on a dataset use the following commands:
# Optional LoRA-Fine-tuning
python fine_tune.py --model <hf_model_name> --dataset [bac|comps|synthetic] --output checkpoints/
# Use a (trained) model to make predictions on a test set.
python predict.py --model <hf_model_name> --dataset [bac|comps|synthetic] --temperature 0.5 --k 3 --shots 5 --output predictions/
# Evaluate the predictions of a model using a judge model.
python evaluate.py --pred_file predictions/Qwen-Qwen2-1.5B-Instruct_bac_2_0.5.csv --judge_model <hf_model_name> --output results/
# Compute the relevant metrics for all evaluated prediction files in a folder.
python evaluate/compute_metrics.py --input_dir results/ --output_dir metrics/
For translation, use the translate.py
python script, alongside the predict_translated.py
script.
For constructing the Judge Dataset (i.e., Table 3), run the evaluate/make_judge_dataset.py
with the appropriate arguments and run evaluate_judge.py
script.
If you found our work useful, please cite our paper:
RoMath: A Mathematical Reasoning Benchmark in 🇷🇴 Romanian 🇷🇴
@misc{cosma2024romath,
title={RoMath: A Mathematical Reasoning Benchmark in Romanian},
author={Adrian Cosma and Ana-Maria Bucur and Emilian Radoi},
year={2024},
eprint={2409.11074},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2409.11074},
}
This work is protected by Attribution-NonCommercial 4.0 International