Code-to-Code Translation with CodeT5+

This repository contains the code and evaluation results for our project on code-to-code translation using the CodeT5+ model[1]. Our primary focus is evaluating the CodeT5+ 220M model on code-to-code translation, using a dataset[2] of equivalent code snippets in C# and Java.

Overview

Despite initial plans to train and evaluate our own transformer architecture, we concentrated our efforts on thoroughly evaluating the pretrained CodeT5+ 220M model. This model had not been previously evaluated in the domain of code-to-code translation and was not trained on our specific dataset.

We evaluated three different versions of the CodeT5+ model:

  1. The unmodified CodeT5+ model.
  2. The CodeT5+ model trained from scratch with our dataset.
  3. The CodeT5+ model fine-tuned on our dataset.

```mermaid
flowchart LR
    C1(CodeT5+ 220M untrained) --> Training --> E1(CodeT5+ 220M self-trained)
    C1 -..-> E1
    C2(CodeT5+ 220M pre-trained) --> Training --> E2(CodeT5+ 220M fine-tuned)
    C2 -..-> E2
    C2 -..-> E3(CodeT5+ 220M pre-trained)
    subgraph Training
        A1[(Train)]
        A3[(Validation)]
    end
    subgraph Evaluation
        direction LR
        Q2([CodeBLEU]) & Q1([BLEU]) -.-> A2[(Test)]
    end
    E1 & E2 & E3 --> Evaluation --> Z[Scores]
```
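To make the three variants concrete, the following is a minimal sketch of how they can be instantiated with the Hugging Face transformers library. It is illustrative rather than the repository's exact code; the fine-tuned checkpoint path is a placeholder.

```python
# Minimal sketch (not the repository's exact scripts): instantiating the three
# CodeT5+ 220M variants compared above.
import torch
from transformers import AutoTokenizer, T5Config, T5ForConditionalGeneration

checkpoint = "Salesforce/codet5p-220m"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# 1. Unmodified pre-trained model, evaluated as-is.
pretrained = T5ForConditionalGeneration.from_pretrained(checkpoint)

# 2. Same architecture with freshly initialized weights (the "self-trained"
#    variant, before training on the C#/Java pairs).
config = T5Config.from_pretrained(checkpoint)
from_scratch = T5ForConditionalGeneration(config)

# 3. Fine-tuned variant: the pre-trained checkpoint after further training on
#    the translation pairs ("./codet5p-220m-finetuned" is a placeholder path).
fine_tuned = T5ForConditionalGeneration.from_pretrained("./codet5p-220m-finetuned")

# Translating a Java snippet to C# with greedy decoding.
java_src = "public int add(int a, int b) { return a + b; }"
inputs = tokenizer(java_src, return_tensors="pt")
with torch.no_grad():
    out = fine_tuned.generate(**inputs, max_length=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```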

Among these, the fine-tuned version achieved the best results. We also compared these results with other models[2] evaluated on the same dataset; the fine-tuned CodeT5+ model outperformed all of them despite being the smallest at 220M parameters.
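The fine-tuning itself follows a standard sequence-to-sequence recipe. The sketch below is a hedged reconstruction, assuming the CodeXGLUE Java/C# dataset as published on the Hugging Face Hub (`code_x_glue_cc_code_to_code_trans`) and illustrative hyperparameters rather than the project's actual settings:

```python
# Hedged sketch of a fine-tuning setup, not the repository's exact configuration.
from datasets import load_dataset
from transformers import (AutoTokenizer, DataCollatorForSeq2Seq,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments,
                          T5ForConditionalGeneration)

checkpoint = "Salesforce/codet5p-220m"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = T5ForConditionalGeneration.from_pretrained(checkpoint)

# CodeXGLUE code-to-code translation pairs (columns: "java" and "cs").
dataset = load_dataset("code_x_glue_cc_code_to_code_trans")

def preprocess(batch):
    # Java -> C# direction shown; the reverse direction is symmetric.
    model_inputs = tokenizer(batch["java"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["cs"], max_length=512, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True,
                        remove_columns=dataset["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="codet5p-220m-finetuned",
    learning_rate=5e-5,               # illustrative hyperparameters
    num_train_epochs=10,
    per_device_train_batch_size=8,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```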

Repository Contents

  • data/: Directory containing the dataset used for training and evaluation.
  • models/: Directory containing scripts to load and evaluate the different versions of the CodeT5+ model.
  • notebooks/: Jupyter notebooks used for training, fine-tuning, and evaluation of the models.
  • results/: Directory containing the evaluation results, including BLEU and CodeBLEU scores.
  • src/: Source code for training and evaluating the transformer models.

Results

The evaluation results, including BLEU and CodeBLEU[3] scores, can be found in the results/ directory. Our analysis shows that the fine-tuned CodeT5+ model achieves the best performance on the code-to-code translation tasks.

| Method | BLEU | CodeBLEU |
| --- | --- | --- |
| Naive copy[2] | 18.54 | - |
| PBSMT[2] | 43.53 | 42.71 |
| Transformer[2] | 55.84 | 63.74 |
| RoBERTa (code)[2] | 77.46 | 83.07 |
| CodeBERT[2] | 79.92 | 85.10 |
| LLaMA3 8B (zero-shot prompting) | 49.07 | 35.79 |
| CodeT5+ 220M | 0.0 | 13.31 |
| CodeT5+ 220M self-trained | 0.0 | 20.49 |
| CodeT5+ 220M fine-tuned | 98.44 | 87.55 |
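
For reference, the BLEU column can be reproduced with a standard corpus-level BLEU implementation; a minimal sketch using sacrebleu (an assumed tool choice) is shown below. The CodeBLEU column uses the metric from [3], which augments n-gram overlap with AST and data-flow matches and has a reference evaluator in the CodeXGLUE repository.

```python
# Minimal sketch: corpus-level BLEU over model translations, using sacrebleu
# (an assumed tool choice; not necessarily what results/ was produced with).
import sacrebleu

# Hypotheses are model outputs; references are the ground-truth translations.
hypotheses = ["public int Add(int a, int b) { return a + b; }"]
references = [["public int Add(int a, int b) { return a + b; }"]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")  # 100.00 for this exact match
```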

Conclusion

Our project demonstrates the effectiveness of the CodeT5+ 220M model for code-to-code translation, particularly when fine-tuned on a specific dataset. The fine-tuned model not only outperforms the unmodified and from-scratch trained versions but also surpasses other models evaluated on the same dataset.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact

Feel free to explore the repository and use the provided scripts and notebooks to replicate our experiments or build upon our work. For any questions or feedback, please open an issue or contact us directly.

References

[1] Wang, Y., Le, H., Gotmare, A. D., Bui, N. D., Li, J., & Hoi, S. C. (2023). CodeT5+: Open code large language models for code understanding and generation. arXiv preprint arXiv:2305.07922. paper | code | model | blog

[2] Lu, S., Guo, D., Ren, S., Huang, J., Svyatkovskiy, A., Blanco, A., ... & Liu, S. (2021). CodeXGLUE: A machine learning benchmark dataset for code understanding and generation. arXiv preprint arXiv:2102.04664. paper | dataset

[3] Ren, S., Guo, D., Lu, S., Zhou, L., Liu, S., Tang, D., ... & Ma, S. (2020). CodeBLEU: A method for automatic evaluation of code synthesis. arXiv preprint arXiv:2009.10297. paper

[4] Wang, Y., Wang, W., Joty, S., & Hoi, S. C. (2021). CodeT5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. arXiv preprint arXiv:2109.00859. paper | code | model | model card | blog
