This repository contains supplementary material for the paper "Pitfalls in Language Models for Code Intelligence: A Taxonomy and Survey", submitted to the ACM Computing Surveys journal.
Modern language models (LMs) have been successfully employed in source code generation and understanding, leading to a significant increase in research on learning-based code intelligence, such as automated bug repair and test case generation. Despite their great potential, language models for code intelligence (LM4Code) are susceptible to pitfalls that hinder realistic performance and undermine their reliability and applicability in real-world deployment. These challenges call for a comprehensive understanding: not just identifying the issues, but also examining their possible implications and existing solutions, in order to build more reliable language models tailored to code intelligence. Following a well-defined systematic research approach, we conducted an extensive literature review to uncover the pitfalls inherent in LM4Code and identified 67 primary studies from top-tier venues. After carefully examining these studies, we designed a taxonomy of pitfalls in LM4Code research and conducted a systematic study to summarize the issues, implications, current solutions, and challenges of different pitfalls for LM4Code systems. Our classification scheme dissects pitfalls across four crucial aspects: data collection and labeling, system design and learning, performance evaluation, and deployment and maintenance. Through this study, we aim to provide a roadmap for researchers and practitioners, facilitating their understanding and utilization of LM4Code in reliable and trustworthy ways.
Please feel free to send a pull request to add papers and relevant content that are not listed here. We have uploaded our complete paper list, with detailed review information, to Google Drive.
- Dos and Don'ts of Machine Learning in Computer Security (2022), USENIX Security, D Arp, et al. [pdf]
- Machine/deep learning for software engineering: A systematic literature review (2022), TSE, Simin Wang, et al. [pdf]
- Trustworthy AI: From principles to practices (2023), arxiv, B Li, et al. [pdf]
- Deep Learning Based Vulnerability Detection: Are We There Yet? (2021), arxiv, S Chakraborty, R Krishna, Y Ding, et al. [pdf]
- Does data sampling improve deep learning-based vulnerability detection? Yeas! and Nays! (2023), ICSE, X Yang, et al. [pdf]
- On the Value of Oversampling for Deep Learning in Software Defect Prediction (2021), TSE, R Yedida, T Menzies. [pdf]
- Robust Learning of Deep Predictive Models from Noisy and Imbalanced Software Engineering Datasets (2022), ASE, Z Li, et al. [pdf]
- An empirical study of deep learning models for vulnerability detection (2023), arxiv, B Steenhoek, et al. [pdf]
- Robust Learning of Deep Predictive Models from Noisy and Imbalanced Software Engineering Datasets (2022), ASE, Z Li, et al. [pdf]
- XCode: Towards Cross-Language Code Representation with Large-Scale Pre-Training (2022), TOSEM, Z Lin, et al. [pdf]
- Understanding and Tackling Label Errors in Deep Learning-Based Vulnerability Detection (Experience Paper) (2023), ISSTA, X Nie, et al. [pdf]
- Slice-Based Code Change Representation Learning (2023), SANER, F Zhang, et al. [pdf]
- Are we building on the rock? on the importance of data preprocessing for code summarization (2022), FSE, L Shi, et al. [pdf]
- Neural-Machine-Translation-Based Commit Message Generation: How Far Are We? (2018), ASE, Z Liu, et al. [pdf]
- AutoTransform: automated code transformation to support modern code review process (2022), ICSE, P Thongtanunam, C Pornprasit, C Tantithamthavorn. [pdf]
- Can Neural Clone Detection Generalize to Unseen Functionalities? (2021), ASE, C Liu, et al. [pdf]
- CD-VulD: Cross-Domain Vulnerability Discovery Based on Deep Domain Adaptation (2020), TDSC, S Liu, et al. [pdf]
- Deep just-in-time defect prediction: how far are we? (2021), ISSTA, Z Zeng, et al. [pdf]
- Patching as translation: the data and the metaphor (2020), ASE, Y Ding, et al. [pdf]
- An empirical study of deep learning models for vulnerability detection (2023), ICSE, B Steenhoek, et al. [pdf]
- Keeping Pace with Ever-Increasing Data: Towards Continual Learning of Code Intelligence Models (2023), ICSE, S Gao, et al. [pdf]
- Revisiting Learning-based Commit Message Generation (2023), ICSE, J Dong, Y Lou, D Hao, et al. [pdf]
- Syntax and Domain Aware Model for Unsupervised Program Translation (2023), ICSE, F Liu, J Li, L Zhang. [pdf]
- How Effective Are Neural Networks for Fixing Security Vulnerabilities (2023), ISSTA, Y Wu, N Jiang, HV Pham, et al. [pdf]
- Towards More Realistic Evaluation for Neural Test Oracle Generation (2023), ISSTA, Z Liu, K Liu, X Xia, et al. [pdf]
- On the Evaluation of Neural Code Summarization (2022), ICSE, E Shi, Y Wang, L Du, et al. [pdf]
- Deep Learning Based Vulnerability Detection: Are We There Yet? (2021), TSE, S Chakraborty, R Krishna, Y Ding, et al. [pdf]
- Diet code is healthy: simplifying programs for pre-trained models of code (2022), FSE, Z Zhang, H Zhang, B Shen, et al. [pdf]
- Explaining mispredictions of machine learning models using rule induction (2021), FSE, J Cito, I Dillig, S Kim, et al. [pdf]
- Interpreting Deep Learning-based Vulnerability Detector Predictions Based on Heuristic Searching (2021), TOSEM, D Zou, Y Zhu, S Xu, et al. [pdf]
- Thinking Like a Developer? Comparing the Attention of Humans with Neural Models of Code (2021), ASE, M Paltenghi, M Pradel. [pdf]
- Vulnerability detection with fine-grained interpretations (2021), FSE, Y Li, S Wang, TN Nguyen. [pdf]
- What do they capture? a structural analysis of pre-trained language models for source code (2022), ICSE, Y Wan, W Zhao, H Zhang, et al. [pdf]
- An empirical study of deep learning models for vulnerability detection (2023), ICSE, B Steenhoek, MM Rahman, R Jiles, et al. [pdf]
- Towards Efficient Fine-Tuning of Pre-trained Code Models: An Experimental Study and Beyond (2023), ISSTA, E Shi, Y Wang, H Zhang, et al. [pdf]
- Deep Learning Based Vulnerability Detection: Are We There Yet? (2021), TSE, S Chakraborty, R Krishna, Y Ding, et al. [pdf]
- Enhancing DNN-Based Binary Code Function Search With Low-Cost Equivalence Checking (2022), TSE, H Wang, P Ma, Y Yuan, et al. [pdf]
- Improving automatic source code summarization via deep reinforcement learning (2018), ASE, Y Wan, Z Zhao, M Yang, et al. [pdf]
- Patching as translation: the data and the metaphor (2020), ASE, Y Ding, B Ray, P Devanbu, et al. [pdf]
- Reinforcement-Learning-Guided Source Code Summarization Using Hierarchical Attention (2020), TSE, W Wang, Y Zhang, Y Sui, et al. [pdf]
- XCode: Towards Cross-Language Code Representation with Large-Scale Pre-Training (2022), TOSEM, Z Lin, G Li, J Zhang, et al. [pdf]
- RepresentThemAll: A Universal Learning Representation of Bug Reports (2023), ICSE, S Fang, T Zhang, Y Tan, et al. [pdf]
- Template-based Neural Program Repair (2023), ICSE, X Meng, X Wang, H Zhang, et al. [pdf]
- Towards More Realistic Evaluation for Neural Test Oracle Generation (2023), arxiv, Z Liu, K Liu, X Xia, et al. [pdf]
- Deep Learning Based Program Generation From Requirements Text: Are We There Yet? (2020), TSE, H Liu, M Shen, J Zhu, et al. [pdf]
- Generating realistic vulnerabilities via neural code editing: an empirical study (2022), FSE, Y Nong, Y Ou, M Pradel, et al. [pdf]
- An extensive study on pre-trained models for program understanding and generation (2022), ISSTA, Z Zeng, H Tan, H Zhang, et al. [pdf]
- Deep Learning Based Vulnerability Detection: Are We There Yet? (2021), TSE, S Chakraborty, R Krishna, Y Ding, et al. [pdf]
- Improving automatic source code summarization via deep reinforcement learning (2018), ASE, Y Wan, Z Zhao, M Yang, et al. [pdf]
- Multi-task learning based pre-trained language model for code completion (2020), ASE, F Liu, G Li, Y Zhao, et al. [pdf]
- On the Value of Oversampling for Deep Learning in Software Defect Prediction (2021), TSE, R Yedida, T Menzies. [pdf]
- Patching as translation: the data and the metaphor (2020), ASE, Y Ding, B Ray, P Devanbu, et al. [pdf]
- Reinforcement-Learning-Guided Source Code Summarization Using Hierarchical Attention (2020), TSE, W Wang, Y Zhang, Y Sui, et al. [pdf]
- SynShine: Improved Fixing of Syntax Errors (2022), TSE, Ahmed T, Ledesma N R, Devanbu P. [pdf]
- An empirical study of deep learning models for vulnerability detection (2023), ICSE, B Steenhoek, MM Rahman, R Jiles, et al. [pdf]
- Revisiting Learning-based Commit Message Generation (2023), ICSE, J Dong, Y Lou, D Hao, et al. [pdf]
- Tare: Type-Aware Neural Program Repair (2023), ICSE, Q Zhu, Z Sun, W Zhang, et al. [pdf]
- How Effective Are Neural Networks for Fixing Security Vulnerabilities (2023), ISSTA, Y Wu, N Jiang, HV Pham, et al. [pdf]
- Towards More Realistic Evaluation for Neural Test Oracle Generation (2023), ISSTA, Z Liu, K Liu, X Xia, et al. [pdf]
- GitHub Copilot AI pair programmer: Asset or Liability? (2023), JSS, AM Dakhel, V Majdinasab, A Nikanjam, et al. [pdf]
- Examining Zero-Shot Vulnerability Repair with Large Language Models (2023), S&P, H Pearce, B Tan, B Ahmad, et al. [pdf]
- A Performance-Sensitive Malware Detection System Using Deep Learning on Mobile Devices (2020), TIFS, R Feng, S Chen, X Xie, et al. [pdf]
- Diet code is healthy: simplifying programs for pre-trained models of code (2022), FSE, Z Zhang, H Zhang, B Shen, et al. [pdf]
- When Code Completion Fails: A Case Study on Real-World Completions (2019), ICSE, VJ Hellendoorn, S Proksch, HC Gall, et al. [pdf]
- Lost at C: A User Study on the Security Implications of Large Language Model Code Assistants (2023), arxiv, G Sandoval, H Pearce, T Nys, et al. [pdf]
- Grounded Copilot: How Programmers Interact with Code-Generating Models (2023), OOPSLA1, S Barke, MB James, N Polikarpova. [pdf]
- LLaMA-Reviewer: Advancing Code Review Automation with Large Language Models through Parameter-Efficient Fine-Tuning (2023), arxiv, J Lu, L Yu, X Li, et al. [pdf]
- Compressing Pre-trained Models of Code into 3 MB (2022), ASE, J Shi, Z Yang, B Xu, et al. [pdf]
- You Autocomplete Me: Poisoning Vulnerabilities in Neural Code Completion (2021), USENIX Security, R Schuster, C Song, E Tromer, et al. [pdf]
- Adversarial Robustness of Deep Code Comment Generation (2022), TOSEM, Y Zhou, X Zhang, J Shen, et al. [pdf]
- An extensive study on pre-trained models for program understanding and generation (2022), ISSTA, Z Zeng, H Tan, H Zhang, et al. [pdf]
- Generating Adversarial Examples for Holding Robustness of Source Code Processing Models (2020), AAAI, H Zhang, Z Li, G Li, et al. [pdf]
- Semantic Robustness of Models of Source Code (2020), SANER, G Ramakrishnan, J Henkel, Z Wang, et al. [pdf]
- You see what I want you to see: poisoning vulnerabilities in neural code search (2022), FSE, Y Wan, S Zhang, H Zhang, et al. [pdf]
- Contrabert: Enhancing code pre-trained models via contrastive learning (2023), ICSE, S Liu, B Wu, X Xie, et al. [pdf]
- On the robustness of code generation techniques: An empirical study on github copilot (2023), ICSE, A Mastropaolo, L Pascarella, E Guglielmi, et al. [pdf]
- Two sides of the same coin: Exploiting the impact of identifiers in neural code comprehension (2023), ICSE, S Gao, C Gao, C Wang, et al. [pdf]
- Multi-target Backdoor Attacks for Code Pre-trained Models (2023), ACL, Y Li, S Liu, K Chen, et al. [pdf]
- Backdooring Neural Code Search (2023), ACL, W Sun, Y Chen, G Tao, et al. [pdf]
- ReCode: Robustness Evaluation of Code Generation Models (2023), ACL, S Wang, Z Li, H Qian, et al. [pdf]
- Natural Attack for Pre-trained Models of Code (2022), ICSE, Z Yang, J Shi, J He, et al. [pdf]
- Coprotector: Protect open-source code against unauthorized training usage with data poisoning (2022), WWW, Z Sun, X Du, F Song, et al. [pdf]
- On the Security Vulnerabilities of Text-to-SQL Models (2023), ISSRE, X Peng, Y Zhang, J Yang, et al. [pdf]
- Asleep at the Keyboard? Assessing the Security of GitHub Copilot's Code Contributions (2022), S&P, H Pearce, B Ahmad, B Tan, et al. [pdf]
- Automated repair of programs from large language models (2023), ICSE, Z Fan, X Gao, M Mirchev, et al. [pdf]
- Cctest: Testing and repairing code completion systems (2023), ICSE, Z Li, C Wang, Z Liu, et al. [pdf]
- Analyzing Leakage of Personally Identifiable Information in Language Models (2023), S&P, N Lukas, A Salem, R Sim, et al. [pdf]
- CodexLeaks: Privacy Leaks from Code Generation Language Models in GitHub Copilot (2023), USENIX Security, L Niu, S Mirza, Z Maradni, et al. [pdf]
- ACCENT (Adversarial Code Comment gENeraTor)
- AutoTransform
- Functionality-generalization
- CD-VulD
  - a new system for Cross Domain Software Vulnerability Discovery using deep learning (DL) and domain adaptation (DA)
  - learns cross-domain representations
  - TDSC 2022
  - [pdf]
- DietCode
- BinUSE
  - a practical and efficient equivalence check, using under-constrained symbolic execution (USE)
  - TSE 2023
  - [pdf]
- mmd
- Metropolis-Hastings Modifier (MHM)
- ReVeal
- ghost-dl
- HAN
  - Hierarchical Attention Network
  - uses multiple structural code features (including control flow graph and AST) to reflect the code hierarchy, with a two-layer attention network (a token layer and a statement layer)
  - TSE 2022
  - [pdf]
- RobustTrainer
- SYNSHINE
- CARROTA
- CAT
- apr4codex
- CCTEST
  - tests and repairs code completion systems in a black-box setting
  - ICSE 2023
  - [pdf], [HomePage]
- ContraBERT
  - an approach that aims to improve the robustness of pre-trained models via contrastive learning
  - ICSE 2023
  - [pdf], [HomePage]
- REPEAT
- RepresentThemAll
- TENURE
- CREAM
- Telly
- TEval+
- analysing_pii_leakage
- CodexLeaks
- CoProtector
- LLaMA-Reviewer
- Compressor
- NNGen
- Actor-critic network
  - incorporates an abstract syntax tree structure as well as the sequential content of code snippets into a deep reinforcement learning framework
  - ASE 2018
  - [pdf]
- CugLM
  - a multi-task learning based pre-trained language model for code understanding and code generation with a Transformer-based neural architecture
  - ASE 2020
  - [pdf]
- SDA-Trans
  - adversarial training, unsupervised training
  - a syntax and domain-aware model for program translation, which leverages the syntax structure and domain knowledge to enhance the cross-lingual transfer ability
  - ICSE 2023
  - [pdf]
- Tare
- Interpreting Deep Learning-based Vulnerability Detector Predictions Based on Heuristic Searching
  - a framework for interpreting predictions of deep learning-based vulnerability detectors
  - The framework centers on identifying a small number of tokens that make important contributions to a particular prediction. Its novelty can be characterized as follows: (1) it does not assume the detector's local decision boundary is linear; (2) it does not assume the features are independent of each other but instead accounts for the association between features when searching for important features; (3) it searches for important features by perturbing examples, while considering feature combinations rather than individual features.
  - TOSEM 2021
  - [pdf]
- Thinking Like a Developer? Comparing the Attention of Humans with Neural Models of Code
- Vulnerability detection with fine-grained interpretations
- What do they capture? a structural analysis of pre-trained language models for source code
- ReCode: Robustness Evaluation of Code Generation Models
- Deep Learning Based Program Generation From Requirements Text: Are We There Yet?
- Deep Learning Based Vulnerability Detection: Are We There Yet?
  - [Chromium_And_Debian_Vulnerability_Data], curated a real-world dataset from developer/user reported vulnerabilities of Chromium and Debian projects
  - TSE 2021
  - [pdf]
- Generating realistic vulnerabilities via neural code editing: an empirical study
  - [SARD], Synthetic dataset
  - [Real-world dataset], a real-world dataset based on BigVul and PatchDB
  - FSE 2022
  - [pdf]
- ReCode: Robustness Evaluation of Code Generation Models
  - robustness evaluation metrics for code-generation tasks: Robust Pass_s@k, Robust Drop_s@k, and Robust Relative_s@k
  - ACL 2023
  - [pdf]
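
As a rough illustration of the first two ReCode metrics listed above, the following Python sketch shows one way they could be computed. It is a minimal sketch under stated assumptions, not the ReCode implementation: it assumes a sample counts as robustly correct only if it passes under every one of the s perturbed prompts, that Robust Pass_s@k applies the usual unbiased pass@k estimator to that count, and that Robust Drop_s@k is the relative drop from nominal pass@k. The function names and the nested-list layout of `results` are hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem: n samples, c of them correct."""
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def robust_pass_at_k(results, k):
    """Average robust pass@k over problems.

    results[p][i][j] is True if sample i of problem p passes its tests
    under perturbation j (hypothetical layout chosen for this sketch).
    """
    total = 0.0
    for per_problem in results:
        n = len(per_problem)
        # Assumption: a sample is robustly correct only if it passes
        # under all perturbations.
        robust_correct = sum(all(sample) for sample in per_problem)
        total += pass_at_k(n, robust_correct, k)
    return total / len(results)


def robust_drop_at_k(nominal_pass, robust_pass):
    """Relative drop from nominal pass@k to robust pass@k."""
    return (nominal_pass - robust_pass) / nominal_pass


# Toy data: 2 problems, 2 samples each, 2 perturbations per sample.
toy_results = [
    [[True, True], [True, False]],    # one robustly correct sample
    [[False, True], [False, False]],  # no robustly correct sample
]
robust = robust_pass_at_k(toy_results, k=1)
```

Robust Relative_s@k is left out of the sketch; consult the ReCode paper for its exact definition.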
- AI Domain
  - AAAI, the AAAI Conference on Artificial Intelligence
  - ACL, the Annual Meeting of the Association for Computational Linguistics
- SE Domain
  - ICSE, the International Conference on Software Engineering
  - FSE, Symposium on the Foundations of Software Engineering
  - ASE, the International Conference on Automated Software Engineering
  - ISSTA, the International Symposium on Software Testing and Analysis
  - ISSRE, IEEE International Symposium on Software Reliability Engineering
  - SANER, IEEE International Conference on Software Analysis, Evolution and Reengineering
  - OOPSLA, the ACM Conference on Object-Oriented Programming, Systems, Languages, and Applications
- Security Domain
  - S&P, IEEE Symposium on Security and Privacy
  - USENIX Security, USENIX Security Symposium
- Internet and Web technology Domain
  - WWW, International World Wide Web Conference
- SE Domain
  - TSE, the IEEE Transactions on Software Engineering
  - TOSEM, ACM Transactions on Software Engineering and Methodology
  - JSS, Journal of Systems and Software
- Security Domain
  - TDSC, IEEE Transactions on Dependable and Secure Computing
  - TIFS, IEEE Transactions on Information Forensics and Security