Language models like CodeBERT have advanced source code representation learning, but their opacity makes it hard to understand which properties they capture. Recent attention analysis studies offer initial interpretability insights but focus solely on attention weights, ignoring the rest of the Transformer's context modeling. The goal of this study is to shed light on these previously ignored factors of the attention mechanism, beyond the attention weights. We conduct an initial empirical study analyzing both attention distributions and transformed representations in CodeBERT. Across two programming languages, Java and Python, we find that the scaled transformation norms of the input capture syntactic structure better than attention weights alone. Our analysis characterizes how CodeBERT embeds syntactic code properties. The findings demonstrate the importance of incorporating factors beyond attention weights for rigorously understanding neural code models, and they lay the groundwork for developing more interpretable models and more effective uses of attention mechanisms in program analysis.
The purpose of this artifact is to replicate the results stated in the paper, whose main goal is to shed light on the previously ignored factors of the attention mechanism, beyond the attention weights. Through this artifact evaluation, we would like to apply for the Available and Reusable badges. We believe these badges are relevant to this artifact for two reasons. First, all of the source code used in this artifact is publicly available, including the code used to generate and transform the input data and the logic for extracting the attention and scaled attention maps. Second, the code is written so that other researchers can easily replicate the study by following the steps described later in this README.
The preprint of the paper is available here; the code and data are available in this repository.
This repository contains the code and data needed to replicate the study. Its main folders are described as follows:
- `custom_transformers`: Contains the modified implementation of `modelling_roberta.py` from the `transformers` library, based on Kobayashi et al.'s work. It is used to generate maps of $||f(x)||$ and $||\alpha f(x)||$ along with the attention maps from each head (see the sketch after this list).
- `data`: Contains the input data used in this study. We provide the input data directly for convenience, but it can also be generated from scratch by following the steps mentioned later.
- `notebooks`: Contains the notebooks used to preprocess the raw data into input data.
- `raw_data`: Contains the data that will be preprocessed to obtain the input data. As mentioned in the paper, we used two corpora (Python and Java) from CodeSearchNet (CSN). The raw data for the Java corpus is seeded by randomly sampling 65K samples from CSN. You can download the set from the original repository here and sample your own seed, or directly use our provided sample `csn_java_65k.jsonl`. As for the Python dataset (`train.json`), we used Wan et al.'s, given that it already contains the property map.
- `scripts`: Contains the implementation of the logic needed to extract the data to answer RQ1 and RQ2.
- `visualization`: Contains the notebooks needed to generate Figures 2, 3, and 4 in the manuscript.
- `results`: The destination folder where the results of RQ1 and RQ2 will be saved. `java` and `python` will contain the results of RQ1, whereas the root folder (i.e., `results`) will contain the results of RQ2. We also provide the results data directly for convenience.
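Since the attention weight $\alpha_{ij}$ is a nonnegative scalar, the map $||\alpha f(x)||$ factors into the attention weight times the norm of the transformed value vector. The following is a minimal sketch of this computation, for illustration only; the artifact's actual implementation lives in `custom_transformers/modelling_roberta.py`:

```python
import torch

def norm_based_maps(alpha: torch.Tensor, fx: torch.Tensor):
    """Compute ||f(x)|| and ||alpha f(x)|| for one layer.

    alpha: attention weights, shape (heads, seq, seq).
    fx:    transformed value vectors f(x), shape (heads, seq, d_head).
    """
    fx_norm = fx.norm(dim=-1)                # (heads, seq): ||f(x_j)||
    # alpha_ij is a nonnegative scalar, so ||alpha_ij f(x_j)|| = alpha_ij * ||f(x_j)||
    afx_norm = alpha * fx_norm.unsqueeze(1)  # (heads, seq, seq)
    return fx_norm, afx_norm
```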
For this small, self-contained codebase, virtualenv provides a lightweight way to manage dependencies and distribute the code: it encapsulates the Python environment without the configuration and size overhead that Docker containers or VMs would add, while still ensuring portability and reproducibility.
First, create a virtual environment in which to install the dependencies:
python3 -m venv norm_analysis
Second, activate the virtual environment (on a Linux distribution or macOS):
source norm_analysis/bin/activate
If you are using Windows:
norm_analysis\Scripts\activate
After executing these commands, you should see `(norm_analysis)` in your terminal, and a folder with the same name created at the root level of the repository.
Then, the dependencies can be installed by executing:
pip install -r requirements.txt
Note that we use `scienceplots` to plot the graphs of RQ1, which entails installing LaTeX. If you are using macOS, you can use MacTeX; on Windows, MiKTeX. Linux (Ubuntu) users can install TeX and its requirements as follows:
sudo apt-get install dvipng texlive-latex-extra texlive-fonts-recommended cm-super
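The LaTeX dependency comes from the `scienceplots` styles, which render figure text with LaTeX. A quick sanity check that the plotting stack works (a throwaway example, not part of the artifact):

```python
import matplotlib.pyplot as plt
import scienceplots  # importing registers the 'science' styles with matplotlib

plt.style.use(["science"])      # enables LaTeX text rendering by default
plt.plot([0, 1], [0, 1])
plt.savefig("latex_check.png")  # rendering fails here if LaTeX is missing
```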
The final step is to add the virtual environment to Jupyter Notebook by executing the following commands. Make sure the virtual environment is activated when doing so.
pip install --user ipykernel
python -m ipykernel install --user --name=norm_analysis
If you face an issue while executing the two commands above, try setting `include-system-site-packages` in `norm_analysis/pyvenv.cfg` to `true`.
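For reference, a freshly created `norm_analysis/pyvenv.cfg` typically looks like the following (paths and versions will differ on your machine), with the flag already set to `true`:

```ini
home = /usr/bin
include-system-site-packages = true
version = 3.9.2
```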
To run the notebooks, execute the following command:
jupyter notebook
and do not forget to select the added virtual environment as the kernel.
The hardware specification used for this study is as follows:
- OS: Debian GNU/Linux 11 (bullseye)
- CPU: Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz
- GPU: NVIDIA GeForce RTX 2080 (8GB VRAM)
- RAM: 32 GB
- Storage: 10 GB
Most of the logic is implemented in `scripts/RQ1.py`, which is used to generate the data needed for both RQs. We only perform a single forward pass (i.e., inference); no backpropagation is needed. The RQ1 script took approximately 25 minutes to run on the hardware listed above, so it should take less time on a higher-end GPU. Note that a GPU is not required: the artifact can be fully executed on a CPU, although this increases the overall execution time.
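For orientation, a single inference pass that exposes the per-head attention maps looks roughly like this with the stock `transformers` API (the artifact's modified `modelling_roberta.py` additionally returns the norm-based maps):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base", output_attentions=True)
model.eval()

inputs = tokenizer("def add(a, b): return a + b", return_tensors="pt")
with torch.no_grad():            # one forward pass, no backpropagation
    outputs = model(**inputs)
attentions = outputs.attentions  # per layer: (batch, heads, seq, seq)
```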
The total size of the data, including the raw data, the processed input data, and the model weights (downloaded in the background by the `huggingface` library), is around 1.5 GB, so a machine with 10 GB of storage should be sufficient to run the study.
In this step we perform some minor data preprocessing. The major preprocessing steps are removing comments (line and block) and discarding samples that cannot be parsed by `code_tokenize`, which we use to tokenize source code into tokens and group them by category.
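As an illustration of the tokenize-and-group step (a sketch based on `code_tokenize`'s documented `tokenize` entry point; the exact attribute names on its token objects may differ from what we assume here):

```python
import code_tokenize as ctok

def tokenize_and_group(source: str, lang: str) -> dict:
    """Tokenize source code and bucket the tokens by category."""
    groups = {}
    for tok in ctok.tokenize(source, lang=lang):
        # 'tok.type' is assumed to hold the token's category
        groups.setdefault(tok.type, []).append(str(tok))
    return groups

print(tokenize_and_group("def add(a, b): return a + b", "python"))
```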
Execute the notebooks `Preparing_CSN_Java.ipynb` and `Preparing_CSN_Python.ipynb` to generate the data needed for the study. Upon successful execution, `5k_csn_java.jsonl` and `5k_csn_python.jsonl` will be generated and stored in the `data` folder. Note that these files are already included in case you want to skip this step.
To generate the data needed to answer RQ1 and RQ2, execute the following command:
python3 -m scripts.RQx
where `x` is `1` or `2` depending on the RQ. The results of RQ1 will be stored in `results/java` and `results/python`, whereas the results of RQ2 will be saved directly as `.pickle` files in `results`.
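The RQ2 pickles can then be loaded as usual (the filename below is a placeholder; substitute whichever `.pickle` file the script produced):

```python
import pickle

with open("results/example.pickle", "rb") as f:  # placeholder filename
    results = pickle.load(f)
```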
Note that for RQ1, you will need to update L129 of `scripts/RQ1.py` with the raw file containing the Java samples (`data/5k_csn_java.jsonl`) and set the variable `lang` at L131 to `java`.
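Concretely, the edit amounts to something like the following (illustrative only; apart from `lang`, the variable names are guesses at what L129 and L131 of `scripts/RQ1.py` contain):

```python
# scripts/RQ1.py, around L129-L131 (illustrative):
input_file = "data/5k_csn_java.jsonl"  # L129: 'input_file' is a hypothetical name
lang = "java"                          # L131: switch from 'python' to 'java'
```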
Finally, to generate the visualizations used to study the difference between the components of the MHA, run the notebooks `RQ1.ipynb` and `RQ2.ipynb` located in the `visualization` folder.
All figures generated and presented in the manuscript are also provided in the `figures` folder.
*(Figures: results for the Python and Java corpora; see the `figures` folder.)*
*(Figures: attention weights vs. scaled norms; see the `figures` folder.)*