Code to reproduce experiments in:
@article{BelhdImprovinGarda2024,
archiveprefix = {arXiv},
author = {Garda, Samuele and Leser, Ulf},
eprint = {2401.05125v1},
month = {Jan},
primaryclass = {cs.CL},
title = {BELHD: Improving Biomedical Entity Linking with Homonym Disambiguation},
url = {http://arxiv.org/abs/2401.05125v1},
year = {2024},
}
Install the belb library in your Python environment:
git clone https://github.com/sg-wbi/belb
cd belb
pip install -e .
Then install the additional requirements specific to BELHD:
(belhd) user $ pip install -r requirements.txt
We store the predictions of all models and the gold labels in the data directory.
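As a rough illustration of the kind of number the evaluation scripts report, here is a hypothetical top-1 accuracy computation; the mention ids, identifiers and data layout below are made up and do not reflect the actual contents of the data directory.

def top1_accuracy(predictions: dict[str, str], gold: dict[str, set[str]]) -> float:
    # predictions: mention id -> top-ranked KB identifier
    # gold: mention id -> set of acceptable KB identifiers
    correct = sum(1 for m, pred in predictions.items() if pred in gold.get(m, set()))
    return correct / len(predictions) if predictions else 0.0

# Toy usage with made-up mention ids and KB identifiers:
preds = {"m1": "MESH:D003924", "m2": "NCBIGene:7157"}
gold = {"m1": {"MESH:D003924"}, "m2": {"NCBIGene:1017"}}
print(top1_accuracy(preds, gold))  # 0.5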
Below you find the commands to reproduce all tables reported in the paper.
Reproduce the results on BELB.
Main table:
(belhd) user $ python -m scripts.evaluate
BELHD ablations:
(belhd) user $ python -m scripts.evaluate_ablations
Ad-hoc solutions for homonyms. Abbreviation resolution:
(belhd) user $ python -m scripts.evaluate_ar
and species assignment:
(belhd) user $ python -m scripts.evaluate_sa
Evaluation on BioRED:
(belhd) user $ python -m biored.evaluate
If you wish to use our code with BELB, you first need to follow the belb instructions to set up a directory with all the data (corpora and KBs).
To create KB versions with disambiguated homonyms:
(belhd) user $ python -m scripts.disambiguate_kbs --dir /path/to/belb/dir
Note that belb deals with large KBs and its code is not optimized.
This step takes quite a while, especially for NCBI Gene.
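As a conceptual sketch of what disambiguating homonyms in a KB means (this is not the belb implementation; the attribute used for disambiguation and the naming format are assumptions), the idea is to make names shared by several identifiers unique by attaching extra context:

from collections import defaultdict

# Hypothetical KB entries: the name "CDK2" is a homonym shared by two identifiers.
entries = [
    {"id": "NCBIGene:12566", "name": "CDK2", "species": "Mus musculus"},
    {"id": "NCBIGene:1017", "name": "CDK2", "species": "Homo sapiens"},
    {"id": "NCBIGene:1026", "name": "p21", "species": "Homo sapiens"},
]

by_name = defaultdict(list)
for e in entries:
    by_name[e["name"]].append(e)

disambiguated = []
for name, group in by_name.items():
    for e in group:
        # Only names shared by more than one identifier get the extra context.
        new_name = f"{name} ({e['species']})" if len(group) > 1 else name
        disambiguated.append({"id": e["id"], "name": new_name})

for e in disambiguated:
    print(e["id"], "->", e["name"])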
To train BELHD you first need to convert the BELB data into the required input format.
Edit data/configs/data.yaml:
belb_dir : 'path/to/belb/directory'
exp_dir : 'path/to/experiments/directory'
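As an optional sanity check before the next step (a sketch assuming PyYAML is available in the environment), you can load the config and verify that both directories exist:

from pathlib import Path
import yaml

cfg = yaml.safe_load(Path("data/configs/data.yaml").read_text())
for key in ("belb_dir", "exp_dir"):
    path = Path(cfg[key])
    print(f"{key}: {path} -> {'ok' if path.is_dir() else 'missing'}")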
Prepare data with:
(belhd) user $ python -m scripts.tokenize_corpora
and
(belhd) user $ python -m scripts.tokenize_dkbs
Then you can use the helper scripts bin/train.sh to train the models and bin/predict.sh to obtain the predictions for each corpus.
To reproduce the ablation experiments, run the scripts bin/train_ablations.sh and bin/predict_ablations.sh.
You first need to train BELHD without HD (homonym disambiguation) and with abbreviation resolution (bin/train_nohd.sh) and obtain the predictions (bin/predict_nohd.sh).
For this you need to create a version of the data with abbreviation resolution:
(belhd) user $ python -m scripts.tokenize_corpora abbres=true
Similarly, you need to rerun the baselines with abbreviation resolution.
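For context, abbreviation resolution expands document-local short forms to their long forms before linking. The following toy sketch only illustrates the idea; it is not the tool used in this pipeline, and the abbreviation map is hypothetical (in practice it is extracted from each document):

import re

text = "Patients received nitroglycerin (NTG). NTG infusion was stopped."
abbreviations = {"NTG": "nitroglycerin"}  # short form -> long form

for short, long in abbreviations.items():
    # Replace every whole-word occurrence of the short form; a real tool would
    # also avoid rewriting the defining "long (short)" occurrence.
    text = re.sub(rf"\b{re.escape(short)}\b", long, text)

print(text)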
Gene corpora with species assignment are stored in ./data/belb/species_assign
(see SpeciesAssignment.md for details).
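As a conceptual sketch of species assignment (not the approach documented in SpeciesAssignment.md; the candidate records and detected species below are hypothetical), the idea is to restrict the candidates of a gene mention to those matching a species found in the document:

# Hypothetical candidate KB entries for the gene mention "CDK2".
candidates = [
    {"id": "NCBIGene:1017", "name": "CDK2", "species": "Homo sapiens"},
    {"id": "NCBIGene:12566", "name": "CDK2", "species": "Mus musculus"},
]
document_species = {"Homo sapiens"}  # e.g. output of a species tagger

filtered = [c for c in candidates if c["species"] in document_species]
print([c["id"] for c in filtered])  # ['NCBIGene:1017']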
For each baseline we use the original code. We provide detailed instructions on how to run them in separate files:
- BioSyn: ./baselines/biosyn/README.md
- GenBioEL: ./baselines/genbioel/README.md
- arboEL: ./baselines/arboel/README.md