Named Entity Recognition for Danish Legal Texts

Repository for supervised and few-shot named entity recognition in the Danish legal domain.

Data

This repository provides datasets for named entity recognition in the Danish legal domain in the data folder:

  1. danish_legal_ner_dataset.conll
  • train.conll

This dataset is used for supervised named entity recognition and consists of 2,415 sentences annotated with 8 named entity types:

  • Organisation
  • Person
  • Dato (date)
  • Lokation (location)
  • Lov (law)
  • Retsinstans (court)
  • Dommer (judge)
  • Advokat (lawyer)

The train.conll file is a processed version of danish_legal_ner_dataset.conll.

  2. few_shot_new_dataset.conll
  • test.conll

This dataset is used for evaluating the few-shot named entity recognition algorithm in the Danish legal domain and consists of 1,480 sentences annotated with 5 named entity types:

  • Land (country)
  • By (city)
  • Retspraksis (case law)
  • Litteratur (literature)
  • Sagsnummer (case number)

The test.conll file is a processed version of few_shot_ner_dataset.conll.
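
A minimal sketch for reading these files is shown below. It assumes the common CoNLL layout (one token per line, whitespace-separated columns with the token first and the NER tag last, and blank lines between sentences), which should be verified against the actual data.

```python
def read_conll(path):
    """Yield (tokens, tags) pairs from a CoNLL-formatted NER file.

    Assumes one token per line, whitespace-separated columns with the
    token in the first column and the NER tag in the last, and blank
    lines separating sentences.
    """
    tokens, tags = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:                 # blank line ends the sentence
                if tokens:
                    yield tokens, tags
                    tokens, tags = [], []
                continue
            parts = line.split()
            tokens.append(parts[0])      # token
            tags.append(parts[-1])       # NER tag
    if tokens:                           # flush a trailing sentence
        yield tokens, tags


sentences = list(read_conll("../data/train.conll"))
print(len(sentences), "sentences")
```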

Usage

The src folder provides scripts for training and evaluating supervised and few-shot named entity recognition models.

Install requirements:

pip install -r requirements.txt

To fine-tune a transformer model for supervised named entity recognition on the dataset danish_legal_ner_dataset.conll, run the following script:

python train_ner.py --labels ../data/labels.txt --data ../data/train.conll --checkpoint <huggingface-remote-or-local-model-checkpoint> --output_dir <path-to-output-dir> --batch_size <batch_size> --lr <lr> --epochs <epochs>
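
For orientation, the sketch below shows roughly what such a fine-tuning run amounts to with the Hugging Face transformers and datasets libraries: tokenize each pre-split sentence, align the word-level tags with the subword tokens, and train a token-classification head. The checkpoint name and hyperparameters are placeholders, read_conll is the helper sketched in the Data section, and the actual train_ner.py may differ in its details.

```python
from datasets import Dataset
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

labels = [l.strip() for l in open("../data/labels.txt", encoding="utf-8") if l.strip()]
label2id = {l: i for i, l in enumerate(labels)}

checkpoint = "bert-base-multilingual-cased"   # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=len(labels))

def encode(example):
    # Tokenize the pre-split words and align word-level tags with subwords;
    # special tokens get the ignore index -100.
    enc = tokenizer(example["tokens"], is_split_into_words=True, truncation=True)
    enc["labels"] = [-100 if w is None else label2id[example["tags"][w]]
                     for w in enc.word_ids()]
    return enc

# read_conll() as sketched in the Data section above
train = Dataset.from_list(
    [{"tokens": t, "tags": g} for t, g in read_conll("../data/train.conll")]
).map(encode, remove_columns=["tokens", "tags"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3,
                           per_device_train_batch_size=16, learning_rate=5e-5),
    train_dataset=train,
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```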

The script datasampler.py provides an algorithm for sampling arbitrary N-Way K-Shot support sets for few-shot named entity recognition. The algorithm also outputs a file with the remaining sentences, not sampled into the support set, to use as the query set. For example, to sample a 2-Way 5-Shot support and query set from the dataset few_shot_new_dataset.conll, run the following:

python datasampler.py --n 2 --k 5 --datafile ../data/test.conll
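
The sketch below illustrates one simple greedy way to build such a split: pick N entity classes, add shuffled sentences to the support set until every picked class has at least K mentions, and treat everything left over as the query set. It assumes BIO-style tags and is only an approximation of what datasampler.py actually does.

```python
import random
from collections import Counter

def sample_support(sentences, n, k, seed=0):
    """Greedy N-way K-shot split. sentences: list of (tokens, tags) with BIO tags."""
    rng = random.Random(seed)
    # Entity classes present in the data (strip the B-/I- prefixes).
    classes = sorted({t.split("-", 1)[1] for _, tags in sentences
                      for t in tags if t != "O"})
    target = set(rng.sample(classes, n))
    counts = Counter()
    support, query = [], []
    for sent in rng.sample(sentences, len(sentences)):   # shuffled copy
        _, tags = sent
        ents = [t.split("-", 1)[1] for t in tags if t.startswith("B-")]
        # Add the sentence only if it contributes a still-needed class.
        if any(e in target and counts[e] < k for e in ents):
            support.append(sent)
            counts.update(e for e in ents if e in target)
        else:
            query.append(sent)
    return support, query


# e.g. a 2-way 5-shot split of the few-shot data
support, query = sample_support(list(read_conll("../data/test.conll")), n=2, k=5)
```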

To evaluate a few-shot algorithm based on StructShot using a sampled support and query set, run the following:

python fewshot.py --data_dir ../data/ --labels ../data/labels.txt --target_labels ../data/labels_few_shot.txt --train_fname train --sup_fname <path-to-support-set-file> --test_fname <path-to-query-set-file> --model_name_or_path <huggingface-model-name> --checkpoint <huggingface-remote-or-local-model-checkpoint> --output_dir <path-to-output-dir> --algorithm StructShot --gpus 1 --eval_batch_size <eval_batch_size>

The few-shot algorithm is based on the StructShot algorithm, which can be found here: https://github.com/asappresearch/structshot
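
As a rough picture of what StructShot builds on: every support and query token is embedded with the fine-tuned encoder, each query token is assigned the tag of its nearest support token, and a Viterbi pass over abstracted tag transitions then cleans up the predicted sequence. The sketch below shows only the nearest-neighbor step; the transition modelling and the exact distance function follow the linked repository.

```python
import torch

def nearest_neighbor_tags(query_emb, support_emb, support_tags):
    """Assign each query token the tag of its closest support token.

    query_emb:    (Q, D) tensor of query-token embeddings
    support_emb:  (S, D) tensor of support-token embeddings
    support_tags: list of S tags aligned with support_emb
    """
    dists = torch.cdist(query_emb, support_emb)   # (Q, S) pairwise L2 distances
    nearest = dists.argmin(dim=1)                 # closest support token per query token
    return [support_tags[i] for i in nearest.tolist()]
```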