
An improved tool for named entity recognition for Polish based on deep learning.


CLARIN-PL/PolDeepNer2


PolDeepNer2

PolDeepNer2 is an improved version of PolDeepNer. The tool recognizes and categorizes named entities using neural networks and transformer-based language models.

The tool is provided with a list of pre-trained models for Polish and other languages.

It contains a pre-trained model trained on the NKJP corpus which recognizes nested annotations of the following types:

Contributors

Notebooks

notebooks/pdn2_cpu.py
This notebook shows how to install the module and use its API to process raw text on a CPU.

Models

PolEval 2018 (NKJP NER model)

PolDeepNer2 achieves state-of-the-art (SOTA) results on the PolEval 2018 dataset.

NKJP NER categories

| Model | Score | F1 Overlap | F1 Exact | Score main | Time CPU | Time GPU | Source |
|---|---|---|---|---|---|---|---|
| PolDeepNer2 | | | | | | | |
| HerBERT large, spacy-ext, sq | 92.1 | 92.7 | 89.9 | | | ~2m 24s | |
| Polish RoBERTa base, spacy-ext, sq | 91.4 | 91.9 | 89.1 | | ~1.5 h | ~2m 8s | |
| Polish RoBERTa base, toki | 90.0 | 90.5 | 87.7 | 92.40 | ~6h 30m | ~6m 30s | |
| Polish RoBERTa base, spacy-ext | 89.8 | 90.4 | 87.4 | 92.20 | | ~8m 2s | |
| Systems published after PolEval 2018 | | | | | | | |
| Dadas et al. 2020 [1] | 88.6 | 87.0 | 89.0 | - | - | - | link |
| Polish RoBERTa (large) [1] | - | - | - | 89.98 | - | - | link |
| Polish RoBERTa (base) [1] | - | - | - | 87.94 | - | - | link |
| spaCy (pl_spacy_model) | - | - | - | 87.50 | ~3m | - | link |
| Top 3 systems from PolEval 2018 | | | | | | | |
| Applica.ai | 86.6 | 87.7 | 82.6 | - | - | - | link |
| PolDeepNer | 85.1 | 85.9 | 82.2 | - | - | ~9m | link |
| Liner2 | 81.0 | 81.8 | 77.8 | - | ~3m | - | link |

[1] The model is not available. Only the evaluation results were published.

Comparison of loading and processing times

| Model | Library | Tokenizer | Model loading [s] | Preprocessing [s] | NE recognition [s] | Total [s] |
|---|---|---|---|---|---|---|
| Polish RoBERTa base | fairseq | - | 12.28 | 50.90 | 65.23 | 128.4 |
| HerBERT large | HuggingFace | HerbertTokenizerFast | 18.44 | 50.83 | 103.70 | 173.0 |
| HerBERT large | HuggingFace | XLMTokenizer | 18.33 | 51.42 | 177.50 | 247.3 |

  • Dataset size: 1828 documents (3M characters).
  • GPU: RTX Titan (24 GB, 4608 CUDA cores).
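
To make the timing breakdown easier to compare, the following minimal sketch recomputes each total from its three stages and derives the share of time spent on NE recognition. The timings are copied from the table; the `summarize` helper and the characters-per-second figure are illustrative additions, and the 3.072M character count is taken from the `process_poleval.py` output shown in the Evaluation section.

```python
# Stage timings in seconds, copied from the table above:
# (model loading, preprocessing, NE recognition).
TIMINGS = {
    "Polish RoBERTa base (fairseq)": (12.28, 50.90, 65.23),
    "HerBERT large (HerbertTokenizerFast)": (18.44, 50.83, 103.70),
    "HerBERT large (XLMTokenizer)": (18.33, 51.42, 177.50),
}

# Dataset size as reported by process_poleval.py ("3.072M characters").
DATASET_CHARS = 3_072_000


def summarize(loading, preprocessing, recognition):
    """Hypothetical helper: total time, recognition share, and throughput."""
    total = loading + preprocessing + recognition
    return {
        "total_s": round(total, 2),
        "recognition_share": round(recognition / total, 3),
        "chars_per_s": round(DATASET_CHARS / recognition),
    }


for name, stages in TIMINGS.items():
    print(name, summarize(*stages))
```

Preprocessing takes about 51 seconds in all three configurations, so the differences in total time come almost entirely from the NE recognition stage.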

NKJP NER times

Comparison of named entity recognition times for different datasets

| Dataset | Size [million chars] | NER time [minutes] |
|---|---|---|
| PolEval 2018 NER test dataset | 3 | 2.6 |
| Monthly volume of news from Polish news portals [70 sources] | 160 | 136.9 |
| Polish Wikipedia (2013 dump) | 1000 | 855.6 |
| Annual volume of news from Polish news portals [70 sources] | 1920 | 1642.7 |
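
The figures in this table scale almost linearly with dataset size, at roughly 0.86 minutes per million characters. The sketch below derives that rate from the table and uses it to extrapolate to other corpus sizes. The `estimate_minutes` helper is hypothetical, and the linear model is an observation drawn from these four rows, assuming the same model and hardware; it is not a guarantee from the tool.

```python
# (size in millions of characters, NER time in minutes), from the table above.
MEASUREMENTS = {
    "PolEval 2018 NER test dataset": (3, 2.6),
    "Monthly news volume (70 sources)": (160, 136.9),
    "Polish Wikipedia (2013 dump)": (1000, 855.6),
    "Annual news volume (70 sources)": (1920, 1642.7),
}


def minutes_per_million(size_mchars, minutes):
    """Processing rate in minutes per million characters."""
    return minutes / size_mchars


# Per-dataset rates; they all land close to 0.86 min per million chars.
rates = {name: minutes_per_million(*m) for name, m in MEASUREMENTS.items()}


def estimate_minutes(size_mchars, rate=None):
    """Hypothetical helper: linear extrapolation from the measured rates."""
    if rate is None:
        rate = sum(rates.values()) / len(rates)
    return size_mchars * rate


for name, rate in rates.items():
    print(f"{name}: {rate:.3f} min per million chars")
print(f"Estimated time for 500M chars: {estimate_minutes(500):.0f} min")
```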


N82 (KPWr and CEN)

Inner-corpora evaluation

| Model | Eval | Precision | Recall | F-measure | Support | Source |
|---|---|---|---|---|---|---|
| PolDeepNer2 (kpwr_n82_base) | KPWr | 75.02 | 77.67 | 76.32 | 4430 | |
| PolDeepNer2 (kpwr_n82_large) | KPWr | 77.05 | 78.79 | 77.91 | 4430 | |
| PolDeepNer (n82-elmo-kgr10) | KPWr | 73.97 | 75.49 | 74.72 | 4430 | link |
| PolDeepNer2 (cen_n82_base) | CEN | 84.64 | 85.95 | 85.29 | 1423 | |
| PolDeepNer2 (cen_n82_large) | CEN | 86.94 | 88.40 | 87.67 | 1423 | |

Cross-corpora evaluation

| Model | Eval | Precision | Recall | F-measure | Support |
|---|---|---|---|---|---|
| PolDeepNer2 (kpwr_n82_base) | CEN | 80.90 | 81.87 | 81.38 | 1423 |
| PolDeepNer2 (kpwr_n82_large) | CEN | 80.16 | 82.08 | 81.11 | 1423 |
| PolDeepNer2 (cen_n82_base) | KPWr | 58.58 | 64.79 | 61.53 | 4430 |
| PolDeepNer2 (cen_n82_large) | KPWr | 61.38 | 66.66 | 63.91 | 4430 |

Installation (with Conda)

Create and activate conda environment:

conda create -n pdn2 python=3.6
conda activate pdn2

Install CUDA, CuDNN and Torch:

conda install -c anaconda cudatoolkit=10.1
conda install -c anaconda cudnn

Install PolDeepNer2:

pip install https://pypi.clarin-pl.eu/packages/poldeepner2-0.5.0-py3-none-any.whl#md5=6a6131d1b3d104f0bbed87ec6969a841

Install the spaCy model:

python -m spacy download pl_core_news_sm

Evaluation

Download the evaluation dataset:

wget http://mozart.ipipan.waw.pl/~axw/poleval2018/POLEVAL-NER_GOLD.json -O POLEVAL-NER_GOLD.json

Polish RoBERTa

Process the dataset:

python process_poleval.py \
  --input POLEVAL-NER_GOLD.json \
  --output pdn2_nkjp_roberta_base_sq.json \
  --model nkjp-base-sq \
  --device cuda:0

Output:

Model loading time          :    12.28 second(s)
Data preprocessing time     :     50.9 second(s)
Data NE recognition time    :    65.23 second(s)
Total time                  :    128.4 second(s)
Data size:                  :    3.072M characters

Evaluate:

python poleval_ner_test.py \
  --goldfile POLEVAL-NER_GOLD.json \
  --userfile pdn2_nkjp_roberta_base_sq.json

Output:

OVERLAP precision: 0.927 recall: 0.912 F1: 0.919 
EXACT precision: 0.899 recall: 0.884 F1: 0.891 
Final score: 0.914
Exact TP=32971 ; FP=3709; FN=4335
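
The EXACT line can be checked against the raw counts on the last line of the output. The sketch below applies the standard precision, recall, and F1 definitions to the reported TP, FP, and FN values. It is an independent check, not the code of `poleval_ner_test.py`, and the `prf` helper is a hypothetical name.

```python
def prf(tp, fp, fn):
    """Standard precision, recall, and F1 from match counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts from the "Exact TP=32971 ; FP=3709; FN=4335" line above.
precision, recall, f1 = prf(tp=32971, fp=3709, fn=4335)
print(f"EXACT precision: {precision:.3f} recall: {recall:.3f} F1: {f1:.3f}")
```

Rounded to three decimals, the values agree with the reported 0.899, 0.884, and 0.891.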

HerBERT

Process the dataset:

python process_poleval.py \
  --input POLEVAL-NER_GOLD.json \
  --output pdn2_nkjp_herbert_large_sq.json \
  --model nkjp-herbert-large-sq \
  --device cuda:0

Output:

Model loading time          :    18.44 second(s)
Data preprocessing time     :    50.83 second(s)
Data NE recognition time    :    103.7 second(s)
Total time                  :    173.0 second(s)
Data size:                  :    3.072M characters

Evaluate:

python poleval_ner_test.py \
  --goldfile POLEVAL-NER_GOLD.json \
  --userfile pdn2_nkjp_herbert_large_sq.json

Output:

OVERLAP precision: 0.929 recall: 0.922 F1: 0.926 
EXACT precision: 0.903 recall: 0.896 F1: 0.900 
Final score: 0.921
Exact TP=33433 ; FP=3596; FN=3873
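
The final score in both runs is consistent with a weighted combination of the two F1 values. The sketch below assumes the PolEval 2018 convention of 0.8 for overlap F1 and 0.2 for exact F1. That weighting is an assumption here, not something stated in this README, and `final_score` is a hypothetical helper rather than part of `poleval_ner_test.py`.

```python
# Assumed PolEval 2018 weighting (not confirmed by this README).
OVERLAP_WEIGHT = 0.8
EXACT_WEIGHT = 0.2


def final_score(overlap_f1, exact_f1):
    """Weighted combination of overlap and exact F1 scores."""
    return OVERLAP_WEIGHT * overlap_f1 + EXACT_WEIGHT * exact_f1


# F1 values as printed in the two evaluation outputs above.
herbert = final_score(overlap_f1=0.926, exact_f1=0.900)
roberta = final_score(overlap_f1=0.919, exact_f1=0.891)

print(f"HerBERT large:       {herbert:.4f} (reported 0.921)")
print(f"Polish RoBERTa base: {roberta:.4f} (reported 0.914)")
```

Because the inputs are already rounded to three decimals, the recomputed values can differ from the reported scores in the last digit.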

Credits
