PolDeepNer2

PolDeepNer2 is an improved version of PolDeepNer. The tool recognizes and categorizes named entities using neural networks and transformer-based language models.

The tool is provided with a list of pre-trained models for Polish and other languages.

It includes a model pre-trained on the NKJP corpus that recognizes nested annotations.

Contributors

Michał Marcińczuk
Jarema Radom

Notebooks

notebooks/pdn2_cpu.py

This notebook shows how to install the module and use its API to process raw text on a CPU.

Models

PolEval 2018 (NKJP NER model)

PolDeepNer2 achieves state-of-the-art (SOTA) results on the PolEval 2018 dataset.

Model	Score	F1 Overlap	F1 Exact	Score main	Time CPU	Time GPU	Source
PolDeepNer2
HerBERT large, spacy-ext, sq	92.1	92.7	89.9			~2m 24s
Polish RoBERTa base, spacy-ext, sq	91.4	91.9	89.1		~1.5 h	~2m 8s
Polish RoBERTa base, toki	90.0	90.5	87.7	92.40	~6h 30m	~6m 30s
Polish RoBERTa base, spacy-ext	89.8	90.4	87.4	92.20		~8m 2s
Systems published after PolEval 2018
Dadas et al. 2020 [1]	88.6	87.0	89.0	-	-	-	link
Polish RoBERTa (large) [1]	-	-	-	89.98	-	-	link
Polish RoBERTa (base) [1]	-	-	-	87.94	-	-	link
spaCy (pl_spacy_model)	-	-	-	87.50	~3m	-	link
Top 3 systems from PolEval 2018
Applica.ai	86.6	87.7	82.6	-	-	-	link
PolDeepNer	85.1	85.9	82.2	-	-	~9m	link
Liner2	81.0	81.8	77.8	-	~3m	-	link

[1] The model is not available. Only the evaluation results were published.

Comparison of loading and processing times

Model	Library	Tokenizer	Model loading [s]	Preprocessing [s]	NE recognition [s]	Total [s]
Polish RoBERTa base	fairseq	-	12.28	50.90	65.23	128.4
HerBERT large	HuggingFace	HerbertTokenizerFast	18.44	50.83	103.70	173.0
HerBERT large	HuggingFace	XLMTokenizer	18.33	51.42	177.50	247.3

Dataset size: 1828 documents (3M characters).
GPU: RTX Titan (24 GB, 4608 CUDA cores).
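
As a rough cross-check of the table above, the total time is the sum of the three stages, and the recognition throughput follows from the dataset size. The sketch below is illustrative only: the helper names are hypothetical, and the 3.072M-character figure is the data size printed by process_poleval.py in the Evaluation section. Totals may differ from the table in the last digit because the table values are already rounded.

```python
# Stage times in seconds (model loading, preprocessing, NE recognition),
# copied from the table above.
TIMINGS = {
    "Polish RoBERTa base (fairseq)": (12.28, 50.90, 65.23),
    "HerBERT large (HerbertTokenizerFast)": (18.44, 50.83, 103.70),
    "HerBERT large (XLMTokenizer)": (18.33, 51.42, 177.50),
}

# "3.072M characters", as printed by process_poleval.py.
DATASET_CHARS = 3_072_000


def total_seconds(stages):
    """Sum of loading, preprocessing and recognition times."""
    return sum(stages)


def chars_per_second(stages):
    """Throughput of the NE recognition stage only."""
    return DATASET_CHARS / stages[2]


for name, stages in TIMINGS.items():
    print(f"{name}: total {total_seconds(stages):.1f} s, "
          f"{chars_per_second(stages):,.0f} chars/s during recognition")
```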

Comparison of named entity recognition times for different datasets

Dataset	Size [million chars]	NER time [minutes]
PolEval 2018 NER test dataset	3	2.6
Monthly volume of news from Polish news portals [70 sources]	160	136.9
Polish Wikipedia (2013 dump)	1000	855.6
Annual volume of news from Polish news portals [70 sources]	1920	1642.7
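
The times in the table grow almost linearly with dataset size: the three larger rows all work out to about 0.856 minutes per million characters. Assuming that rate holds on comparable hardware, the processing time for another corpus can be estimated as below. The constant is derived from the reported numbers and is not a measured value; the function name is hypothetical.

```python
# Rate implied by the table rows (e.g. 855.6 min / 1000 M chars).
# This is an assumption derived from the reported numbers; it only
# makes sense for a similar GPU and model.
MINUTES_PER_MILLION_CHARS = 0.8556


def estimate_ner_minutes(size_million_chars: float) -> float:
    """Linear estimate of NER time for a corpus of the given size."""
    return size_million_chars * MINUTES_PER_MILLION_CHARS


for label, size in [
    ("PolEval 2018 NER test dataset", 3),
    ("Monthly news volume", 160),
    ("Polish Wikipedia (2013 dump)", 1000),
    ("Annual news volume", 1920),
]:
    print(f"{label}: ~{estimate_ner_minutes(size):.1f} min")
```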

N82 (KPWr and CEN)

Inner-corpora evaluation

Model	Eval	Precision	Recall	F-measure	Support	Source
PolDeepNer2 (kpwr_n82_base)	KPWr	75.02	77.67	76.32	4430
PolDeepNer2 (kpwr_n82_large)	KPWr	77.05	78.79	77.91	4430
PolDeepNer (n82-elmo-kgr10)	KPWr	73.97	75.49	74.72	4430	link
---
PolDeepNer2 (cen_n82_base)	CEN	84.64	85.95	85.29	1423
PolDeepNer2 (cen_n82_large)	CEN	86.94	88.40	87.67	1423

Cross-corpora evaluation

Model	Eval	Precision	Recall	F-measure	Support
PolDeepNer2 (kpwr_n82_base)	CEN	80.90	81.87	81.38	1423
PolDeepNer2 (kpwr_n82_large)	CEN	80.16	82.08	81.11	1423
---
PolDeepNer2 (cen_n82_base)	KPWr	58.58	64.79	61.53	4430
PolDeepNer2 (cen_n82_large)	KPWr	61.38	66.66	63.91	4430
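
The F-measure column in both N82 tables appears to be the balanced F1, the harmonic mean of precision and recall. This is an assumption, but it reproduces the rows checked here. A minimal sketch for recomputing it from the other two columns:

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (harmonic mean of precision and recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# kpwr_n82_base evaluated on CEN (cross-corpora table)
print(round(f_measure(80.90, 81.87), 2))
# kpwr_n82_base evaluated on KPWr (inner-corpora table)
print(round(f_measure(75.02, 77.67), 2))
```

Because the table's precision and recall are rounded to two decimals, a recomputed value can differ from the published one in the last digit.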

Installation (with Conda)

Create and activate a conda environment:

conda create -n pdn2 python=3.6
conda activate pdn2

Install CUDA, CuDNN and Torch:

conda install -c anaconda cudatoolkit=10.1
conda install -c anaconda cudnn

Install PolDeepNer2:

pip install https://pypi.clarin-pl.eu/packages/poldeepner2-0.5.0-py3-none-any.whl#md5=6a6131d1b3d104f0bbed87ec6969a841

Install the spaCy model:

python -m spacy download pl_core_news_sm
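
After these steps, a quick check that the expected packages are importable can save time before running the evaluation. The sketch below uses only the standard library. The import names are assumptions: the wheel is expected to expose a package named poldeepner2, and spacy download normally installs the model as an importable package named pl_core_news_sm.

```python
import importlib.util

# Assumed import names; adjust if your installation differs.
REQUIRED = ["torch", "spacy", "pl_core_news_sm", "poldeepner2"]


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def missing_modules(names=REQUIRED):
    """List the modules from `names` that cannot be found."""
    return [name for name in names if not is_installed(name)]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing: " + ", ".join(missing))
    else:
        print("All required packages found.")
```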

Evaluation

Download the evaluation dataset:

wget http://mozart.ipipan.waw.pl/~axw/poleval2018/POLEVAL-NER_GOLD.json -O POLEVAL-NER_GOLD.json

Polish RoBERTa

Process the dataset:

python process_poleval.py \
  --input POLEVAL-NER_GOLD.json \
  --output pdn2_nkjp_roberta_base_sq.json \
  --model nkjp-base-sq \
  --device cuda:0

Output:

Model loading time          :    12.28 second(s)
Data preprocessing time     :     50.9 second(s)
Data NE recognition time    :    65.23 second(s)
Total time                  :    128.4 second(s)
Data size:                  :    3.072M characters

Evaluate:

python poleval_ner_test.py \
  --goldfile POLEVAL-NER_GOLD.json \
  --userfile pdn2_nkjp_roberta_base_sq.json

Output:

OVERLAP precision: 0.927 recall: 0.912 F1: 0.919 
EXACT precision: 0.899 recall: 0.884 F1: 0.891 
Final score: 0.914
Exact TP=32971 ; FP=3709; FN=4335
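
The EXACT line can be reproduced from the TP, FP and FN counts printed on the last line. The sketch below uses the standard definitions of precision, recall and F1. It is not taken from poleval_ner_test.py, and the function name is hypothetical.

```python
def prf_from_counts(tp: int, fp: int, fn: int):
    """Precision, recall and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)  # equivalent to 2PR / (P + R)
    return precision, recall, f1


# Counts from the "Exact TP=...; FP=...; FN=..." line above.
p, r, f1 = prf_from_counts(tp=32971, fp=3709, fn=4335)
print(f"EXACT precision: {p:.3f} recall: {r:.3f} F1: {f1:.3f}")
# prints: EXACT precision: 0.899 recall: 0.884 F1: 0.891
```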

HerBERT

Process the dataset:

python process_poleval.py \
  --input POLEVAL-NER_GOLD.json \
  --output pdn2_nkjp_herbert_large_sq.json \
  --model nkjp-herbert-large-sq \
  --device cuda:0

Output:

Model loading time          :    18.44 second(s)
Data preprocessing time     :    50.83 second(s)
Data NE recognition time    :    103.7 second(s)
Total time                  :    173.0 second(s)
Data size:                  :    3.072M characters

Evaluate:

python poleval_ner_test.py \
  --goldfile POLEVAL-NER_GOLD.json \
  --userfile pdn2_nkjp_herbert_large_sq.json

Output:

OVERLAP precision: 0.929 recall: 0.922 F1: 0.926 
EXACT precision: 0.903 recall: 0.896 F1: 0.900 
Final score: 0.921
Exact TP=33433 ; FP=3596; FN=3873
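
The final score printed by the evaluation script is consistent with a weighted mean of the two F1 values, with a weight of 0.8 for overlap matching and 0.2 for exact matching. That weighting is an assumption here, not read from the script, but it reproduces the HerBERT score above. A minimal sketch:

```python
# Assumed weights; they reproduce the printed HerBERT final score.
OVERLAP_WEIGHT = 0.8
EXACT_WEIGHT = 0.2


def final_score(f1_overlap: float, f1_exact: float) -> float:
    """Weighted mean of overlap and exact F1."""
    return OVERLAP_WEIGHT * f1_overlap + EXACT_WEIGHT * f1_exact


# HerBERT large: OVERLAP F1 0.926, EXACT F1 0.900
print(f"Final score: {final_score(0.926, 0.900):.3f}")
# prints: Final score: 0.921
```

With the rounded Polish RoBERTa values (0.919 and 0.891) the same formula gives 0.913 rather than the printed 0.914. A difference of that size would be expected if the script combines unrounded F1 values.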

Credits

This code is based on xlm-roberta-ner by mohammadKhalifa.
Language models for Polish:
