From 6bfd9727c0af71dd8a3224d132e1e693be8c2bd4 Mon Sep 17 00:00:00 2001
From: maudehrmann
Date: Wed, 20 Jan 2021 17:14:57 +0100
Subject: [PATCH] adjustments

---
 data/README.md | 57 ++++++++++++++++++++++++++++++++------------------
 1 file changed, 37 insertions(+), 20 deletions(-)

diff --git a/data/README.md b/data/README.md
index 8006b2d..2bfcf71 100644
--- a/data/README.md
+++ b/data/README.md
@@ -1,35 +1,52 @@
-## About:
+## Combining textual and visual features for newspaper article segmentation: Datasets & Models
-**Datasets and models** related to the experiments on combining textual and visual features for newspaper article segmentation.
+### Image annotations
+The folder contains image annotations, with one file per newspaper containing region annotations (label and coordinates) in [VIA](http://www.robots.ox.ac.uk/~vgg/software/via/) format (v2.0.10).
-**Zenodo record:** (upcoming) 10.5281/zenodo.4065271
+The following licenses apply:
+- `luxwort.json`: those annotations are under a [CC0 1.0](https://creativecommons.org/publicdomain/zero/1.0/legalcode) license. Please refer to the rights statement specified for each image in the JSON file.
-### Annotations
-The folder contains image annotations, with one file per newspaper containing region annotations (label and coordinates) in [VIA](http://www.robots.ox.ac.uk/~vgg/software/via/) format. Depending on the newspaper, those annotations are under license [CC0 1.0](https://creativecommons.org/publicdomain/zero/1.0/) or [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/). Please refer to the rights statements in each file.
-### Images
-The images are released as an asset to the current Github release in a zip file. It contains the files of the Swiss titles (GDL, IMP, JDG). Those images are under copyright but can be used for research purposes (redistribution, publication or commercial use are ***not*** permitted).
Images of the Luxembourgish title are available through the IIIF endpoint of the National Library of Luxembourg. Please refer to the rights statements and information in each file.
+- `GDL.json`, `IMP.json` and `JDG.json`: those annotations are under a [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/legalcode) license.
+
+
+
+### Image files
+- Images of Swiss titles (GDL, IMP, JDG) are released as an asset of the current GitHub [release](https://github.com/dhlab-epfl/dhSegment-text/releases/tag/0.1), in the `images.zip` archive.
+ **Terms of use**: those images are under copyright but can be used for research purposes only. Redistribution, publication or commercial use are ***not*** permitted.
+
+- Images of the Luxembourgish title are available through the IIIF endpoint of the National Library of Luxembourg (see URL in the annotation file `luxwort.json`).
+
+
+
+### Trained models
+
+Some of the best models are released as assets of the current GitHub [release](https://github.com/dhlab-epfl/dhSegment-text/releases/tag/0.1) in zip files.
+
+- **JDG_flair-FT**: this model was trained on JDG using French Flair and FastText embeddings. It is able to predict the four classes presented in the paper (`Serial`, `Weather`, `Death notice` and `Stocks`).
+- **Luxwort_obituary_flair-bpemb**: this model was trained on Luxwort using multilingual Flair and Byte-pair embeddings. It is able to predict the `Death notice` class.
+- **Luxwort_obituary_flair-FT_indomain**: this model was trained on Luxwort using in-domain Flair and FastText embeddings (trained on Luxwort data). It is also able to predict the `Death notice` class.
+
+Those models can be used to predict probabilities on new images using the same code as in the original repository.
+One needs to adjust three parameters of the `predict` function: 1) `embeddings_path` (the path to the embeddings list), 2) `embeddings_map_path` (the path to the compressed embedding map), and 3) `embeddings_dim` (the size of the embeddings).
+
+Models are available under a [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/) license. Please refer to the [paper](https://github.com/dhlab-epfl/dhSegment-text#paper) for further information or contact us.
+
+
+
+### DOI
+
+[https://doi.org/10.5281/zenodo.3706863](https://doi.org/10.5281/zenodo.3706863)
-### trained-models
-The models are released as assets of the current Github release in corresponding zip files. They contains some of the best models, as described in the corresponding paper. Available under a [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/) license.
-They can be used to predict probabilities on new images using the same code as in the original repository.
-The only difference is the need for three additional parameters to the `predict` function, `embeddings_path` which the path to the embeddings list,
-`embeddings_map_path` which is the path to the compressed embedding map and `embeddings_dim` which is the size of the embeddings.
-The following models are shared:
-- **JDG_flair-FT**: trained on JDG using french Flair and FastText embeddings and is able to predict
-the four classes presented in the paper (`Serial`, `Weather`, `Death notice` and `Stocks`).
-- **Luxwort_obituary_flair-bpemb**: trained on Luxwort using multilingual Flair and Byte-pair embeddings and is able to predict only the `Death notice` class.
-- **Luxwort_obituary_flair-FT_indomain**: trained on Luxwort using in-domain Flair and FastText embeddings (trained on Luxwort data) and is also only able to
-predict the `Death notice` class.
+## Acknowledgements
+We warmly thank the journal [Le Temps](https://letemps.ch) (owner of *La Gazette de Lausanne* and the *Journal de Genève*) and the group [ArcInfo](https://www.arcinfo.ch/) (owner of *L'Impartial*) for agreeing to share the related datasets for academic purposes. We also thank the [National Library of Luxembourg](https://bnl.public.lu/fr.html) for its support with all steps related to the *Luxemburger Wort* annotation release.
-## Acknowledgements:
-We warmly thank the journal [Le Temps](https://letemps.ch) (owner of *La Gazette de Lausanne* and the *Journal de Genève*) and the group [ArcInfo](https://www.arcinfo.ch/) (owner of *L'Impartial*) for accepting to share the related datasets for academic purposes. We also thank the [National Library of Luxembourg](https://bnl.public.lu/fr.html) for its work and support with all steps related to the *Luxemburger Wort* annotation release.