Adjustments
e-maud committed Jan 20, 2021
1 parent d5728dc commit 6bfd972
Showing 1 changed file with 37 additions and 20 deletions.
57 changes: 37 additions & 20 deletions data/README.md
@@ -1,35 +1,52 @@
## About:
## Combining textual and visual features for newspaper article segmentation: Datasets & Models

**Datasets and models** related to the experiments on combining textual and visual features for newspaper article segmentation.


### Image annotations
The folder contains image annotations, with one file per newspaper containing region annotations (label and coordinates) in [VIA](http://www.robots.ox.ac.uk/~vgg/software/via/) format (v2.0.10).

**Zenodo record:** (upcoming) 10.5281/zenodo.4065271
The following licenses apply:
- `luxwort.json`: these annotations are under a [CC0 1.0](https://creativecommons.org/publicdomain/zero/1.0/legalcode) license. Please refer to the rights statement specified for each image in the JSON file.

### Annotations
The folder contains image annotations, with one file per newspaper containing region annotations (label and coordinates) in [VIA](http://www.robots.ox.ac.uk/~vgg/software/via/) format. Depending on the newspaper, those annotations are under license [CC0 1.0](https://creativecommons.org/publicdomain/zero/1.0/) or [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/). Please refer to the rights statements in each file.
### Images
The images are released as an asset of the current GitHub release in a zip file. The archive contains the files of the Swiss titles (GDL, IMP, JDG). These images are under copyright but can be used for research purposes (redistribution, publication, and commercial use are ***not*** permitted). Images of the Luxembourgish title are available through the IIIF endpoint of the National Library of Luxembourg. Please refer to the rights statements and information in each file.
- `GDL.json`, `IMP.json` and `JDG.json`: these annotations are under a [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/legalcode) license.
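
As an illustration of how the region annotations (label and coordinates) could be read, here is a minimal Python sketch. It assumes the usual VIA 2.x JSON layout, in which each image entry has a `regions` list whose items carry `shape_attributes` and `region_attributes`. The sample data, the `label` attribute name, and the helper names `iter_via_regions` and `bounding_box` are illustrative assumptions, not taken from the released files.

```python
from typing import Any, Dict, Iterator, Tuple


def iter_via_regions(
    via: Dict[str, Any],
) -> Iterator[Tuple[str, Dict[str, Any], Dict[str, Any]]]:
    """Yield (filename, shape_attributes, region_attributes) for each region."""
    # A VIA project file nests image entries under "_via_img_metadata";
    # a plain annotation export is that mapping itself.
    images = via.get("_via_img_metadata", via)
    for entry in images.values():
        if not isinstance(entry, dict) or "regions" not in entry:
            continue
        regions = entry["regions"]
        # Older exports store regions as a dict keyed by index.
        if isinstance(regions, dict):
            regions = list(regions.values())
        for region in regions:
            yield (
                entry.get("filename", ""),
                region.get("shape_attributes", {}),
                region.get("region_attributes", {}),
            )


def bounding_box(shape: Dict[str, Any]) -> Tuple[int, int, int, int]:
    """Return (x_min, y_min, x_max, y_max) for a rect or polygon shape."""
    name = shape.get("name")
    if name == "rect":
        x, y = shape["x"], shape["y"]
        return x, y, x + shape["width"], y + shape["height"]
    if name in ("polygon", "polyline"):
        xs, ys = shape["all_points_x"], shape["all_points_y"]
        return min(xs), min(ys), max(xs), max(ys)
    raise ValueError(f"unsupported shape type: {name!r}")


# Hypothetical sample in VIA 2.x export layout (not real annotation data).
sample = {
    "page_001.jpg12345": {
        "filename": "page_001.jpg",
        "size": 12345,
        "regions": [
            {
                "shape_attributes": {
                    "name": "rect", "x": 10, "y": 20, "width": 100, "height": 50,
                },
                "region_attributes": {"label": "Death notice"},
            },
            {
                "shape_attributes": {
                    "name": "polygon",
                    "all_points_x": [5, 40, 30],
                    "all_points_y": [8, 12, 60],
                },
                "region_attributes": {"label": "Weather"},
            },
        ],
        "file_attributes": {},
    }
}

for filename, shape, attrs in iter_via_regions(sample):
    print(filename, attrs, bounding_box(shape))
```

To use it on a released file, load the file with `json.load` and pass the resulting dictionary to `iter_via_regions`; the actual attribute names should be checked against the file itself.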



### Image files
- Images of Swiss titles (GDL, IMP, JDG) are released as an asset of the current GitHub [release](https://github.com/dhlab-epfl/dhSegment-text/releases/tag/0.1), in the `images.zip` archive.
**Terms of use**: these images are under copyright and can be used for research purposes only. Redistribution, publication, and commercial use are ***not*** permitted.

- Images of the Luxembourgish title are available through the IIIF endpoint of the National Library of Luxembourg (see URL in the annotation file `luxwort.json`).
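
For readers unfamiliar with IIIF, the sketch below shows how a region of an image could be requested using the standard IIIF Image API URL pattern `{base}/{region}/{size}/{rotation}/{quality}.{format}`. The base URL is a placeholder and the helper name `iiif_region_url` is hypothetical; how the URLs are actually stored in `luxwort.json` is not shown here and should be checked in the file.

```python
def iiif_region_url(
    base_url: str,
    x: int,
    y: int,
    width: int,
    height: int,
    size: str = "max",
    rotation: int = 0,
    quality: str = "default",
    fmt: str = "jpg",
) -> str:
    """Build an IIIF Image API URL for a rectangular pixel region.

    base_url is the image's IIIF base URI (server, prefix and identifier),
    without a trailing "/info.json".
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    region = f"{x},{y},{width},{height}"
    return f"{base_url.rstrip('/')}/{region}/{size}/{rotation}/{quality}.{fmt}"


# Placeholder base URI, for illustration only.
example_url = iiif_region_url("https://example.org/iiif/some-image-id", 10, 20, 100, 50)
print(example_url)
```

Passing the coordinates of a rectangular annotation region would return only that crop, which avoids downloading full pages.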



### Trained models

Some of the best models are released as assets of the current GitHub [release](https://github.com/dhlab-epfl/dhSegment-text/releases/tag/0.1) in zip files.

- **JDG_flair-FT**: this model was trained on JDG using French Flair and FastText embeddings. It is able to predict the four classes presented in the paper (`Serial`, `Weather`, `Death notice` and `Stocks`).
- **Luxwort_obituary_flair-bpemb**: this model was trained on Luxwort using multilingual Flair and Byte-pair embeddings. It is able to predict the `Death notice` class.
- **Luxwort_obituary_flair-FT_indomain**: this model was trained on Luxwort using in-domain Flair and FastText embeddings (trained on Luxwort data). It is also able to predict the `Death notice` class.

These models can be used to predict probabilities on new images using the same code as in the original repository.
Three additional parameters must be passed to the `predict` function: 1) `embeddings_path` (the path to the embeddings list), 2) `embeddings_map_path` (the path to the compressed embedding map), and 3) `embeddings_dim` (the size of the embeddings).
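
As a rough sketch of how these three parameters might be prepared before calling `predict`, the hypothetical helper below validates them and bundles them into keyword arguments. The helper name, the file names, and the dimension value are placeholders chosen for illustration; the remaining arguments of `predict` are not reproduced here and should be taken from the original repository.

```python
from pathlib import Path
from typing import Any, Dict, Union

PathLike = Union[str, Path]


def embedding_kwargs(
    embeddings_path: PathLike,
    embeddings_map_path: PathLike,
    embeddings_dim: int,
    check_exists: bool = True,
) -> Dict[str, Any]:
    """Validate and bundle the three embedding-related parameters."""
    if embeddings_dim <= 0:
        raise ValueError("embeddings_dim must be a positive integer")
    paths = {
        "embeddings_path": Path(embeddings_path),
        "embeddings_map_path": Path(embeddings_map_path),
    }
    if check_exists:
        # Fail early with a clear message if a file is missing.
        for name, path in paths.items():
            if not path.exists():
                raise FileNotFoundError(f"{name} does not exist: {path}")
    kwargs: Dict[str, Any] = {name: str(path) for name, path in paths.items()}
    kwargs["embeddings_dim"] = int(embeddings_dim)
    return kwargs


# Placeholder file names and dimension; replace with the values that match
# the downloaded model. check_exists=False only so the example runs as-is.
kwargs = embedding_kwargs(
    "embeddings_list_file", "embeddings_map_file", 300, check_exists=False
)
print(kwargs)
# The dictionary could then be unpacked into the call, e.g. predict(..., **kwargs),
# with the other arguments as documented in the original repository.
```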

Models are available under a [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/) license. Please refer to the [paper](https://github.com/dhlab-epfl/dhSegment-text#paper) for further information or contact us.



### DOI

[https://doi.org/10.5281/zenodo.3706863](https://doi.org/10.5281/zenodo.3706863)

### trained-models

The models are released as assets of the current GitHub release in corresponding zip files. They contain some of the best models, as described in the corresponding paper. Available under a [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/) license.

They can be used to predict probabilities on new images using the same code as in the original repository.
The only difference is the need for three additional parameters to the `predict` function: `embeddings_path`, which is the path to the embeddings list;
`embeddings_map_path`, which is the path to the compressed embedding map; and `embeddings_dim`, which is the size of the embeddings.

The following models are shared:
- **JDG_flair-FT**: trained on JDG using French Flair and FastText embeddings; it is able to predict
the four classes presented in the paper (`Serial`, `Weather`, `Death notice` and `Stocks`).
- **Luxwort_obituary_flair-bpemb**: trained on Luxwort using multilingual Flair and Byte-pair embeddings and is able to predict only the `Death notice` class.
- **Luxwort_obituary_flair-FT_indomain**: trained on Luxwort using in-domain Flair and FastText embeddings (trained on Luxwort data) and is also only able to
predict the `Death notice` class.
## Acknowledgements

We warmly thank the journal [Le Temps](https://letemps.ch) (owner of *La Gazette de Lausanne* and the *Journal de Genève*) and the group [ArcInfo](https://www.arcinfo.ch/) (owner of *L'Impartial*) for agreeing to share the related datasets for academic purposes. We also thank the [National Library of Luxembourg](https://bnl.public.lu/fr.html) for its support with all steps related to the *Luxemburger Wort* annotation release.

## Acknowledgements:

We warmly thank the journal [Le Temps](https://letemps.ch) (owner of *La Gazette de Lausanne* and the *Journal de Genève*) and the group [ArcInfo](https://www.arcinfo.ch/) (owner of *L'Impartial*) for agreeing to share the related datasets for academic purposes. We also thank the [National Library of Luxembourg](https://bnl.public.lu/fr.html) for its work and support with all steps related to the *Luxemburger Wort* annotation release.


