Adding dataset LiDi 1.0 project #152

Giorgiaagostini · 2024-07-04T09:00:15Z

Hello HTR-united team!

Please consider the following dataset description for inclusion in your directory.

Here is our dataset YAML file:

schema: https://htr-united.github.io/schema/2023-06-27/schema.json
title: LiDi1.0-project
url: https://github.com/Giorgiaagostini/LiDi1.0-project
authors:
 - name: Giorgia
   surname: Agostini
   orcid: 0009-0007-9887-5129
   roles:
     - transcriber
     - aligner
     - project-manager
     - quality-control
institutions: []
description: >-
 This repository contains all data relating to the LiDi 1.0 project, in
 particular the HTR ground truth (GT) for the 16th-century antiquarian Pirro
 Ligorio, used to create the public Transkribus model Ligorio 0.3 PyL.
project-name: LiDi 1.0
project-website: https://lidiws-limes.cfs.unipi.it
language:
 - ita
production-software: Transkribus
automatically-aligned: false
script:
 - iso: Latn
 - iso: Grek
script-type: only-manuscript
time:
 notBefore: '1568'
 notAfter: '1580'
hands:
 count: '1'
 precision: estimated
license:
 name: CC-BY-SA 4.0
 url: https://creativecommons.org/licenses/by-sa/4.0/
format: Alto-XML
sources:
 - reference: ''
   link: >-
     https://archiviodistatotorino.beniculturali.it/dbadd/visvol_bibl.php?uid=300146
volume:
 - metric: files
   count: 195
citation-file-link: >-
 https://github.com/Giorgiaagostini/LiDi1.0-project/blob/main/Data/Ground%20Truth/CITATION.cff
transcription-guidelines: >-
 - Normalisation of «V» to «U» except in Latin inscriptions;

 - Preservation of the diacritical marks and punctuation as used by the Author
 except for the part in Greek;

 - Where the use of capital and small caps is not distinguished, it is
 transcribed according to the grammatical rules of the Italian language;

 - Tagging of uncertain words with the «unclear» tag;

 - Tagging of illegible words with three dots (...) and the «unclear» tag;

 - Use of the angle dash, instead of the hyphen, to divide words into syllables
 at the end of a line.

 Moreover, due to issues displaying the Unicode characters for ancient symbols,
 the Roman denarius (U+10196) and Roman sestertius (U+10198) signs were
 transcribed using astronomical symbols that the author does not use:

 Roman denarius sign ➛ ♀ (U+2640 Female sign)

 Roman sestertius sign ➛ ☿ (U+263F Mercury)

 These placeholders are meant to be replaced with the correct signs during
 post-processing.
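
The post-processing substitution described in the guidelines could look roughly like the Python sketch below. It is only an illustration: the function name `restore_roman_signs` and the mapping name are hypothetical, and it assumes, as the guidelines state, that ♀ and ☿ never occur in the transcriptions with their own meaning, so a global replacement is safe.

```python
# Placeholder symbols used during transcription -> intended Unicode signs.
# The code points come from the transcription guidelines; the names are illustrative.
PLACEHOLDER_TO_ROMAN = {
    "\u2640": "\U00010196",  # FEMALE SIGN -> ROMAN DENARIUS SIGN
    "\u263F": "\U00010198",  # MERCURY -> ROMAN SESTERTIUS SIGN
}

# str.maketrans accepts a dict of single-character keys.
_TABLE = str.maketrans(PLACEHOLDER_TO_ROMAN)


def restore_roman_signs(text: str) -> str:
    """Replace the placeholder symbols with the Roman currency signs."""
    return text.translate(_TABLE)
```

The function can be applied to the text content of each transcribed line after export; all other characters, including Greek letters and diacritics, pass through unchanged.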

alix-tz · 2024-07-04T15:50:17Z

Hello Giorgia,

Thank you very much for your contribution!

It looks like there are only the XML files in your repository, which is not enough to get a complete GT dataset. I see however that in "sources" you put the link to the image visualizer on the website of the Archivio di stato di Torino. I think it would be useful if you can add, in the README of your dataset repository, clear indications that the images are not included in the dataset but that they can be downloaded there (if they can be?). Basically anything to facilitate the reconstruction of the ground truth dataset.

From comparing the viewer and your data, I have the impression that you pre-processed the images to get single pages instead of double pages. This pre-processing step might be difficult to reproduce in a way that guarantees that the images and the XML files are correctly aligned. If my understanding is right, then in my opinion this is reason enough to publish your pre-processed images along with the XML files (if the license on the images allows it).
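
To make the pairing between XML files and page images easier to check, a small manifest could be generated from the ALTO files. The sketch below is a hedged example and does not describe the repository's actual tooling: it assumes each ALTO file records its image name in a `fileName` element (normally under `sourceImageInformation`), and the helper names `referenced_image` and `build_manifest` are hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Dict, Optional


def local_name(tag: str) -> str:
    """Strip the XML namespace so ALTO v2, v3 and v4 files are handled alike."""
    return tag.rsplit("}", 1)[-1]


def referenced_image(root: ET.Element) -> Optional[str]:
    """Return the first non-empty <fileName> value in an ALTO tree, if any."""
    for element in root.iter():
        if local_name(element.tag) == "fileName" and element.text:
            return element.text.strip()
    return None


def build_manifest(xml_dir: str) -> Dict[str, Optional[str]]:
    """Map each ALTO file name in xml_dir to the image name it declares."""
    manifest = {}
    for path in sorted(Path(xml_dir).glob("*.xml")):
        manifest[path.name] = referenced_image(ET.parse(path).getroot())
    return manifest
```

A manifest like this only shows which image each XML file expects. It does not verify that a re-cropped single-page image matches the coordinates in the XML, so it cannot replace publishing the pre-processed images.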

What do you think? Is there anything that can be done in this regard?

Giorgiaagostini · 2024-07-26T08:03:14Z

Dear Alix,

I am sorry for getting back to you only now. Thank you for your advice. Unfortunately, the images can't be downloaded from the digital library of the Archivio di Stato di Torino. I will try to get permission to publish the images, which were already pre-processed by the archive.

I will keep you posted.

Best regards,
Giorgia Agostini
PhD in History of the Arts and Performing Arts - Digital Humanities, Università degli Studi di Firenze (SAGAS)
https://lidiws-limes.cfs.unipi.it/
