ONB importer #120

Open
1 of 3 tasks
e-maud opened this issue Nov 12, 2023 · 1 comment

@e-maud
Member

e-maud commented Nov 12, 2023

High-level issue on the conversion of ONB hOCR files to canonical.
Partially depends on milestone 🚩: ONB Acquisition.

  • Exploration and decision on approach

    We need to decide which is better: converting hOCR => canonical directly, or passing through ALTO to reuse existing code (hOCR => ALTO => canonical).

    A few links that may be useful:

  • First implementation on samples

    • After having clarified sample data details with ONB (cf. issue#17 in data-acquisition)
    • Link to samples: on gdrive.
  • Full importer test on ONB complete data

@piconti
Member

piconti commented Dec 14, 2023

Update on the progress for the ONB importer.

A first version of the ONB importer Alto -> Canonical has been implemented to handle all the ANNO data.

In order to have a better idea of the possibilities regarding the ANNOP data, which is in hOCR format, a few hOCR -> Alto converters have been tested on a small sample of data.
Note that the source data (sample esj/1772/0057) seems to have some irregularities in its formatting or contents. In particular, none of the converters tried worked when the source file contained the entity &shy; between two <span> tags, e.g.:
<span class='ocrx_word' title='bbox 851 1642 933 1682;x_wconf 28'>’110/2</span>&shy;</span><span class='ocr_line' title='bbox 141 1704 953 1790;x_wconf 43'>
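
As an illustration of a possible workaround, here is a minimal sketch that strips `&shy;` entities sitting between two tags before the file is handed to a converter. The cause of the failures is not confirmed in this thread; one plausible explanation is that `&shy;` is an HTML named entity that XML does not predefine, so strict XML parsers may reject it. The function name `strip_shy_between_tags` is hypothetical and not part of any existing importer code.

```python
import re

# A soft-hyphen entity located directly between the end of one tag ('>')
# and the start of the next ('<'), optionally surrounded by whitespace.
_SHY_BETWEEN_TAGS = re.compile(r"(?<=>)\s*&shy;\s*(?=<)")


def strip_shy_between_tags(hocr: str) -> str:
    """Remove '&shy;' entities found between two tags.

    Soft hyphens inside word text (e.g. '<span>Wort&shy;</span>') are kept,
    since they may carry hyphenation information.
    """
    return _SHY_BETWEEN_TAGS.sub("", hocr)


sample = (
    "<span class='ocrx_word' title='bbox 851 1642 933 1682;x_wconf 28'>"
    "’110/2</span>&shy;</span>"
    "<span class='ocr_line' title='bbox 141 1704 953 1790;x_wconf 43'>"
)
cleaned = strip_shy_between_tags(sample)
```

Whether such pre-cleaning is enough for any of the converters listed below has not been tested here.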

The converters tested were the following:

  • ocr-fileformat
    • Unsatisfactory results, with the values for height systematically missing or NaN.
    • Text style and language information are lost in the process.
    • Runs using Docker, either with a web interface or a CLI.
  • hOCR-to-ALTO
    • Similar results to ocr-fileformat; coordinates are also not correctly parsed.
    • Text style and language information are lost in the process.
  • hOCRTools
    • Yields the best results of all the tested converters. The coordinates are parsed correctly.
    • Text style and language information are lost in the process.
    • Could theoretically be used, but it failed to run on many of the sample pages tried, so it cannot be used at the relatively large scale we need.

Overall, especially given the slightly irregular data, it seems easier and more reliable to directly implement another ONB importer performing hOCR -> Canonical, as the hOCR syntax is relatively simple.
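
To give an idea of how little is needed to read hOCR directly, here is a minimal sketch using only Python's standard `html.parser`, which is lenient about stray `&shy;` entities and unbalanced `</span>` tags. It is not the importer itself: the names `WordExtractor` and `extract_words`, and the `[x, y, width, height]` output layout, are assumptions made for illustration only.

```python
from html.parser import HTMLParser


def parse_title(title):
    """Split an hOCR title such as 'bbox 1 2 3 4;x_wconf 28' into {name: [values]}."""
    props = {}
    for part in title.split(";"):
        tokens = part.split()
        if tokens:
            props[tokens[0]] = tokens[1:]
    return props


class WordExtractor(HTMLParser):
    """Collect the text and bounding box of every 'ocrx_word' span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None  # word being read, or None when outside a word

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag != "span" or "ocrx_word" not in classes:
            return
        bbox = parse_title(attrs.get("title") or "").get("bbox", [])
        if len(bbox) != 4:
            return  # skip words without a usable bounding box
        x0, y0, x1, y1 = (int(v) for v in bbox)
        # Assumed output layout: [x, y, width, height]
        self._current = {"text": "", "coords": [x0, y0, x1 - x0, y1 - y0]}

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "span" and self._current is not None:
            self.words.append(self._current)
            self._current = None


def extract_words(hocr):
    parser = WordExtractor()
    parser.feed(hocr)
    parser.close()
    return parser.words


sample = (
    "<span class='ocrx_word' title='bbox 851 1642 933 1682;x_wconf 28'>"
    "’110/2</span>&shy;</span>"
    "<span class='ocr_line' title='bbox 141 1704 953 1790;x_wconf 43'>"
)
words = extract_words(sample)
```

On the irregular fragment quoted above, this yields a single word with its coordinates, and the stray `&shy;` and extra `</span>` are simply ignored. Text style and language attributes could be read from the same `attrs` dictionary if the source files carry them.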
