ONB importer #120

e-maud · 2023-11-12T17:37:30Z

High-level issue on ONB hOCR file conversion to canonical.
Partially depends on milestone 🚩: ONB Acquisition.

Exploration and decision on approach

It will have to be decided which is better: converting hOCR => canonical directly, or passing through ALTO to benefit from already written pieces of code (hOCR => ALTO => canonical).

A few links which may be useful:
- https://github.com/cneud/ocr-conversion
- https://github.com/ocropus/hocr-tools

First implementation on samples

- After having clarified sample data details with ONB (cf. issue #17 in data-acquisition)
- Link to samples: on gdrive.

Full importer test on ONB complete data

piconti · 2023-12-14T17:18:10Z

Update on the progress for the ONB importer.

A first version of the ONB importer (ALTO -> Canonical) has been implemented to handle all the ANNO data.

To get a better idea of the options for the ANNOP data, which is in hOCR format, a few hOCR -> ALTO converters were tested on a small sample of data.
Note that the source data (sample esj/1772/0057) seems to have some irregularities in its formatting or contents. In particular, none of the converters tried worked whenever the source file contained the characters  between two  separators. E.g.:
’110/2

The converters tested were the following:

ocr-fileformat
- Unsatisfactory results, with the height values systematically missing or NaN.
- Text-style and language information is lost in the process.
- Runs using Docker, with either a web interface or a CLI.

hOCR-to-ALTO
- Similar results to ocr-fileformat; coordinates are also not parsed correctly.
- Text-style and language information is lost in the process.

hOCRTools
- Yields the best results of all the tested converters. The coordinates are parsed correctly.
- Text-style and language information is lost in the process.
- Could theoretically be used, but it failed to run on many of the sample pages tried, so it cannot be used at the relatively large scale we need.

Overall, especially given the slightly irregular data, it seems easier and more reliable to directly implement another ONB importer performing hOCR -> Canonical, as the hOCR syntax is relatively simple.
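
To illustrate how little is needed to read hOCR directly, here is a rough sketch using only the Python standard library. It relies on two hOCR conventions: words are `span` elements with class `ocrx_word`, and their coordinates sit in the `title` attribute as `bbox x0 y0 x1 y1`. Everything else is an assumption made for illustration: the names `HocrWordParser` and `parse_hocr_words`, the output keys `text` and `coords`, the `[x, y, width, height]` layout, and the sample markup. Mapping to the actual canonical schema and handling the irregularities mentioned above are left out.

```python
import re
from html.parser import HTMLParser

# hOCR stores coordinates in the title attribute, e.g. "bbox 100 200 180 230; x_wconf 93".
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


class HocrWordParser(HTMLParser):
    """Collect the text and bounding box of every hOCR `ocrx_word` span."""

    def __init__(self):
        super().__init__()
        self.words = []       # finished words
        self._current = None  # word being read, or None outside a word
        self._depth = 0       # nesting depth of <span> inside the current word

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._current is not None:
            self._depth += 1  # a span nested inside a word
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        match = BBOX_RE.search(attrs.get("title") or "")
        if "ocrx_word" in classes and match:
            x0, y0, x1, y1 = map(int, match.groups())
            # Assumed output layout: [x, y, width, height].
            self._current = {"coords": [x0, y0, x1 - x0, y1 - y0], "text": ""}

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag != "span" or self._current is None:
            return
        if self._depth:
            self._depth -= 1
            return
        self._current["text"] = self._current["text"].strip()
        self.words.append(self._current)
        self._current = None


def parse_hocr_words(hocr: str) -> list[dict]:
    """Hypothetical helper: return all words found in an hOCR string."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


# Made-up sample, not taken from the ONB data.
sample = """<div class='ocr_page' title='bbox 0 0 1000 1500'>
 <span class='ocr_line' title='bbox 100 200 400 230'>
  <span class='ocrx_word' title='bbox 100 200 180 230; x_wconf 93'>Wien</span>
  <span class='ocrx_word' title='bbox 190 200 240 230; x_wconf 88'><strong>den</strong></span>
 </span>
</div>"""

for word in parse_hocr_words(sample):
    print(word)
```

`html.parser` is used here rather than a strict XML parser because it tolerates markup that is not well-formed, which may matter for irregular source files. Whether it copes with the specific irregularity seen in the sample has not been checked.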

e-maud assigned piconti Nov 12, 2023