High-level issue on ONB hOCR file conversion to canonical.
Partially Depends on milestone 🚩: ONB Acquisition.
Exploration and decision on approach
We will have to decide which approach is best: converting hOCR => canonical directly, or passing through ALTO to benefit from code that is already written (hOCR => ALTO => canonical).
A first version of the ONB importer Alto -> Canonical has been implemented to handle all the ANNO data.
In order to have a better idea of the possibilities regarding the ANNOP data, which is in hOCR format, a few hOCR -> Alto converters have been tested on a small sample of data.
It's worth noting that the source data (sample esj/1772/0057) does seem to have some irregularities in its formatting or contents. In particular, none of the converters tried worked whenever the source file contained a soft-hyphen character (­) between two <span> separators. E.g.: <span class='ocrx_word' title='bbox 851 1642 933 1682;x_wconf 28'>’110/2</span>­</span><span class='ocr_line' title='bbox 141 1704 953 1790;x_wconf 43'>
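To illustrate the irregularity above, here is a minimal sketch of a tolerant word extractor built on Python's standard `html.parser`. It assumes the stray character is a soft hyphen (U+00AD) and that words are marked with the `ocrx_word` class as in the sample; the class name `HocrWordParser` and the helper `extract_words` are hypothetical names used for illustration, not part of the existing importer.

```python
from html.parser import HTMLParser

SOFT_HYPHEN = "\u00ad"  # assumption: the stray character seen in the sample


class HocrWordParser(HTMLParser):
    """Collect the text and title of each ocrx_word span, ignoring stray text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.words = []
        self._current = None  # word span currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            self._current = {"title": attrs.get("title") or "", "text": ""}

    def handle_data(self, data):
        # Text outside a word span (e.g. a soft hyphen between two
        # closing tags) is dropped instead of breaking the parse.
        if self._current is not None:
            self._current["text"] += data.replace(SOFT_HYPHEN, "")

    def handle_endtag(self, tag):
        if tag == "span" and self._current is not None:
            self.words.append(self._current)
            self._current = None


def extract_words(hocr):
    """Return a list of {'title', 'text'} dicts for every word span."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words
```

Because `HTMLParser` does not enforce balanced tags, the unmatched closing and opening spans in the sample do not raise errors here; whether that leniency is acceptable for the full ANNOP data would still need checking.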
Yields the best results of all the tested converters. The coordinates are parsed correctly.
Text-style, or language information is lost in the process.
Could theoretically be used, but it failed to run on many of the sample pages tried, so it cannot be relied on at the relatively large scale we would need.
Overall, it seems easier and more reliable to implement another ONB importer that converts hOCR -> Canonical directly: the hOCR syntax is relatively simple, and a dedicated importer can cope with the slightly irregular data.
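As a rough indication of how simple the hOCR syntax is, the sketch below parses a `title` attribute like the one in the sample into a box and a confidence value. The `(x, y, width, height)` output layout and the function name `parse_hocr_title` are assumptions made for illustration; the actual canonical schema should be checked before reusing this.

```python
import re

# hOCR stores properties in the title attribute, e.g.
# "bbox 851 1642 933 1682;x_wconf 28"
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")
WCONF_RE = re.compile(r"x_wconf\s+(\d+)")


def parse_hocr_title(title):
    """Return ((x, y, width, height), confidence) from an hOCR title string.

    Either element is None when the corresponding property is missing.
    The (x, y, width, height) layout is an assumption for this sketch.
    """
    coords = None
    bbox_match = BBOX_RE.search(title)
    if bbox_match:
        x0, y0, x1, y1 = (int(v) for v in bbox_match.groups())
        coords = (x0, y0, x1 - x0, y1 - y0)

    conf_match = WCONF_RE.search(title)
    confidence = int(conf_match.group(1)) if conf_match else None
    return coords, confidence
```

For example, `parse_hocr_title("bbox 851 1642 933 1682;x_wconf 28")` gives a box starting at (851, 1642) with width 82 and height 40, and a confidence of 28.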
In case they help, a few links which may be useful:
First implementation on samples
Full importer test on ONB complete data