FoLiA-2text. How to handle <t-str> and <t-hbr/> correctly. Is it even possible?
#56
Comments
This is a good point indeed. Perhaps this has been an oversight when we handled proycon/folia#88. I agree that the behaviour is not entirely intuitive here now. The spacing and newlines between the two t-strings are not stripped (because they are neither initial nor trailing space). It's also essential that we allow spacing there because we'll often see things like But in case of a direct newline and spacing after |
As for |
OK, I discovered that libxml2's output formatting is sensitive to spaces inside text nodes. To demonstrate this, I added a test to foliatests, text_test18() <t>
<t-str xml:id="text.p.1.t-str.1">
<t-style>deel<feat class="some" subset="things"/><t-hbr/></t-style>
</t-str>
<t-str xml:id="text.p.1.t-str.2">
<t-style>woord<feat class="other" subset="things"/></t-style>
</t-str>
<t-str> extra</t-str>
</t> and <t><t-str xml:id="text.p.1.t-str.1"><t-style>deel<feat class="some" subset="things"/><t-hbr/></t-style></t-str><t-str xml:id="text.p.1.t-str.2"><t-style>woord<feat class="other" subset="things"/></t-style></t-str> <t-str>extra</t-str></t> The only difference is that in the first case the space is part of the text " extra", and in the second case it is a separate XmlText element. I can probably use this in FoLiA-abby to "do the right thing". I wonder whether FoLiA-PY is also able to get this result. |
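To make the difference between the two serialisations concrete, here is a minimal sketch that lists the text nodes of both variants. It uses Python's standard `xml.etree.ElementTree` (not libxml2, and not libfolia or FoLiA-PY), and the XML is a simplified version of the test case with the `t-style` and `feat` elements left out; the helper name `text_nodes` is made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Simplified variants of the two inputs above (t-style and feat omitted).
INDENTED = """<t>
  <t-str>deel<t-hbr/></t-str>
  <t-str>woord</t-str>
  <t-str> extra</t-str>
</t>"""

COMPACT = ("<t><t-str>deel<t-hbr/></t-str><t-str>woord</t-str>"
           " <t-str>extra</t-str></t>")


def text_nodes(xml):
    """Return every text node (element text and tails) in document order."""
    nodes = []

    def walk(element):
        if element.text is not None:
            nodes.append(element.text)
        for child in element:
            walk(child)
            if child.tail is not None:
                nodes.append(child.tail)

    walk(ET.fromstring(xml))
    return nodes


indented_nodes = text_nodes(INDENTED)
compact_nodes = text_nodes(COMPACT)

# In the indented variant the space lives inside the text " extra";
# in the compact variant it is a text node of its own between two <t-str>.
print([n for n in indented_nodes if n.strip()])
print(compact_nodes)
```

The indented variant additionally carries whitespace-only nodes that contain newlines (the indentation), which is exactly the material a text extractor has to decide to keep or drop.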
Hopefully not off (otherwise please delete...): |
Apart from all the text representation issues, there is a more fundamental one concerning FoLiA-abby. For this 'textview', items like <text xml:id="text">
<p xml:id="text.p1">
<t>Da in diesem Kapitel hauptsächlich eine Erscheinung der Kaiserzeit behandelt werden soll,</t>
<metric class="first_char_top" value="3316"/>
<metric class="first_char_left" value="379"/>
<metric class="first_char_right" value="406"/>
<metric class="first_char_bottom" value="3358"/>
<metric class="last_char_top" value="3351"/>
<metric class="last_char_left" value="905"/>
<metric class="last_char_right" value="911"/>
<metric class="last_char_bottom" value="3357"/>
</p>
</text> Over this: <text xml:id="text">
<p xml:id="text.p1">
<t>
<t-str xml:id="text.p2.t-str.1">
<t-style><feat class="Arial" subset="font_family"/><feat class="6." subset="font_size"/>Da in diesem Kapitel hauptsächlich eine Erschei<t-hbr/>
</t-style>
</t-str><t-str xml:id="text.p2.t-str.2">
<t-style><feat class="Arial" subset="font_family"/><feat class="6." subset="font_size"/>nung der Kaiserzeit behandelt werden soll,</t-style>
</t-str>
</t>
<metric class="first_char_top" value="3316"/>
<metric class="first_char_left" value="379"/>
<metric class="first_char_right" value="406"/>
<metric class="first_char_bottom" value="3358"/>
<metric class="last_char_top" value="3351"/>
<metric class="last_char_left" value="905"/>
<metric class="last_char_right" value="911"/>
<metric class="last_char_bottom" value="3357"/>
</p>
</text> Our challenge is to accommodate both. @proycon what do you think? |
I'm not sure if we need more options and flavours. I agree that the challenge is to accommodate both in a way. Frog, ucto and other FoLiA tools should be as happy with the first fragment as with the second and be able to process both. From the perspective of Frog and ucto, the more convoluted example is functionally equivalent to the simpler one. Of course you can always add extra text layers, but I'm not sure whether that adds more to the confusion than it alleviates. I'm more concerned about what you addressed earlier in your text_test18 case; that really is something I need to dive into and we need to get straight. |
I think that text_test18 issue relates to what we did in proycon/folia#92 and earlier in proycon/folia#88 |
I agree that we should come up with a solution that is as minimal and simple as possible. But supporting both 'worlds' can be cumbersome, I fear. An original text of:
can be presented in a lot of ways in FoLiA, but the preferred text representation for ucto/frog
So the task is, whatever FoLiA representation is used, to get the above 'basic' form. |
in proycon/folia#88 @pirolen says:
This is intentional. The idea is that hyphenated words are to be 'corrected' into their 'normal' form
into
I agree that we lose some information then. The simplest solution is, imho, to introduce a 'soft-hyphen' there (unicode \u00ad), see https://en.wikipedia.org/wiki/Soft_hyphen Note that it is also possible to add a text-string to the That might also be helpful, but specific for a single use-case. A more generic solution is desirable. |
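As an illustration of the soft-hyphen idea, here is a minimal sketch in plain Python. It is not what libfolia or FoLiA-abby does; the function names `join_hyphenated` and `strip_soft_hyphens` are hypothetical. It only shows that a U+00AD between the two halves records the break position while the 'normal' form stays trivially recoverable.

```python
SOFT_HYPHEN = "\u00ad"  # Unicode SOFT HYPHEN, normally invisible when rendered


def join_hyphenated(first, second, keep_break=True):
    """Join two halves of a word that was split across a line break.

    With keep_break=True a soft hyphen marks where the break was,
    so the position is not lost.
    """
    return first + (SOFT_HYPHEN if keep_break else "") + second


def strip_soft_hyphens(text):
    """Return the 'normal' form, e.g. as input for a tokenizer."""
    return text.replace(SOFT_HYPHEN, "")


marked = join_hyphenated("Erschei", "nung")
plain = strip_soft_hyphens(marked)
print(repr(marked), repr(plain))
```

Whether downstream tools treat U+00AD as part of a token is a separate question that this sketch does not answer.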
I see, cool. For a bit more context: In my scenario, both t-hbr and br are needed for provenance reasons, so that one can maintain line break information from the original OCR-ed documents. It might be that (in my use case) there is some similarity in the way I'd like to make use of the breaks resp. the font style information: both need to be kept and reassigned after using ucto + FLAT... |
More details: Seems like the converter puts each line into a separate t-str:
and then each t-str boundary is interpreted by ucto as a token boundary:
|
That is not a problem with ucto as such. The indentation between 'der' and 'jenigen' is kept. |
Sure, just meant to provide some (hopefully relevant) details, as @proycon was asking:
|
Ok, the fact that they are in separate A correct input for your desired ucto output (in the FoLiA v2.5 situation) would be: <t-str xml:id="FA-b1_3_1_mwtext_vorbemerk_pp61_67_005.text.div1.p2.t-str.2">
<t-style><feat class="Times New Roman" subset="font_family"/><feat class="15." subset="font_size"/><feat class="{3C19F4A8-2234-4EE8-9373-EBFA03C5A2A4}" subset="font_style"/>wertlos halten. Es entspricht einerseits nicht den Erwartungen der<t-hbr/></t-style></t-str><t-str xml:id="FA-b1_3_1_mwtext_vorbemerk_pp61_67_005.text.div1.p2.t-str.3"><t-style><feat class="Times New Roman" subset="font_family"/><feat class="15." subset="font_size"/><feat class="{3C19F4A8-2234-4EE8-9373-EBFA03C5A2A4}" subset="font_style"/>jenigen, welche in betreff der Lage der Landarbeiter nur solche</t-style>
</t-str> It might be a bit tricky to get the proper XML serialisation; I find it hard to predict how it will turn out. In such cases we may want to go for a normalised serialisation that simply keeps the entire element on one line (no indentation and newlines), which makes everything a lot easier. |
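A one-line serialisation of a single element could be sketched as follows. This uses Python's standard `xml.etree.ElementTree` rather than libxml2, and it rests on an assumed heuristic: whitespace-only text that contains a newline is treated as indentation and dropped, while any other whitespace (such as a single space between two `t-str` elements) is kept. That heuristic is an assumption for illustration, not the rule that libfolia or FoLiA-abby applies, and the helper names are made up.

```python
import xml.etree.ElementTree as ET


def is_indentation(text):
    """Assumed heuristic: whitespace-only text with a newline is indentation."""
    return text is not None and text.strip() == "" and "\n" in text


def flatten(element):
    """Remove indentation-like whitespace from element and its descendants."""
    if is_indentation(element.text):
        element.text = None
    for child in element:
        flatten(child)
        if is_indentation(child.tail):
            child.tail = None
    return element


def one_line(xml):
    """Serialise the element without newlines or indentation."""
    return ET.tostring(flatten(ET.fromstring(xml)), encoding="unicode")


sample = """<t>
  <t-str>deel<t-hbr/></t-str>
  <t-str>woord</t-str> <t-str>extra</t-str>
</t>"""

flat = one_line(sample)
print(flat)
```

Applying this only to the `t` element, and leaving the rest of the document indented, avoids having to write the whole FoLiA file as one flat line.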
Well, as pointed out 22 days ago, I was able to produce such a serialization, but it is indeed trickery. It might be sufficient though. |
So after a lot of work by @proycon and me on FoLiA definitions and some additions, I finally got a version of libfolia and foliautils where FoLiA-abby seems to do 'the right thing'. I hope we finally reached a satisfying result. |
Awesome, thanks to both of you very much for this grand effort! I am happy to test it later today. |
Is it enough if I update a dev lamachine instance with --only languagemachines-basic and --only languagemachines-python? The full update somehow fails. |
@proycon should be able to help you. I am no LaMachine expert at all |
Yes, that should be enough |
If I convert the FoLiA-abby output with folia2html, there is whitespace between a word and its (adjacent) sub-/superscripted item. |
The extra whitespace seems to emerge also with the other typographic style markups, e.g. italic. There is a double space before 'lokalen' (see screenshot below):
in html:
|
OK, forgot to un-comment some code that removed spurious spaces before a style element. |
Yes, the extra spaces disappeared, great! But spotted space lacking, e.g. here after the italics of 'Vgl. auch': In:
Out:
|
But maybe one has to account for t-hspace in the css for the folia2html converter? |
Well, a |
There seems to be a t-hbr issue in relation to ucto + FLAT. I ran ucto on a file and tried uploading it to FLAT.
Just in case:
Out:
|
That has been implemented in the latest version of foliatools, t-hspace renders as a simple space (and you can override it in your custom css if you want something more specific). As to the foliavalidator issue, I see you're doing deep validation, does shallow validation pass? And are you running the FLAT from the latest development LaMachine (because we didn't release the latest changes to FLAT and foliadocserve yet)? |
Seems to me that it does not render as a space.
(Just because I thought the deep validation is the safest, but probably there is much more to it...) If I simply run foliavalidator the output is:
Ah, that's right, apologies, forgot. I am not using LaMachine for FLAT. Will test the ucto input to it later then, on the new release. FA-mittelalt_bibkat_sample_002.png.folia.xml.txt |
observation:
Ucto runs smoothly on this file. folia2txt too, but gives some more empty lines compared to FoLiA-2text (but otherwise no differences) running foliavalidator -d on the Ucto output gives:
the missing SYMBOL seems an oversight in the set definition. |
I added SYMBOL (and EMOTICON and PICTOGRAM) to the setdefinitions, Deep validation is happy now. |
All seems fine so far in my current test scenarios. |
Sorry, also revisiting this:
So this is the Abbyy input:
and the FoLiA-abby output:
And thus the t-hbr is still a boundary between "vocabu" and "lorum".
Am I overlooking something? :-( |
This seems a bug/oversight I will look into it |
This is a more fundamental problem. The FoLiA more or less reflects the nature of the input. There are several escapes possible.
Which road would be best? I would opt for the second one, I guess. As a side-note, the '¬' is output using the --keephyphens option, so maybe that is usable? |
The --addbreaks option preserves the newline information in the original data, so luckily that is taken care of already. The space issue for multiple t-style elements was also solved by putting them on one line (i.e., option 2 -- plus introduced the t-hspace, if needed). |
I think this is fixed now. The problem was that I didn't expect a space to be in a different font style than the characters before it. |
I implemented a solution. For now ONLY when the '¬' is present. |
Awesome, thanks for both of the changes! |
I gave it a shot. |
The output looks very good, thank you! Will keep using/testing. |
Closing this issue, as it is a long messy thread of issues, most of them solved. |
Given the following FoLiA (which is a simplified outcome of FoLiA-abby, with all features and metrics removed):
The output from FoLiA-2text (and also its counterpart folia2txt) is:
There are 2 issues here, which are closely related.
<t-hbr/>
is NOT reflected in the output at all
In fact this is maybe just one problem. A small modification of the input will fix this:
so appending the second directly after the first.
This gives the correct:
Da in diesem Kapitel hauptsächlich eine Erscheinung der Kaiserzeit behandelt werden soll,
This behavior might be surprising for a naive user and, as far as I know, this kind of formatting can only be produced correctly by hand in an editor.
At the moment, I don't see a way to get this result using the libxml2 API (but I may stand corrected)
NB: I know we could output the whole FoLiA in one flat line, but this is hardly desirable.
Comments welcome