Problems with leading/trailing whitespace in text content #88
Comments
I agree. I think it is far better to normalize these. Thanks! |
Dear Maarten,
I was just about to point out a whitespace issue when using ucto — not sure, if fully related.
There are whitespace insertions and deletions.
Where shall I report this?
Thanks & cheers,
Piroska
… On Dec 8, 2020, at 2:15 PM, Maarten van Gompel ***@***.***> wrote:
We had an extensive earlier discussion on this in #34, but an issue popped up.
foliatextcontent produces FoLiA like the following:
<p xml:id="FH-OllevierGeets-1.tif.text.par_1_21">
  <t>
    <t-str id="FH-OllevierGeets-1.tif.text.par_1_21.word_1_233">INTRODUCTION</t-str>
  </t>
  <t class="OCR">
    <t-str id="FH-OllevierGeets-1.tif.text.par_1_21.word_1_233">INTRODUCTION</t-str>
  </t>
  <str xml:id="FH-OllevierGeets-1.tif.text.par_1_21.word_1_233">
    <t offset="0">INTRODUCTION</t>
    <t offset="0" class="OCR">INTRODUCTION</t>
    <relation format="text/hocr+xml" xlink:href="OllevierGeets-1.hocr" xlink:type="simple">
      <xref id="word_1_233" type="str"/>
    </relation>
  </str>
</p>
folialint stumbles on this with a text consistency problem:
ticcl_output/OllevierGeets.ticcl.folia.xml failed: Unresolvable text: Text for str(ID=FH-OllevierGeets-1.tif.text.par_1_21.word_1_233, textclass='current'), has incorrect offset 0
original msg=Unresolvable text: Reference (ID FH-OllevierGeets-1.tif.text.par_1_21,class='current') found, but no text match at offset=0 Expected 'INTRODUCTION' but got '
INT'
Because of the newline and indentation, the offset is considered wrong, as the text is assumed to be "\n\s\s\s\s\s\s\s\s\INTRODUCTION".
foliavalidator stumbles over something identical but later on (different order of evaluation perhaps?):
TEXT VALIDATION ERROR: Text for String, ID FH-OllevierGeets-4.tif.text.par_1_36.word_1_708, textclass OCR, has incorrect offset 0 or invalid reference
VALIDATION ERROR on full parse by library (stage 2/3), in OllevierGeets.ticcl.folia.xml
UnresolvableTextContent: Reference (ID FH-OllevierGeets-4.tif.text.par_1_36, class=OCR) found but no text match at specified offset (0)! Expected 'DISCUSSION', got '
D'
The offsets do not do any kind of space normalization by default, as addressed in #34, a text like:
<s>
  <t>This is
         a sentence</t>
</s>
This really means "This is\n\s\s\s\s\s\s\s\s\sa sentence" and not "This is a sentence".
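To make the point concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree (not foliapy or libfolia, whose behaviour is not shown here). It only illustrates that a generic XML parser hands back the text node literally, newline and indentation included.

```python
import xml.etree.ElementTree as ET

# The sentence fragment from above; the continuation line is indented by 9 spaces.
fragment = "<s>\n  <t>This is\n" + " " * 9 + "a sentence</t>\n</s>"

t = ET.fromstring(fragment).find("t")
literal = t.text  # the raw text node, with no whitespace normalization applied

print(repr(literal))
```

Any offset computed against this string counts the newline and every indentation space as ordinary characters.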
But I think we should be able to strip leading and trailing spaces from the text as a whole. I think the fragment below should be semantically identical to the first fragment. The fact that it turned into the fragment above is probably because of standard XML prettification algorithms.
<p xml:id="FH-OllevierGeets-1.tif.text.par_1_21">
  <t><t-str id="FH-OllevierGeets-1.tif.text.par_1_21.word_1_233">INTRODUCTION</t-str></t>
  <t class="OCR"><t-str id="FH-OllevierGeets-1.tif.text.par_1_21.word_1_233">INTRODUCTION</t-str></t>
  <str xml:id="FH-OllevierGeets-1.tif.text.par_1_21.word_1_233">
    <t offset="0">INTRODUCTION</t>
    <t offset="0" class="OCR">INTRODUCTION</t>
    <relation format="text/hocr+xml" xlink:href="OllevierGeets-1.hocr" xlink:type="simple">
      <xref id="word_1_233" type="str"/>
    </relation>
  </str>
</p>
Just like we don't allow empty texts, I think we can probably strip leading and trailing spaces (=emptiness) when doing text validation and offset computation (this does not affect any intermediate spaces, also not in multiline content!).
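The proposed rule can be sketched roughly as follows. This is only an illustration of the idea, not the actual check in foliapy or libfolia; the function name text_matches_at_offset is hypothetical.

```python
def text_matches_at_offset(parent_text: str, sub_text: str, offset: int) -> bool:
    """Check a text reference after stripping only leading/trailing whitespace.

    Whitespace in the middle of either string is left untouched, so
    multiline content still has to match character for character.
    """
    parent = parent_text.strip()
    sub = sub_text.strip()
    return parent.startswith(sub, offset)


# Text of the pretty-printed paragraph: newline plus indentation around the word.
pretty = "\n        INTRODUCTION\n    "

# Multiline content: inner newline and indentation survive the stripping.
multiline = "\n  This is\n   a sentence\n"
```

With this rule the reference to "INTRODUCTION" resolves at offset 0 again, while "a sentence" in the multiline text is still found at offset 11 (counting the inner newline and three spaces), not at offset 8.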
|
@pirolen If you think it's a tokenisation issue then it's best to put it in https://github.com/LanguageMachines/ucto/issues . If you're referring to insertion/deletion corrections in FLAT then best to put it in https://github.com/proycon/flat/issues |
I tried to reproduce this problem, but folialint failed to fail. Are you sure this wasn't already fixed on Nov 17:
Or maybe it is very related? UPDATE: |
I think I tackled this now in libfolia as well, I'll continue by testing it in the PICCL context where the issue emerged. |
I'm afraid our problems with whitespace are not over yet. I take the example @kosloot gave in LanguageMachines/foliautils#56. This output has been formatted this way by libxml2 itself, but this formatting is not compatible with the FoLiA assumptions we held until now: <t>
<t-str xml:id="text.p.1.t-str.1">
<t-style>deel<t-hbr/></t-style>
</t-str>
<t-str xml:id="text.p.1.t-str.2">
<t-style>woord</t-style>
</t-str>
<t-str>extra</t-str>
</t> With the current rules we applied, the text representation that both foliapy and libfolia give is:
Also if we simplify the example to: <t>
<t-style>deel<t-hbr/></t-style>
<t-style>woord</t-style>
<t-str>extra</t-str>
</t> We get that same result. The extra bonus is that as soon as we add a space prior to the word extra, that libxml2 serializes the whole I don't think the text representations are good as they are, with all the indentation, and I think what we're getting now is at odds with how XML sees things. I think what we want in this case is one of two options:
<span>
<span>deel</span>
<span>woord</span>
<span>extra</span>
</span> (see https://download.anaproy.nl/deelwoordextra.html) If we go for option 1, this does beg the question how we would represent a space if we do want it, say for example between woord and extra. I think the solution to that would be: <t>
<t-style>deel<t-hbr/></t-style>
<t-style>woord</t-style> <t-str>extra</t-str>
</t> If we go for option 2, then it begs the question how we would represent the non-spaced scenario, the solution would be: <t>
<t-style>deel<t-hbr/></t-style>
<t-style>woord</t-style><t-str>extra</t-str>
</t> I think we're currently closer to option 1 in our interpretations, but I need to investigate whether option 2 isn't the more natural XML interpretation (after all, it's what HTML does too). Whatever we choose, we have to take into account the fact that we didn't impose this strictness before, and therefore be lenient so as not to break older files, as addressed in issue #92. Of course, the one-line solution avoids all these problems in all cases and is the simplest, but it's apparently not what libxml2 prefers to output (pretty formatting), nor something we can expect users to adhere to: <t><t-style>deel<t-hbr/></t-style><t-style>woord</t-style> <t-str>extra</t-str></t> It would be good if we had a way to normalize our FoLiA documents to force this one-line representation (as an extra tool), because it would be a valuable preprocessing step that can solve issues like proycon/foliatools#29 and make things easier for parsers that can't deal with all these complexities. |
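For comparison, the two interpretations might be sketched like this in Python with the standard xml.etree.ElementTree. The exact wording of the two options is not preserved above, so this rests on an assumption drawn from the follow-up examples: option 1 ignores whitespace-only text that contains a newline (pretty-print indentation) but keeps a plain space on the same line, and option 2 collapses every whitespace run to a single space as HTML does. The function names are made up, and t-hbr gets no representation at all here.

```python
import re
import xml.etree.ElementTree as ET


def text_option1(t: ET.Element) -> str:
    """Assumed option 1: drop whitespace-only pieces that contain a newline."""
    parts = []
    for piece in t.itertext():  # text and tails in document order
        if piece.strip() == "" and "\n" in piece:
            continue  # indentation between elements, not content
        parts.append(piece)
    return "".join(parts).strip()


def text_option2(t: ET.Element) -> str:
    """Assumed option 2: HTML-like, collapse any whitespace run to one space."""
    return re.sub(r"\s+", " ", "".join(t.itertext())).strip()


pretty = ET.fromstring(
    "<t>\n  <t-style>deel<t-hbr/></t-style>\n"
    "  <t-style>woord</t-style>\n  <t-str>extra</t-str>\n</t>"
)
spaced = ET.fromstring(
    "<t>\n  <t-style>deel<t-hbr/></t-style>\n"
    "  <t-style>woord</t-style> <t-str>extra</t-str>\n</t>"
)
glued = ET.fromstring(
    "<t>\n  <t-style>deel<t-hbr/></t-style>\n"
    "  <t-style>woord</t-style><t-str>extra</t-str>\n</t>"
)
```

Under these assumptions the pretty-printed input reads "deelwoordextra" with option 1 and "deel woord extra" with option 2; the same-line variants give "deelwoord extra" (option 1, spaced) and "deel woordextra" (option 2, glued).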
Hmm, it truly is complex. I ponder about the
or
or such? Anyway not just a space after 'deel' I assume, but some representation of the |
Ah yes, possibly, I didn't consider any representation of t-hbr . I don't think we currently represent it even, do we? Let's save that for another issue :) |
Well, it was the source for LanguageMachines/foliautils#56 |
After tokenization with ucto, the t-hbr is gone/turned into a token boundary. In my ideal workflow, the soft break would stay recoverable (and propagatable to FLAT and folia2html), if possible at all. |
A remaining issue, raised by @kosloot, is whether we should actively normalize the more exotic unicode spaces ( see https://en.wikipedia.org/wiki/Whitespace_character#Unicode) to a normal space. This is probably a good idea, but we may need to introduce an explicit |
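A possible shape for such a normalization, as a sketch only: it assumes that "exotic spaces" means the Unicode space separators (general category Zs, e.g. no-break space, em space, ideographic space) and uses Python's standard unicodedata module. The function name normalize_spaces is hypothetical.

```python
import unicodedata


def normalize_spaces(text: str) -> str:
    """Replace every Unicode space separator (category Zs) by U+0020.

    Each character maps to exactly one character, so string length and
    therefore character offsets stay unchanged. Newlines, tabs and the
    zero-width space (category Cf) are left as they are.
    """
    return "".join(
        " " if unicodedata.category(ch) == "Zs" else ch for ch in text
    )


sample = "a\u00a0b\u2003c\u3000d"
normalized = normalize_spaces(sample)
```

Because the replacement is one-for-one, existing offsets into the text would stay valid after normalization.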
Thanks! <t-str xml:id="FA-b1_3_1_mwtext_vorbemerk_pp61_67_005.text.div1.p2.t-str.5">
<t-style><feat subset="font_family" class="Times New Roman"/><feat subset="font_size" class="15."/><feat subset="font_style" class="{3C19F4A8-2234-4EE8-9373-EBFA03C5A2A4}"/>im wirtschaftlichen Interessenkampf gegen die Agrarpartei verwert<t-hbr/></t-style>
</t-str>
<t-str xml:id="FA-b1_3_1_mwtext_vorbemerk_pp61_67_005.text.div1.p2.t-str.6">
<t-style><feat subset="font_family" class="Times New Roman"/><feat subset="font_size" class="15."/><feat subset="font_style" class="{3C19F4A8-2234-4EE8-9373-EBFA03C5A2A4}"/>baren Schauergemälde bieten</t-style>
<t-style><feat subset="font_typeface" class="superscript"/><feat subset="font_family" class="Times New Roman"/><feat subset="font_size" class="15."/><feat subset="font_style" class="{3C19F4A8-2234-4EE8-9373-EBFA03C5A2A4}"/>6</t-style>
<t-style><feat subset="font_family" class="Times New Roman"/><feat subset="font_size" class="15."/><feat subset="font_style" class="{3C19F4A8-2234-4EE8-9373-EBFA03C5A2A4}"/>, oder welche die Agrarverhältnisse</t-style>
</t-str> <t class="OCR">
<t-str xml:id="FA-b1_3_1_mwtext_vorbemerk_pp61_67_005.text.div1.p3.t-str.1">
<t-style><feat subset="font_family" class="Times New Roman"/><feat subset="font_size" class="15."/><feat subset="font_style" class="{3C19F4A8-2234-4EE8-9373-EBFA03C5A2A4}"/>Um nicht gewisse Bemerkungen über die Arbeitsverfassung im</t-style>
</t-str>
<t-str xml:id="FA-b1_3_1_mwtext_vorbemerk_pp61_67_005.text.div1.p3.t-str.2">
<t-style><feat subset="font_family" class="Times New Roman"/><feat subset="font_size" class="15."/><feat subset="font_style" class="{3C19F4A8-2234-4EE8-9373-EBFA03C5A2A4}"/>ganzen</t-style>
<t-style><feat subset="font_typeface" class="superscript"/><feat subset="font_family" class="Times New Roman"/><feat subset="font_size" class="15."/><feat subset="font_style" class="{3C19F4A8-2234-4EE8-9373-EBFA03C5A2A4}"/>1</t-style>
<t-style><feat subset="font_family" class="Times New Roman"/><feat subset="font_size" class="15."/><feat subset="font_style" class="{3C19F4A8-2234-4EE8-9373-EBFA03C5A2A4}"/>) bei jedem einzelnen Bezirk wiederholen zu müssen, habe</t-style>
</t-str>
</t> <t class="OCR">
<t-str xml:id="FA-b1_3_1_mwtext_vorbemerk_pp61_67_005.text.div1.p4.t-str.1">
<t-style><feat class="superscript" subset="font_typeface"/><feat class="Times New Roman" subset="font_family"/><feat class="12." subset="font_size"/><feat class="{6B4F7D42-EA4B-4F65-B62C-458C902232DA}" subset="font_style"/>1</t-style>
<t-style><feat class="Times New Roman" subset="font_family"/><feat class="12." subset="font_size"/><feat class="{6B4F7D42-EA4B-4F65-B62C-458C902232DA}" subset="font_style"/>) Grundlage bleibt nach wie vor in dieser Beziehung die Schrift von v.d. Goltz,</t-style>
</t-str>
</t> |
@pirolen To accomplish that in the new situation, there can not be a newline between the two elements (so they must be on the same line). I think this is generated by FoLiA-abby right? We'll have to make sure it produces proper FoLiA in such cases. |
Yes, the examples come from FoLiA-abby. |
We have to look into this as soon as all text issues have been resolved. At the moment it is a moving target. |
Still I think we are getting into trouble anyway. Original text: <t>
<t-str>item1<t-style><feat class="superscript" subset="font_typeface"/>2</t-style></t-str>
</t> When using ucto, a string will be extracted like this: In an ideal world, extracting text from the FoLiA would re-introduce the superscript. I'm stuck here. |
I see the problem yes. Technically, following all the rules, the text serialisation
Indeed, and in general styles don't transfer to plain text. You'd need a markup language for that (like Markdown). Properly interpreting styles in custom sets can only be done by the user. We certainly don't want text serialisation in FoLiA libraries to even attempt that. |
Superscript and subscript are the t-style classes that would imply a token boundary, the others don't (e.g. italic, bold). |
@pirolen: @proycon To make this more generic: Could we extend the t-style with an attribute like BUT: There is also another issue, text like: |
Would be nice if adding the special symbol around the t-style text element would solve it. The whole phenomenon reminds me a bit of the choice of tags in sequence labeling, where one can use the prefixes I-O, B-I-O, etc. in combination with the applicable tag (like for a named entity), or simply use the name of the tag as the label. Each of the choices implicitly encodes a specific logic for the tools that ingest the labeled data (and for the humans who interpret them). |
More pondering on this: <t>
<t-str>item1<t-style><feat class="superscript" subset="font_typeface" separated="yes"/>2</t-style></t-str><t-str>something</t-str>
</t> (The original text was: What should we do here? I assume there is no need to insert 'hidden' characters here, but rather to implement the str() extraction function so that it does 'the right thing'. This might raise a lot of problems later on. Maybe the clearest solution is to implement the 'separated' attribute, with the semantics of: In this way we do not break any old behavior and don't introduce fuzzy and surprising characters. |
Gut feeling: Would it be feasible to regard/treat sub-/superscripted text as a specific type of punctuation? :-o Semantically it seems related to it (=it aids and directs the reading of the text). But just like soft break, its behavior could be configurable. ? |
Just to prevent confusion: I definitely don't think there should be zero-width spaces in the FoLiA itself. At most the text extraction function could output one where a token boundary must occur and no space is present, but that would have to be an opt-in feature. And as you said, I foresee issues with the offsets then. So I see where you are going with the Fundamentally, the issue we're discussing now is a tokeniser issue rather than a FoLiA representation problem (so I see it as distinct from the original issue in this thread). The question is how the tokeniser decides what to tokenize and what not:
Text content on higher levels is by definition untokenised (so I'm a bit skeptical about adding tokenisation details in there), text content on the word/token level is by definition tokenised. The issue is of course getting from A to B here (which is the task of the tokeniser). I'm following the line of the extra attribute Ko suggested. But I'm trying to think in a generic way if we expand FoLiA for this: we're essentially encoding some extra 'cue' in the FoLiA to help another tool do its job, and such a cue is needed because the information is not present in the FoLiA yet, or is too complexly encoded. This might be useful for other use cases than the one we are considering now. What if we introduce a generic <t>
<t-str>item1<t-style tag="token"><feat class="superscript" subset="font_typeface"/>2</t-style></t-str><t-str>something</t-str>
</t> It's essentially what Ko suggested but stretched to be more generic, it gives some processor-specific flexibility. You can envision tool A setting particular tags, and tool B acting on them. Note: I opened a new issue for this proposal, see below |
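How a consumer might act on such a cue can be sketched as follows. This is purely illustrative: the tag="token" attribute is the proposal from this comment, not an established FoLiA feature, and extract_text with its opt-in boundary parameter is a hypothetical helper built on Python's xml.etree.ElementTree rather than on foliapy or libfolia.

```python
import xml.etree.ElementTree as ET

ZWSP = "\u200b"  # zero-width space, only emitted when explicitly requested


def extract_text(elem: ET.Element, boundary: str = "") -> str:
    """Serialise text; wrap content of tag="token" elements in `boundary`.

    With the default empty boundary the output equals plain concatenation,
    so existing behaviour (and offsets) would not change.
    """
    parts = [elem.text or ""]
    for child in elem:
        inner = extract_text(child, boundary)
        if child.get("tag") == "token" and inner:
            inner = boundary + inner + boundary
        parts.append(inner)
        parts.append(child.tail or "")
    return "".join(parts)


tstr = ET.fromstring(
    '<t-str>item1<t-style tag="token">'
    '<feat class="superscript" subset="font_typeface"/>2</t-style></t-str>'
)

plain = extract_text(tstr)          # default: no marker inserted
marked = extract_text(tstr, ZWSP)   # opt-in: boundary around the superscript
tokens = [p for p in marked.split(ZWSP) if p]
```

The default output stays "item12", so nothing changes unless a tool opts in; the marked variant shifts later characters, which is the offset concern raised above.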