[foliatextcontent] propagate markup information to higher/lower levels #19

Open
proycon opened this issue Nov 11, 2020 · 10 comments
Labels: enhancement (New feature or request)

Comments

@proycon
Owner

proycon commented Nov 11, 2020

If there is markup information in a higher text layer, say on paragraph level, we want to be able to replicate that markup information on lower levels (say sentences or words) if it is not yet available there. We also want the reverse: if there is markup information on lower levels, we want to express it on higher levels as well.
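A minimal sketch of the downward direction, assuming markup is modelled as character-offset spans over the paragraph text. This is plain Python, not FoLiA library code; `Span` and `propagate_down` are hypothetical names used only for illustration.

```python
from typing import List, NamedTuple, Tuple


class Span(NamedTuple):
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    style: str  # e.g. "bold"


def propagate_down(
    par_text: str, par_markup: List[Span], words: List[str]
) -> List[Tuple[str, List[Span]]]:
    """Clip paragraph-level spans to each word, with word-relative offsets."""
    result = []
    cursor = 0
    for word in words:
        # Locate the word in the paragraph text, scanning left to right.
        begin = par_text.index(word, cursor)
        end = begin + len(word)
        cursor = end
        clipped = [
            Span(max(s.start, begin) - begin, min(s.end, end) - begin, s.style)
            for s in par_markup
            if s.start < end and s.end > begin  # span overlaps this word
        ]
        result.append((word, clipped))
    return result


par_text = "This is a good example"
par_markup = [Span(10, 14, "bold")]  # covers the word "good"
word_markup = propagate_down(par_text, par_markup, par_text.split())
```

Here only the fourth word ends up carrying a bold span; all other words get an empty list.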

@proycon proycon added the enhancement New feature or request label Nov 11, 2020
@proycon proycon self-assigned this Nov 11, 2020
@proycon
Owner Author

proycon commented Nov 11, 2020

(relates also to #16)

@proycon
Owner Author

proycon commented Dec 4, 2020

Technical note: This does introduce another challenge. Just as we have text consistency and text validation in FoLiA, ensuring that text specified on multiple levels is consistent with each other, this would introduce a similar concept of markup consistency and markup validation.
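What such a markup consistency check could look like, as a rough sketch: it compares which non-whitespace characters carry which style on two levels. The span representation and the function names `styled_chars` and `markup_consistent` are assumptions made for this example, not part of FoLiA or its libraries.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end-exclusive, style)


def styled_chars(spans: List[Span], offset: int = 0) -> Set[Tuple[int, str]]:
    """Expand spans into (absolute character offset, style) pairs."""
    return {(i + offset, style) for start, end, style in spans for i in range(start, end)}


def markup_consistent(
    par_text: str, par_spans: List[Span], words: List[Tuple[str, List[Span]]]
) -> bool:
    """True if word-level markup styles the same characters as paragraph-level markup."""
    lower: Set[Tuple[int, str]] = set()
    cursor = 0
    for text, spans in words:
        # A ValueError here would mean the *text* itself is inconsistent.
        begin = par_text.index(text, cursor)
        cursor = begin + len(text)
        lower |= styled_chars(spans, begin)
    # Whitespace between tokens has no word-level counterpart, so ignore it.
    higher = {(i, s) for i, s in styled_chars(par_spans) if not par_text[i].isspace()}
    return higher == lower


par_text = "This is a good example"
par_spans = [(10, 14, "bold")]
ok_words = [("This", []), ("is", []), ("a", []), ("good", [(0, 4, "bold")]), ("example", [])]
bad_words = [("This", []), ("is", []), ("a", []), ("good", [(0, 4, "italic")]), ("example", [])]
```

With these inputs, `markup_consistent(par_text, par_spans, ok_words)` holds, while the `bad_words` variant fails because the word level says italic where the paragraph level says bold.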

@kosloot
Collaborator

kosloot commented Dec 15, 2020

This seems like something you would want on some occasions, but not always. E.g. for a tokenizer, you would want the 'un-styled' strings. So maybe we must introduce a 'formatted' attribute or similar in the <t> nodes?

<t>This is a good example</t>

vs.

<t formatted="1">This is a <t-style class="bold">good</t-style> example</t>

On second thought: this is not really a good idea, it would break too many things. Still, we need both worlds.
So a more down-to-earth solution is to add text() variants that maintain the structure, while keeping the current text() and str() functions. The return value would probably be a TextContent.

@proycon
Owner Author

proycon commented Dec 16, 2020

I don't think there's any need for such an attribute and don't really see what problem it would solve. Calling text() on a TextContent element returns the plain text (regardless of any markup within), and an XPath text() does the same. We definitely shouldn't change that.

Getting all the markup requires calling textcontent() and then diving deeper into that, it's a bit more complex by definition but that can't be helped I think.
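To illustrate the difference between plain text and "diving deeper", here is a sketch using only Python's standard `xml.etree.ElementTree` on a namespace-less fragment. It does not use the FoLiA library, real FoLiA documents are namespaced, and the helper `runs` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

fragment = '<t>This is a <t-style class="bold">good</t-style> example</t>'
t = ET.fromstring(fragment)

# Plain text: markup is simply ignored.
plain = "".join(t.itertext())


def runs(elem, styles=()):
    """Yield (text, styles) pairs, where styles lists the enclosing t-style classes."""
    if elem.tag == "t-style":
        styles = styles + (elem.get("class"),)
    if elem.text:
        yield elem.text, styles
    for child in elem:
        yield from runs(child, styles)
        if child.tail:
            # Text after a child belongs to the current element's styles.
            yield child.tail, styles


styled = list(runs(t))
```

`plain` is the unstyled string a tokenizer would want, while `styled` keeps the same characters split into runs that each know their styles.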

@kosloot
Collaborator

kosloot commented Dec 16, 2020

Yes, in fact that was my conclusion too. The new function I mentioned should do the deeper diving, and return a TextContent that holds the (combined) styles of the deeper elements.
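One possible shape for combining the styles of deeper elements, sketched in plain Python under the assumption that each word carries word-relative style spans. `propagate_up` is a hypothetical name, and the result is a list of paragraph-level spans rather than an actual TextContent object.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end-exclusive, style)


def propagate_up(par_text: str, words: List[Tuple[str, List[Span]]]) -> List[Span]:
    """Lift word-level spans to paragraph offsets and merge neighbours of one style."""
    lifted: List[Span] = []
    cursor = 0
    for text, spans in words:
        begin = par_text.index(text, cursor)
        cursor = begin + len(text)
        lifted.extend((begin + s, begin + e, style) for s, e, style in spans)

    # Group by style, then by position, so mergeable spans become adjacent.
    lifted.sort(key=lambda sp: (sp[2], sp[0]))
    merged: List[Span] = []
    for start, end, style in lifted:
        if merged:
            p_start, p_end, p_style = merged[-1]
            # Merge when only whitespace (or nothing) separates the two spans.
            if p_style == style and par_text[p_end:start].strip() == "":
                merged[-1] = (p_start, max(p_end, end), style)
                continue
        merged.append((start, end, style))
    return sorted(merged)


par_text = "This is a very good example"
words = [
    ("This", []), ("is", []), ("a", []),
    ("very", [(0, 4, "bold")]), ("good", [(0, 4, "bold")]),
    ("example", []),
]
par_spans = propagate_up(par_text, words)
```

Merging across the whitespace between "very" and "good" is a design choice: it yields one bold span over "very good" instead of two spans with an unstyled space between them.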

@pirolen

pirolen commented Feb 1, 2021

[Not sure if this is the right place to ask]
Would the font styling info propagation also work in Ucto?

My FLAT use case is to tokenize the text first in order to enable the (fully manual) word-level spelling error corrections.
I would be happy if this would be achievable, even if decoupled from viewing/accepting TICCL's suggestions (which one could visualize in parallel in an editor, or a separate FLAT window, for the moment).

@pirolen

pirolen commented Feb 1, 2021

[Related]
Currently I don't seem to be able to call textcontent() on a paragraph's parts using foliapy :-(
Also, only the 1st part is accessed. This is what I did:

import folia.main as folia

# doc is an already loaded folia.Document
for par in doc.paragraphs():
    for part in par.annotation(folia.Part):
        print(part.text())  # works
        print(type(part.textcontent()))  # raises an error

  File "/home/ubuntu/piro/projects/lamadev/lmdev/lib/python3.6/site-packages/folia/main.py", line 1199, in textcontent
    raise NoSuchText
folia.main.NoSuchText

Specifying the class as ... .textcontent(cls="OCR") does not seem to make a difference.

@pirolen

pirolen commented Feb 23, 2021

Just wondering if one can add some dummy placeholder style-markup annotation to ucto-tokenized folia.xml in FLAT (regardless of the propagation not yet being in place).
Would it be in principle possible, just for the sake of carrying out some test annotation round?

@proycon
Owner Author

proycon commented Mar 3, 2021

> Would the font styling info propagation also work in Ucto?

It would be a separate post-processing step that you need to run after ucto.

> Just wondering if one can add some dummy placeholder style-markup annotation to ucto-tokenized folia.xml in FLAT (regardless of the propagation not yet being in place).

FLAT doesn't really render the style-markup at all currently.

@pirolen

pirolen commented Mar 3, 2021

> It would be a separate post-processing step you need to run after ucto

That would be cool in that way too.

> FLAT doesn't really render the style-markup at all currently.

I should rather have asked: would it be possible to do some post-processing after annotating in FLAT as well, so that the style information can be re-assigned and other tools, such as folia2html, can process it again?


3 participants