added documentation about token hyphenation in the page schema #18
I added some suggestions and questions.
docs/page.schema.json
    },
    "hy": {
      "type": "boolean",
      "description": "Indicates whether the token constitutes the first part of a hyphenated word. When not specified it is assumed to be `false`."
Do we record the hyphen itself as well, or is it implicit, similar to the whitespace? We have some words with multiple hyphenations due to limited space in a table cell. How do I deal with this? Should I record all of them? Accordingly, I suggest writing "the former part before the hyphen (incl. / excl. hyphen)" instead of "first".
That's an interesting case that we haven't had so far. I would mark the first part with `hy=True`, and all the remaining parts with `nf=...`.
What do you think @simon-clematide and @e-maud ?
Anyway, we will have to test carefully the behavior of the rebuilt script in such cases.
docs/page.schema.json
    },
    "nf": {
      "type": "string",
      "description": "It is specified on the second part of a hyphenated word, and contains its normalized (reconstructed) form."
Maybe this is more straightforward: "normalized (dehyphenated)" instead of "normalized (reconstructed)".
In case we allow for multi-hyphenation, change "second" to "latter".
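To make the two-part case concrete, here is a minimal Python sketch of how a consumer might read `hy` and `nf` as the two descriptions above define them. It is not the project's rebuild script. The function name `rebuild_words` and the token text key `tx` are assumptions made for illustration; only `hy` and `nf` come from the schema diff. Because the sketch takes the word from `nf`, it does not depend on whether the hyphen itself is recorded in the first part. Words split more than once are deliberately not handled, since that convention is still open in this thread.

```python
from typing import Any, Dict, List


def rebuild_words(tokens: List[Dict[str, Any]]) -> List[str]:
    """Merge two-part hyphenated tokens into their normalized form.

    Assumes each token stores its surface text under the hypothetical
    key "tx". "hy" defaults to False when absent, as the schema states.
    """
    words: List[str] = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        has_next = i + 1 < len(tokens)
        # A first part (hy=True) followed by a token carrying "nf":
        # emit the normalized form once and consume both tokens.
        if token.get("hy", False) and has_next and "nf" in tokens[i + 1]:
            words.append(tokens[i + 1]["nf"])
            i += 2
        else:
            # Plain token, or a dangling first part with no "nf" partner:
            # fall back to the surface text.
            words.append(token["tx"])
            i += 1
    return words


example = [
    {"tx": "the"},
    {"tx": "docu-", "hy": True},
    {"tx": "mentation", "nf": "documentation"},
    {"tx": "page"},
]
print(rebuild_words(example))
```

A multi-hyphenation case would need an agreed rule for what `nf` holds on each later part before a loop like this could be extended safely.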
closes issue #17