added documentation about token hyphenation in the page schema #18

Merged
merged 2 commits into from
Oct 11, 2019

Conversation

Member

@mromanello mromanello commented Oct 8, 2019

closes issue #17

Collaborator

@aflueckiger aflueckiger left a comment

I added some suggestions and questions.

},
"hy": {
"type": "boolean",
"description": "Indicates whether the token constitutes the first part of a hyphenated word. When not specified it is assumed to be `false`."
Collaborator

Do we record the hyphen itself as well, or is it implicit, similar to the whitespace? We have some words with multiple hyphenations due to limited space in a table cell. How do I deal with this? Should I record all of them? Accordingly, I suggest writing "the former part before the hyphen (incl. / excl. hyphen)" instead of "first".

Member Author

That's an interesting case that we haven't had so far. I would mark the first part with hy=True, and all the remaining parts with nf=....

What do you think @simon-clematide and @e-maud ?

Anyway, we will have to carefully test the behavior of the rebuilt script in such cases.
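
To make the multi-hyphenation proposal concrete, here is a minimal sketch of how such tokens might look and how a rebuild step could collapse them. It is not the project's actual rebuild script: the `tx` field (surface text), the `rebuild` function, and the sample word are hypothetical, and it assumes every continuation part repeats the same full `nf` value.

```python
# Hypothetical tokens for a word split twice across lines.
# "tx" is an assumed field name for the surface text;
# "hy" and "nf" are the two properties discussed in this review.
tokens = [
    {"tx": "Ver-", "hy": True},            # first part, flagged with hy
    {"tx": "samm-", "nf": "Versammlung"},  # continuation part
    {"tx": "lung", "nf": "Versammlung"},   # continuation part
    {"tx": "heute"},                       # ordinary token, hy defaults to False
]


def rebuild(tokens):
    """Return the word list with hyphenated runs replaced by their nf value."""
    words = []
    open_run = False    # True after a token with hy=True has been seen
    emitted_nf = None   # nf value already emitted for the open run
    for tok in tokens:
        if tok.get("hy", False):
            # Start of a hyphenated word: wait for the normalized form.
            open_run = True
            emitted_nf = None
            continue
        nf = tok.get("nf")
        if open_run and nf is not None:
            # Emit the normalized form once, skip further parts of the run.
            if nf != emitted_nf:
                words.append(nf)
                emitted_nf = nf
            continue
        # Ordinary token: close any open run and keep the surface text.
        open_run = False
        emitted_nf = None
        words.append(tok["tx"])
    return words


print(rebuild(tokens))
```

Under these assumptions the three parts collapse into a single word. Whether the hyphen character itself is stored in the token text remains the open question raised above.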

},
"nf": {
"type": "string",
"description": "It is specified on the second part of a hyphenated word, and contains its normalized (reconstructed) form."
Collaborator

Maybe this is more straightforward:
"normalized (dehyphenated)" instead of "normalized (reconstructed)".

In case we allow for multi-hyphenation, change "second" to "latter".
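
As a quick illustration of the two property definitions under review, the following sketch checks a token against the stated types and applies the documented default for `hy`. The helper names `check_token` and `starts_hyphenated_word` are hypothetical, and this is a hand-written check, not a full JSON Schema validation.

```python
def check_token(tok):
    """Return a list of type errors for the hy and nf properties."""
    errors = []
    # "hy" is declared as a boolean in the schema fragment.
    if "hy" in tok and not isinstance(tok["hy"], bool):
        errors.append("hy must be a boolean")
    # "nf" is declared as a string in the schema fragment.
    if "nf" in tok and not isinstance(tok["nf"], str):
        errors.append("nf must be a string")
    return errors


def starts_hyphenated_word(tok):
    """Read hy, assuming False when the property is not specified."""
    return tok.get("hy", False)


print(check_token({"hy": True}))
print(check_token({"hy": "yes", "nf": 3}))
print(starts_hyphenated_word({}))
```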

@mromanello mromanello merged commit 1074ca8 into master Oct 11, 2019
@mromanello mromanello deleted the issue-4/hyphenation branch October 11, 2019 15:27