pre-processing to anticipate empty tokens and spliced tokens #80

Open
ebeshero opened this issue May 10, 2022 · 0 comments

ebeshero commented May 10, 2022

This issue is an attempt to evaluate why the tokenization and normalization process generates empty tokens and spliced tokens in the first place. Can we review the tokenization process up close, checking:

  • what pulldom is doing to serialize the XML as a string in the extract() function
  • what happens when we run just the tokenization portion of the script and watch it carefully
  • what happens when we then run tokenization + normalization together
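
One way to watch the serialization step is to dump the raw pulldom event stream for a tiny sample and look at exactly where the whitespace lands. This is only a sketch, not the project's extract() function: trace_events and the sample string are hypothetical names made up for illustration, and only the standard-library xml.dom.pulldom calls are real.

```python
from xml.dom import pulldom

def trace_events(xml_string):
    """Return (event, text) pairs so the whitespace around each tag is visible.

    Hypothetical helper for inspection only; it does not reproduce extract().
    """
    trace = []
    for event, node in pulldom.parseString(xml_string):
        if event == pulldom.CHARACTERS:
            # Text nodes: keep the data verbatim, including spaces.
            trace.append((event, node.data))
        elif event in (pulldom.START_ELEMENT, pulldom.END_ELEMENT):
            # Element boundaries: record only the tag name.
            trace.append((event, node.tagName))
    return trace

for event, text in trace_events("<p>one <lb/> two</p>"):
    print(event, repr(text))
```

Printing with repr() makes it easy to see that the space before and the space after <lb/> both survive as character data, which is the situation described for the inlineEmpty elements.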

<add> and other elements in the ignore list are perhaps being removed together with the space that follows them. This may cause the preceding and following grams to be fused together into a single token.
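
To make this hypothesis concrete, here is a small sketch of how a spliced token could arise. It assumes the removal behaves like a regex substitution on the serialized string, which may not be how the script actually works; drop_with_trailing_space, drop_only_element, and the sample text are hypothetical.

```python
import re

def drop_with_trailing_space(text, name):
    """Remove <name>...</name> AND the whitespace after it (the suspected behavior)."""
    pattern = rf"<{re.escape(name)}\b[^>]*>.*?</{re.escape(name)}>\s*"
    return re.sub(pattern, "", text, flags=re.DOTALL)

def drop_only_element(text, name):
    """Remove <name>...</name> but leave the surrounding whitespace untouched."""
    pattern = rf"<{re.escape(name)}\b[^>]*>.*?</{re.escape(name)}>"
    return re.sub(pattern, "", text, flags=re.DOTALL)

sample = "monster<add>s</add> fled"  # hypothetical sample, not from the S-GA files

fused = drop_with_trailing_space(sample, "add").split()
separate = drop_only_element(sample, "add").split()
print(fused)     # ['monsterfled']
print(separate)  # ['monster', 'fled']
```

If the real tokens show the first pattern, the trailing-space removal would be the place to look.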

<lb/> and other elements in the inlineEmpty list are possibly being removed in a way that preserves the spaces on both sides of them, leaving two adjacent spaces. The gap between those spaces then gets interpreted as an empty token.
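
The empty-token case can be reproduced in a few lines. This sketch assumes the tokenizer splits on a literal single space, which is a guess about the script rather than a confirmed detail; remove_inline_empty is a hypothetical helper.

```python
import re

def remove_inline_empty(text, name="lb"):
    """Delete a self-closing <name/> tag, leaving the spaces around it in place."""
    return re.sub(rf"<{re.escape(name)}\b[^>]*/>", "", text)

flattened = remove_inline_empty("one <lb/> two")  # two spaces remain between the words

naive_tokens = flattened.split(" ")  # splitting on one space yields an empty string
safe_tokens = flattened.split()      # splitting on runs of whitespace does not

print(naive_tokens)  # ['one', '', 'two']
print(safe_tokens)   # ['one', 'two']
```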

Solutions:

  • I think we may not want to have an ignore list at all (these elements never appear in the output, and I think we have found out the hard way that we need all the markup for S-GA).
  • I think we may also want to intervene to remove the space after <lb/> and other inlineEmpty friends: a normalization step that may need to occur before tokenization.
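
A possible shape for the second solution, as a sketch only: a pre-tokenization pass that deletes the whitespace immediately after each inlineEmpty element while keeping the element itself. The INLINE_EMPTY contents and the function name are assumptions for illustration, not the project's actual list or code.

```python
import re

# Hypothetical stand-in for the script's inlineEmpty list.
INLINE_EMPTY = ("lb", "pb", "milestone")

def strip_space_after_inline_empty(text, names=INLINE_EMPTY):
    """Remove whitespace that directly follows a self-closing inlineEmpty tag.

    Meant to run on the serialized string BEFORE tokenization, so that the
    tag no longer sits between two separate spaces.
    """
    alternatives = "|".join(re.escape(n) for n in names)
    pattern = rf"(<(?:{alternatives})\b[^>]*/>)\s+"
    return re.sub(pattern, r"\1", text)

print(strip_space_after_inline_empty("one <lb/> two"))  # one <lb/>two
```

Because the markup stays in the string, this is compatible with dropping the ignore list and keeping all elements for S-GA; whether the tag should attach to the following or the preceding token is a design choice this sketch does not settle.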
@ebeshero ebeshero self-assigned this May 17, 2022