pre-processing to anticipate empty tokens and spliced tokens #80

ebeshero · 2022-05-10T03:37:51Z

This issue is an attempt to evaluate why the tokenization and normalization process generates empty tokens and spliced tokens in the first place. Can we review the tokenization process up close, checking:

what pulldom is doing to serialize the XML as a string in the extract() function
Try running just the tokenization portion of the script and watch it carefully.
Then run the tokenization + normalization together and compare.
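One way to watch what pulldom hands us before any of our own logic runs is to log its event stream. This is only a rough sketch: `trace_events` is a hypothetical helper, not our `extract()`, and the sample string is made up. It just makes the whitespace around each element visible.

```python
from xml.dom import pulldom

def trace_events(xml_string):
    """Hypothetical helper: list (event, payload) pairs from pulldom."""
    trace = []
    for event, node in pulldom.parseString(xml_string):
        if event == pulldom.CHARACTERS:
            # Text nodes: keep the raw data so spaces can be inspected.
            trace.append((event, node.data))
        elif event in (pulldom.START_ELEMENT, pulldom.END_ELEMENT):
            trace.append((event, node.nodeName))
    return trace

# Made-up sample, not taken from the S-GA files.
sample = "<p>one <add>two</add> three<lb/> four</p>"
for event, payload in trace_events(sample):
    print(event, repr(payload))
```

Printing with `repr()` shows exactly where the spaces sit relative to `<add>` and `<lb/>` before anything is removed.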

Elements in the ignore list, such as <add>, are perhaps being removed together with the space that follows them. This may cause the preceding and following grams to be fused together into a single token.

Elements in the inlineEmpty list, such as <lb/>, are possibly being removed in a way that preserves the spaces on both sides of them, leaving a doubled space that gets interpreted as an empty token.
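Both hypotheses can be reproduced in miniature. The sketch below does not use our actual code: the two removal functions and the split-on-single-space tokenizer are assumptions, chosen only to show how each kind of bad token could arise.

```python
import re

def drop_with_trailing_space(text, name):
    """Hypothetical removal: delete <name>...</name> plus one following space."""
    return re.sub(rf"<{name}>.*?</{name}> ?", "", text)

def drop_empty_element(text, name):
    """Hypothetical removal: delete <name/> but leave the spaces on both sides."""
    return text.replace(f"<{name}/>", "")

def tokenize(text):
    """Naive tokenizer, assumed here: split on every single space."""
    return text.split(" ")

# Ignored element swallowed along with its trailing space: words fuse.
spliced = tokenize(drop_with_trailing_space("night<add>s</add> fell", "add"))
# Empty element dropped but both neighbouring spaces kept: empty token.
empty = tokenize(drop_empty_element("cold <lb/> wind", "lb"))

print(spliced)  # ['nightfell']
print(empty)    # ['cold', '', 'wind']
```

If the real script behaves differently at either step, running just that step on inputs like these should show where the behaviour diverges.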

Solutions:

I think we may not want to have an ignore list at all (these elements never appear in the output, and I think we have found out the hard way that we need all the markup for S-GA).
I think we may also want to intervene to remove the space after <lb/> and other inlineEmpty friends: a normalization step that may need to occur before tokenization.
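A possible shape for that pre-tokenization step, as a sketch only: `INLINE_EMPTY` below is a stand-in for our real inlineEmpty list, and `absorb_space_after_empty` is a hypothetical name. It removes whitespace directly after a self-closing element from that list and leaves everything else alone.

```python
import re

# Stand-in for the script's inlineEmpty list (assumed names).
INLINE_EMPTY = ["lb", "pb"]

def absorb_space_after_empty(text, names=INLINE_EMPTY):
    """Hypothetical normalization: drop whitespace right after <name .../>."""
    alternation = "|".join(re.escape(n) for n in names)
    pattern = re.compile(r"(<(?:%s)\b[^>]*/>)\s+" % alternation)
    # Keep the tag itself (group 1) and discard the whitespace after it.
    return pattern.sub(r"\1", text)

print(absorb_space_after_empty("cold <lb/> wind"))        # cold <lb/>wind
print(absorb_space_after_empty('end <lb n="2"/>  next'))  # end <lb n="2"/>next
```

Whether the space before the element should stay, as it does here, depends on how the tokenizer treats the tag itself, so this would need testing against the real tokenization regex.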

ebeshero self-assigned this May 17, 2022
