This issue is an attempt to evaluate why the tokenization and normalization process generates empty tokens and spliced tokens in the first place. Can we review the tokenization process up close, checking:
Try running just the tokenization portion of the script and watch it carefully.
Then run the tokenization + normalization together.
Elements in the ignore list, such as <add>, are perhaps being removed together with the space that follows them. This may cause the preceding and following grams to be fused together into a single token.
Elements in the inlineEmpty list, such as <lb/>, are possibly being removed in a way that preserves the spaces on both sides of them, leaving an extra space that gets interpreted as an empty token.
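To make the two hypotheses concrete, here is a minimal Python sketch that reproduces both symptoms on toy strings. It is not the project's actual code: the function names, the regular expressions, and the assumption that tokens are split on single spaces are all hypothetical, chosen only to show how each kind of removal could produce the behavior described.

```python
import re

def strip_with_trailing_space(text, tag):
    # Hypothetical "ignore" behavior: delete the open and close tags of `tag`
    # along with one whitespace character that follows each of them.
    return re.sub(rf"</?{tag}\b[^>]*>\s?", "", text)

def strip_preserving_spaces(text, tag):
    # Hypothetical "inlineEmpty" behavior: delete the self-closing tag only,
    # leaving the spaces on both sides of it in place.
    return re.sub(rf"<{tag}\b[^>]*/>", "", text)

def tokenize(text):
    # Assumed tokenizer: split on single spaces, so a double space
    # yields an empty string between two tokens.
    return text.split(" ")

fused = tokenize(strip_with_trailing_space("was <add>not</add> here", "add"))
empty = tokenize(strip_preserving_spaces("cold <lb/> wind", "lb"))

print(fused)  # ['was', 'nothere']   (spliced token)
print(empty)  # ['cold', '', 'wind'] (empty token)
```

If the real script shows the same pattern when the tokenization portion is run on its own, that would support the two explanations above.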
Solutions:
I think we may not want to have an ignore list at all (these elements never appear in the output, and I think we have found out the hard way that we need all the markup for S-GA).
I think we may also want to intervene to remove the space after <lb/> and its other inlineEmpty friends: a normalization step that may need to happen before tokenization.
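A possible shape for that pre-tokenization step, sketched in Python under stated assumptions: the `INLINE_EMPTY` list here is a hypothetical stand-in for the script's real inlineEmpty list, `drop_space_after_inline_empty` is an invented helper name, and the tokenizer is again assumed to split on single spaces.

```python
import re

# Hypothetical stand-in; the real inlineEmpty list lives in the script.
INLINE_EMPTY = ["lb"]

def drop_space_after_inline_empty(text, names=INLINE_EMPTY):
    # Remove any whitespace that immediately follows a self-closing
    # element whose name is in `names`, keeping the element itself.
    alternatives = "|".join(re.escape(name) for name in names)
    pattern = rf"(<(?:{alternatives})\b[^>]*/>)\s+"
    return re.sub(pattern, r"\1", text)

normalized = drop_space_after_inline_empty("cold <lb/> wind")
print(normalized)  # cold <lb/>wind

# Even if <lb/> is later stripped, only one space remains between the words,
# so splitting on single spaces no longer produces an empty token.
tokens = re.sub(r"<lb\b[^>]*/>", "", normalized).split(" ")
print(tokens)  # ['cold', 'wind']
```

Running this before tokenization means the space before the element is the only separator left, whichever way the element is handled afterwards.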