Tokenizer fixes and span_tokenize method #20

Open
wants to merge 267 commits into
base: master

chekunkov
Contributor

@chekunkov chekunkov commented Jun 7, 2014

The tokenizer from #15 had some issues, for example it did not split the dot at the end of a sentence into a separate token:

40006,40007c40017
< community
< .
---
> community.
41148,41149c41158
< Reserved
< .
---
> Reserved.

Now this issue should be fixed.

I've also refactored the code and added a span_tokenize method (@kmike, I remember you said it would be nice to have this method).
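To illustrate the idea, here is a minimal sketch of what a span-based tokenizer can look like. It is not the code from this PR: the regular expression and the function bodies are assumptions made for the example, and only the name span_tokenize comes from the description above. The sketch returns character offsets for each token and derives the token strings from those offsets, so a trailing dot comes out as its own token.

```python
import re

# Hypothetical pattern (not taken from webstruct): a run of word
# characters, or any single character that is neither a word
# character nor whitespace, so "." becomes a token of its own.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def span_tokenize(text):
    """Return a list of (start, end) character offsets, one per token."""
    return [match.span() for match in _TOKEN_RE.finditer(text)]


def tokenize(text):
    """Return the token strings, built from the spans."""
    return [text[start:end] for start, end in span_tokenize(text)]


print(tokenize("All Rights Reserved."))
print(span_tokenize("community."))
```

Building tokenize on top of span_tokenize keeps the two in sync: every token can be mapped back to its exact position in the source text. A pattern this simple also splits things like decimal numbers and abbreviations at the dot, which a real tokenizer would probably want to handle separately.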

Performance did not suffer:

X, y = webstruct.HtmlTokenizer().tokenize(trees)

CPU times: user 3.42 s, sys: 32 ms, total: 3.46 s
Wall time: 3.45 s

kmike and others added 26 commits May 21, 2014 14:59
Dropping it gives a nice speedup because computations are now in Cython.
…simplify code and make it faster.

If needed, it can be implemented as a global feature.
@kmike
Member

kmike commented Nov 25, 2016

@chekunkov do you by chance recall why this PR wasn't merged?

@chekunkov
Contributor Author

@kmike nope, have no idea why.

5 participants