refactoring to work with the annotated plain text #18
… various industries.
…es_from_files` to `load_trees`.
webstruct/feature_extraction.py
@@ -3,7 +3,7 @@
 :mod:`webstruct.feature_extraction` contains classes that help
 with:

-- converting HTML pages into lists of feature dicts and
+- converting annnotated data into lists of feature dicts and
The data is not necessarily annotated: HtmlLoader is used to load raw data.
My main concern is the Token class and the TextTokenizer. Creating Token instances looks like total overkill: why would anyone need to wrap a text token in a Token instance and keep a reference to all the other tokens in the text there? Also, there is already a text_tokenizers module, so this adds to the confusion.
HtmlFeatureExtractor to FeatureExtractor
Sometimes the training data may be plain text. Instead of using python-crfsuite or any other CRF package, I still prefer to use webstruct because it has an sklearn pipeline and some evaluation tools out of the box. The annotated input text is similar to GATE, e.g. `this is a <NER>test</NER>`: the entities are surrounded by <> tags. The rest of the change just moves the generic code to a more proper place.
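To make the input format concrete, here is a minimal sketch of how such inline-tagged plain text could be split into tokens and per-token labels. It is not the code from this PR and not part of the webstruct API: the helper name `parse_annotated_text`, the whitespace tokenization, and the `"O"` label for unannotated tokens are all assumptions made for illustration.

```python
import re

# Hypothetical helper, not part of webstruct: matches <TAG>...</TAG> spans
# whose opening and closing tag names agree.
TAG_RE = re.compile(r"<(?P<tag>\w+)>(?P<body>.*?)</(?P=tag)>", re.DOTALL)


def parse_annotated_text(text, outside_label="O"):
    """Split inline-annotated text into (tokens, labels).

    Tokens are produced by plain whitespace splitting (an assumption);
    tokens inside <TAG>...</TAG> get TAG as their label, all others get
    `outside_label`.
    """
    tokens, labels = [], []
    pos = 0
    for match in TAG_RE.finditer(text):
        # Text before the annotated span is unlabeled.
        for tok in text[pos:match.start()].split():
            tokens.append(tok)
            labels.append(outside_label)
        # Tokens inside the span are labeled with the tag name.
        for tok in match.group("body").split():
            tokens.append(tok)
            labels.append(match.group("tag"))
        pos = match.end()
    # Trailing text after the last annotated span is unlabeled.
    for tok in text[pos:].split():
        tokens.append(tok)
        labels.append(outside_label)
    return tokens, labels


tokens, labels = parse_annotated_text("this is a <NER>test</NER>")
```

Plain strings are used for tokens here, so the labels stay in a parallel list rather than on wrapper objects; nested or overlapping tags are not handled by this sketch.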