Tokenizer fixes and span_tokenize method #20
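
The title refers to a ``span_tokenize`` method. The actual implementation is not visible in this part of the diff, so the following is only a minimal sketch of the general idea, following the convention used by NLTK tokenizers, where ``span_tokenize`` yields ``(start, end)`` character offsets instead of token strings. The regex and both function names here are illustrative assumptions, not webstruct's real tokenization rules.

```python
import re

# Hypothetical token pattern (an assumption for illustration only):
# a run of word characters, or a single non-space punctuation character.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def span_tokenize(text):
    """Yield (start, end) character offsets of each token in ``text``."""
    for match in _TOKEN_RE.finditer(text):
        yield match.span()


def tokenize(text):
    """Return token strings, derived from the spans above."""
    return [text[start:end] for start, end in span_tokenize(text)]


if __name__ == "__main__":
    sample = "Hi, there"
    print(list(span_tokenize(sample)))
    print(tokenize(sample))
```

Building ``tokenize`` on top of ``span_tokenize`` keeps the two in agreement, and the offsets make it possible to map each token back to its position in the original text.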

Open. Wants to merge 267 commits into base: master.

Commits (267):
8392112
change tokenization rules again: don't split : and don't handle contr…
kmike Jul 31, 2013
c9cbedc
always clean html before feature extraction
kmike Jul 31, 2013
a59c4d1
let +34 and -8 be numbers
kmike Jul 31, 2013
4a44842
more features
kmike Jul 31, 2013
113a5c9
more ideas for tags
kmike Jul 31, 2013
feeb392
more annotated data
kmike Jul 31, 2013
9792d0e
finished annotating US contacts pages corpus
kmike Jul 31, 2013
58de2dd
utility for grouping IOB-encoded entities
kmike Jul 31, 2013
1cd4ed6
discourage usage of preprocess.to_features_and_labels
kmike Aug 1, 2013
cf1877b
utility for substrings extraction
kmike Aug 1, 2013
cd40b8d
small cleanup
kmike Aug 5, 2013
4badffc
WapitiTagger class
kmike Aug 5, 2013
4f26c07
add IDEA files to gitignore
kmike Aug 5, 2013
43314ba
handle an edge case for feature extraction
kmike Aug 6, 2013
1ea5656
WapitiChunker is a better name
kmike Aug 6, 2013
caa97de
more training data
kmike Aug 6, 2013
2c37a10
more training data
kmike Aug 6, 2013
66c13d9
remove nltk dependency
kmike Aug 6, 2013
9cac9ed
clarify requirements
kmike Aug 9, 2013
e42cfd0
fix default value in cleaning script
kmike Aug 9, 2013
5b2e22f
reannotated corpus (2/3 so far)
kmike Aug 9, 2013
3be853f
annotation guidelines
kmike Aug 9, 2013
71ba386
finish reannotation
kmike Aug 9, 2013
d43da01
new tags
kmike Aug 9, 2013
5ac1800
simple docs
kmike Aug 11, 2013
6e13b51
add IPython temp files to gitignore
kmike Aug 11, 2013
0e494e4
prepare html pages for NL
tpeng Sep 25, 2013
5a03eab
annotate NL pages
tpeng Sep 25, 2013
91cef1e
fix encoding
tpeng Sep 26, 2013
acf4519
add notebook to train NL open hours parser
tpeng Sep 26, 2013
3634408
add more datetime features
tpeng Sep 26, 2013
3ae0dca
tidy up annotated data
tpeng Sep 26, 2013
4838588
remove ambiguous 316.html and fix 166.xml
tpeng Sep 26, 2013
5265bd0
template tweaks for NL open hour parser
tpeng Sep 26, 2013
bde0e64
add notebook to test the NL openhours parser
tpeng Sep 26, 2013
22b4cbe
fix typos
tpeng Sep 27, 2013
c3f1b41
get rid of block model
tpeng Sep 27, 2013
20ae29e
webstruct_token -> webstruct
tpeng Sep 27, 2013
85360f5
fix import
tpeng Sep 27, 2013
23962f2
update model training in notebooks
tpeng Sep 27, 2013
7dcc8b3
don't set default tags and feature functions
tpeng Sep 27, 2013
4b871e0
update nl-openhousr-parser train/test notebooks
tpeng Sep 27, 2013
6b71e36
update setup.py
tpeng Sep 27, 2013
47df48a
make the feature_extractor in WapitiEncoder non-optional
tpeng Sep 27, 2013
1f97a3f
Merge pull request #2 from scrapinghub/remove-block-model
kmike Sep 27, 2013
220a25b
as we started to use setuptools, use it everywhere
kmike Sep 27, 2013
3f15fa9
minor cleanup: use full imports in docstrings, don't use import *, mo…
kmike Sep 27, 2013
3f4c7ef
fix training notebook
kmike Sep 27, 2013
5fb8dc3
fix some NL annotated data
tpeng Oct 15, 2013
916b003
prepare data for Ireland openhours parser
tpeng Oct 16, 2013
0307c4d
annotate Ireland pages for openhours parser
tpeng Oct 16, 2013
06aa676
add notebook to train openhours parser for Ireland
tpeng Oct 16, 2013
15e8c4a
notebooks and document tweaks
tpeng Oct 16, 2013
246e13e
fix NL/IE openhours train notebooks
tpeng Oct 16, 2013
696cf1b
tweak IE openhours training notebook
tpeng Oct 17, 2013
73d9b65
set encoding argument implicitly
tpeng Oct 18, 2013
37e432e
Merge pull request #3 from scrapinghub/fix-encoding
kmike Oct 18, 2013
ce8ac23
fix some IE training data
tpeng Oct 21, 2013
b9d843d
bump version to 0.1.1
tpeng Oct 22, 2013
876f20c
retrain IE openhours parser and checkin the generated model to make i…
tpeng Oct 22, 2013
0fa6f9a
annotate IE pages for parsing address
tpeng Nov 1, 2013
e98d5e0
add notebook for training IE address parser
tpeng Nov 1, 2013
bbbd394
fix some IE annotated data
tpeng Nov 1, 2013
834ac42
add inside_bold_feature and also tokenize the comma on end of a string
tpeng Nov 1, 2013
44e1e3c
update IE address parser nb
tpeng Nov 1, 2013
84e5d46
add utils to do more cleanups in html
tpeng Nov 1, 2013
e1cb18a
update the IE address parser nb
tpeng Nov 1, 2013
c43e89d
move the htmls cleanups to HTML feature generator's subclass
tpeng Nov 4, 2013
815e0dd
convert the h2 to strong too
tpeng Nov 4, 2013
4962dec
fix typo
tpeng Nov 4, 2013
f487f9a
split on comma after remove the comma in digits
tpeng Nov 4, 2013
3a674d6
fix tokenizer
tpeng Nov 4, 2013
e3616b0
fix GATE broken br elements
tpeng Nov 6, 2013
6b9050b
retrain the IE address parser and check in the models
tpeng Nov 6, 2013
644c486
more fixes on the broken annotated pages by GATE
tpeng Nov 7, 2013
09d194a
Merge pull request #4 from scrapinghub/annotate-ie-address
tpeng Nov 7, 2013
c048a6b
update document and bump version
tpeng Nov 7, 2013
819cfce
training data fixes
kmike Dec 13, 2013
31a85fc
big refactoring
kmike Dec 13, 2013
01949fa
split features into token features and global features
kmike Dec 17, 2013
8e3f7e3
make HtmlToken.token unicode
kmike Dec 17, 2013
61553a6
«tag» now means NER tag
kmike Dec 24, 2013
ccaf723
HtmlLoader
kmike Dec 24, 2013
f7efc8b
added support for WebAnnotator > 1.14 title annotation feature
kmike Dec 25, 2013
848e94b
break interface again: fit/transform methods now accepts multiple seq…
kmike Dec 25, 2013
e1257eb
tokenization changes: split by «|», make tokenizer aware of some unic…
kmike Dec 26, 2013
2f5d34d
one more tokenization fix
kmike Dec 26, 2013
9867983
trainer for Wapiti CRF models
kmike Dec 26, 2013
00c328e
make HtmlTokenizer and HtmlFeatureExtractor work on lists of trees by…
kmike Dec 26, 2013
5a5b455
add load_trees helper for bulk loading data
kmike Dec 26, 2013
7f5ecb1
add WapitiCRF to top-level exports
kmike Dec 26, 2013
02b79f5
update requirements.txt
kmike Dec 26, 2013
00bdf22
remove WapitiChunker; add transform and score methods to WapitiCRF
kmike Dec 26, 2013
a09b0a4
attributes are renamed to fix serialization and __repr__
kmike Dec 27, 2013
8942108
HtmlLoader cleanup
kmike Dec 27, 2013
fcf9840
smart_join utility function
kmike Dec 27, 2013
0bf420e
add support for auto-extracting dev data for wapiti training
kmike Dec 27, 2013
8917fba
move load_trees to the bottom and expose it in webstruct top-level na…
kmike Dec 27, 2013
874a2d5
a couple of helpers for easier training and prediction
kmike Dec 27, 2013
89e5779
smarter smart_join
kmike Dec 27, 2013
04e1728
improved docstring for IobEncoder.group
kmike Jan 9, 2014
113ae9c
extract_raw method for model.NER
kmike Jan 9, 2014
62131d9
heuristic algorithm for grouping entities into clusters
kmike Jan 10, 2014
e809864
minor docstring fix
kmike Jan 10, 2014
3892a7c
[wip] gazetteers support
kmike Dec 17, 2013
81a567c
Drop prebuilt gazetteer features; better utils for creating own gazet…
kmike Jan 13, 2014
a262e0b
minor docstring fix for geonames.read_geonames; extract csv parameter…
kmike Jan 14, 2014
e2eb818
utility for reading zipped geonames files
kmike Jan 14, 2014
b205c04
import pandas only on demand
kmike Jan 14, 2014
ab797de
don’t remove forms and annoying tags by default
kmike Jan 18, 2014
0111de9
support passing LongestMatch instances to LongestMatchGlobalFeature
kmike Jan 29, 2014
6d8793c
split token_shape feature function into several smaller functions
kmike Jan 29, 2014
945f219
split prefix and suffix features
kmike Jan 29, 2014
545ae30
cut-off support for HtmlFeatureExtractor
kmike Jan 30, 2014
9064c58
Merge pull request #6 from kmike/refactor
kmike Feb 12, 2014
5e2a77b
move Ireland address parsing out
tpeng Feb 13, 2014
a609e29
Merge pull request #9 from scrapinghub/moving-ie-address-parser
kmike Feb 13, 2014
0c46c82
Simpler regex for email matching.
kmike Feb 21, 2014
e16d0f0
add dev requirements - they are needed only to run tests and build docs
kmike Feb 25, 2014
d513c33
add functions and classes from webstruct.model to top-level namespace
kmike Feb 25, 2014
a6d7758
rename htmltoken_lists argument to html_token_lists
kmike Feb 25, 2014
deb6aab
make marisa_trie import optional
kmike Feb 25, 2014
2ca2bc8
a lot of documentation improvements
kmike Feb 25, 2014
67867c9
requirements fixes
kmike Feb 25, 2014
8177cb9
allow to build docs without installing lxml and scikit-learn
kmike Feb 25, 2014
1a80ed9
requirements-doc.txt
kmike Feb 25, 2014
c72a4c4
DOC more docs
kmike Feb 26, 2014
e531fca
rename some attributes_ to attributes
kmike Feb 26, 2014
eac30ea
DOC better tutorial (work in progress)
kmike Feb 27, 2014
bafd2c9
DOC tutorial improvements
kmike Feb 27, 2014
4d3b47d
DOC minor tutorial fixes
kmike Feb 27, 2014
e5eec90
DOC split api.rst into several files and other doc improvements
kmike Feb 28, 2014
86d1ade
(backwards-incompatible) move create_wapiti_pipeline to webstruct.wapiti
kmike Feb 28, 2014
8a4c426
DOC tutorial improvements
kmike Feb 28, 2014
0433506
a hook for customizing NER.extract results
kmike Feb 28, 2014
626fc56
NER.extarct_groups method
kmike Feb 28, 2014
3d5efa2
DOC minor tutorial improvements
kmike Feb 28, 2014
a061c33
fix NER extract methods
kmike Feb 28, 2014
01de5be
DOC entity grouping docs
kmike Feb 28, 2014
e3ff287
DOC minor entity grouping docs fixes
kmike Feb 28, 2014
8736a34
DOC move NER and Entity Grouping chapters out of parent section
kmike Mar 1, 2014
bac8829
HtmlToken.root attribute
kmike Mar 3, 2014
b8f9c1f
Don't let __START/END_TAG__ special tokens appear in trees accessible…
kmike Mar 3, 2014
f2c3963
make encoding argument optional for html_document_fromstring
kmike Mar 3, 2014
1ca1761
start webstruct.webannotator module
kmike Mar 3, 2014
c69d9ec
HtmlTokenizer.detokenize_single method for undoing HtmlTokenizer.toke…
kmike Mar 3, 2014
3b1090a
WIP crfsuite support
tpeng Mar 3, 2014
ab7b3de
some small refactor and document improvement
tpeng Mar 5, 2014
b35923e
better stripping regex for __START/END_TAG__ tokens
kmike Mar 6, 2014
8002062
WebAnnotator writer
kmike Mar 8, 2014
36595a0
workaround for lxml < 3.1.2
kmike Mar 8, 2014
9556c2d
fix tostr in previous change
tpeng Mar 8, 2014
59e127b
Less aggressive cleaning: preserve scripts and stylesheets, but don't…
kmike Mar 12, 2014
af239ee
fix webstruct.webannotator handling of attributes that are valid in H…
kmike Mar 12, 2014
1117ef7
make it possible to have consistent colors in to_webannotator function
kmike Mar 18, 2014
5045e60
abandon CRF++ template in CRFsuite backend
tpeng Mar 18, 2014
e791950
fix test failures
tpeng Mar 18, 2014
9637efe
implement ngrams as global feature function
tpeng Mar 24, 2014
f767d7a
allow WebAnnotatorLoader handle nested annotation by introducing know…
tpeng Mar 26, 2014
1a5d66f
change known_tags to known_entities and make it optional for WebAnnot…
tpeng Mar 26, 2014
e404a13
fix error message
tpeng Mar 26, 2014
0e90395
fix error message again
tpeng Mar 26, 2014
e3e5972
DOC: add example for WebAnnotatorLoader
tpeng Mar 26, 2014
7d65b73
change known_entities from list to set in GateLoader too
tpeng Mar 26, 2014
d74a7f6
Merge pull request #11 from tpeng/fix-wa-loader
kmike Mar 26, 2014
c48e748
add us_contact_pages converted to WebAnnotator format
kmike Apr 21, 2014
7c6809d
small fixes to annotation guidelines
kmike Apr 21, 2014
373f62e
delete stale README.rst file which contents was migrated to docs
kmike Apr 21, 2014
70c7408
small cleanup
kmike Apr 21, 2014
f1a860b
DOC nuke old README; minor other documentation fixes.
kmike Apr 21, 2014
f194a1a
improved setup.py
kmike Apr 21, 2014
eea07a6
DOC notes about model development and other tutorial improvements
kmike Apr 21, 2014
1bfc028
DOC installation notes
kmike Apr 21, 2014
dcdbbbe
DOC Python 2.7 is required.
kmike Apr 21, 2014
a8d7f96
hello 0.2
kmike Apr 21, 2014
ccddce0
DOC better wording
kmike Apr 22, 2014
ebaa38c
DOC document webstruct.webannotator
kmike Apr 22, 2014
9f3420e
TST better test coverage for webstruct.utils
kmike Apr 22, 2014
6185516
ignore html coverage reports
kmike Apr 22, 2014
3abe528
TST better test coverage for webstruct.webannotator
kmike Apr 22, 2014
1d6e560
(backwards-incompatible) rename webstruct.tokenizers to webstruct.tex…
kmike Apr 22, 2014
4fb4419
split the PR into 2 parts
tpeng Apr 22, 2014
568862f
missing file in previous change
tpeng Apr 22, 2014
98b3ae4
Remove example notebooks and models from repo.
kmike Apr 22, 2014
e3defff
Merge pull request #10 from tpeng/crfsuite-backend
kmike Apr 22, 2014
40a5415
simplify CombinedFeatures and make it private
kmike Apr 22, 2014
45a005f
features.utils -> feature.global_features
kmike Apr 22, 2014
4ee7f40
TST fix tests
kmike Apr 22, 2014
a636a9a
replace Ngram global feature with Pattern
kmike Apr 22, 2014
04eed65
DOC fix autodocs
kmike Apr 22, 2014
115a5a4
DOC minor fixes
kmike Apr 23, 2014
52759bd
(backwards-incompatible) kill default features:
kmike Apr 23, 2014
a91d1c9
(backwards-incompatible) rename "transform" to "predict" for estimato…
kmike Apr 23, 2014
ab1b589
TST don't require NLTK for tests
kmike Apr 24, 2014
9204eec
simple __repr__ for HtmlToken
kmike Apr 24, 2014
829f708
(backwards-incompatible) all create_wapiti_pipeline wapiti params
kmike Apr 24, 2014
e52ab9e
WordTokenizer.tokenize rewritten
chekunkov May 5, 2014
98a2a0b
doctests indent
chekunkov May 5, 2014
989072c
fix unicode handling for a new tokenizer; add pounds char to rules
kmike May 12, 2014
177ad80
Merge branch 'speed_up_text_tokenizer' of https://github.com/chekunko…
kmike May 12, 2014
5fe04f6
Merge pull request #16 from scrapinghub/speed_up_text_tokenizer
kmike May 12, 2014
226e53f
small tokenizer cleanup
kmike May 13, 2014
24926c5
make min_length and max_length arguments required for utils.substrings
kmike May 14, 2014
b6d60f1
add crfsuite backend base on python-crfsuite
tpeng Apr 23, 2014
e3ef37a
DOC: fix crfsuite docstring
tpeng Apr 24, 2014
f96cae1
DOC fix style and typo
tpeng Apr 24, 2014
383f8b7
fix HtmlTokenizer pickling
kmike May 15, 2014
0adaaf2
WapitiCRF.fit returns self
kmike May 15, 2014
92553b7
train_test_split_noshuffle
kmike May 15, 2014
55598e0
TST runcoverage script
kmike May 15, 2014
a2111d4
python-crfsuite support; tests for NER and crfsuite pipeline
kmike May 15, 2014
01b0ee6
expose CRFsuiteCRF and CCRFsuiteFeatureEncoder
kmike May 16, 2014
0f248b6
rename wapiti_kwargs to crf_kwargs for consistency
kmike May 16, 2014
441ebf4
move tostr to wapiti module because it is wapiti-specific
kmike May 16, 2014
7d12376
NER.annotate and NER.annotate_url methods
kmike May 16, 2014
85e9407
Abstract temporary model files handling; add this feature to wapiti. …
kmike May 16, 2014
9525c46
A corpus (not annotated yet) with 450 pages from business websites in…
kmike May 19, 2014
38730d8
add EMAIL to dtd in order to load annotated files properly
kmike May 19, 2014
4619e8f
annotation fixes
kmike May 19, 2014
be9a91c
Fix html produced by WebAnnotator.
kmike May 19, 2014
591051d
(backwards incompatible) drop existing `load_trees`; rename `load_tre…
kmike May 20, 2014
5bb3768
make it possible to use existing WebAnnotator colors
kmike May 20, 2014
6cd6265
+100 annotated pages
kmike May 20, 2014
2e746c4
annotation fixes
kmike May 21, 2014
223d8f1
annotation fixes
kmike May 21, 2014
8875d3c
more annotation fixes
kmike May 21, 2014
146ad5e
+100 pages
kmike May 21, 2014
448048e
annotation fixes
kmike May 21, 2014
87279df
BUG fix an issue with WebAnnotatorLoader: it shouldn't add extra "Non…
kmike May 21, 2014
2150bda
fix a test after annotation fix
kmike May 21, 2014
79d81c5
easier Trainer customization for CRFsuiteCRF
kmike May 26, 2014
a98431e
X_dev and y_dev support for webstruct.crfsuite
kmike May 26, 2014
1c47f9e
+100 pages
kmike May 27, 2014
e9ebeaa
doctests (failing) for some tokenization gotchas
kmike May 27, 2014
f80c382
expose LongestMatchGlobalFeature
kmike May 27, 2014
1c17e7c
annotations fix
kmike May 27, 2014
17a5d4e
one more failing tokenization example
kmike May 27, 2014
9d8fcdc
webstruct.gazetteers.geonames.read_geonames_zipped: try to handle geo…
kmike May 28, 2014
ce775e6
DAWG gazetteers support (they are much faster than MARISA-based, but …
kmike May 28, 2014
6ee718f
more annotated data
kmike May 28, 2014
ed40e3e
CRFsuiteFeatureEncoder is not needed with python-crfsuite==0.6
kmike May 28, 2014
b2cb0e7
Undocumented HtmlFeatureExtractor post-processing step is removed to …
kmike May 28, 2014
649c814
bias feature
kmike May 28, 2014
12be72e
tiny speedup for BestMatch._find_matches
kmike May 28, 2014
727f61b
NER.extract_groups_from_url
kmike May 30, 2014
cd1860d
export webstruct.smart_join
kmike May 30, 2014
56cd57e
annotation fixes (more locations for about 70 pages)
kmike May 30, 2014
33e638d
tokenizer - dot regex fix. WordTokenizer refactoring to be able to re…
chekunkov Jun 7, 2014
960bc7b
Merge branch 'master' into tokenizer_additional_fixes_and_span_method
chekunkov Jun 7, 2014
a010b00
fixed broken doctests
chekunkov Jun 7, 2014
5 changes: 4 additions & 1 deletion .gitignore
@@ -24,6 +24,7 @@ pip-log.txt
# Unit test / coverage reports
.coverage
.tox
cover
nosetests.xml

# Translations
@@ -35,5 +36,7 @@ nosetests.xml
.pydevproject

# Other
.idea
webstruct_data/datastore

.ipynb_checkpoints
docs/_build
27 changes: 27 additions & 0 deletions README.rst
@@ -0,0 +1,27 @@
Webstruct
=========

Webstruct is a library for creating statistical NER_ systems that work
on HTML data, i.e. a library for building tools that extract named
entities (addresses, organization names, open hours, etc.) from web pages.

Unlike most NER systems, webstruct works on HTML data, not only
on text data. This makes it possible to define features that use HTML
structure, and to embed annotation results back into HTML.

Read the docs_ for more info.

License is MIT.

.. _docs: http://webstruct.readthedocs.org/en/latest/
.. _NER: http://en.wikipedia.org/wiki/Named-entity_recognition

Contributing
------------

* Source code: https://github.com/scrapinghub/webstruct
* Bug tracker: https://github.com/scrapinghub/webstruct/issues

To run tests, make sure nose_ is installed, then run the ``runtests.sh`` script.

.. _nose: https://github.com/nose-devs/nose
13 changes: 0 additions & 13 deletions block_model/README.md

This file was deleted.

11 changes: 0 additions & 11 deletions block_model/convert_html.py

This file was deleted.

16 changes: 0 additions & 16 deletions block_model/convert_labeled_data.py

This file was deleted.

132 changes: 0 additions & 132 deletions block_model/data/1.html

This file was deleted.

32 changes: 0 additions & 32 deletions block_model/data/1.txt

This file was deleted.
