Add date features #58
Open
Kebniss wants to merge 450 commits into master from date-features
Commits
98b3ae4
Remove example notebooks and models from repo.
kmike e3defff
Merge pull request #10 from tpeng/crfsuite-backend
kmike 40a5415
simplify CombinedFeatures and make it private
kmike 45a005f
features.utils -> feature.global_features
kmike 4ee7f40
TST fix tests
kmike a636a9a
replace Ngram global feature with Pattern
kmike 04eed65
DOC fix autodocs
kmike 115a5a4
DOC minor fixes
kmike 52759bd
(backwards-incompatible) kill default features:
kmike a91d1c9
(backwards-incompatible) rename "transform" to "predict" for estimato…
kmike ab1b589
TST don't require NLTK for tests
kmike 9204eec
simple __repr__ for HtmlToken
kmike 829f708
(backwards-incompatible) all create_wapiti_pipeline wapiti params
kmike e52ab9e
WordTokenizer.tokenize rewritten
chekunkov 98a2a0b
doctests indent
chekunkov 989072c
fix unicode handling for a new tokenizer; add pounds char to rules
kmike 177ad80
Merge branch 'speed_up_text_tokenizer' of https://github.com/chekunko…
kmike 5fe04f6
Merge pull request #16 from scrapinghub/speed_up_text_tokenizer
kmike 226e53f
small tokenizer cleanup
kmike 24926c5
make min_length and max_length arguments required for utils.substrings
kmike b6d60f1
add crfsuite backend base on python-crfsuite
tpeng e3ef37a
DOC: fix crfsuite docstring
tpeng f96cae1
DOC fix style and typo
tpeng 383f8b7
fix HtmlTokenizer pickling
kmike 0adaaf2
WapitiCRF.fit returns self
kmike 92553b7
train_test_split_noshuffle
kmike 55598e0
TST runcoverage script
kmike a2111d4
python-crfsuite support; tests for NER and crfsuite pipeline
kmike 01b0ee6
expose CRFsuiteCRF and CCRFsuiteFeatureEncoder
kmike 0f248b6
rename wapiti_kwargs to crf_kwargs for consistency
kmike 441ebf4
move tostr to wapiti module because it is wapiti-specific
kmike 7d12376
NER.annotate and NER.annotate_url methods
kmike 85e9407
Abstract temporary model files handling; add this feature to wapiti. …
kmike 9525c46
A corpus (not annotated yet) with 450 pages from business websites in…
kmike 38730d8
add EMAIL to dtd in order to load annotated files properly
kmike 4619e8f
annotation fixes
kmike be9a91c
Fix html produced by WebAnnotator.
kmike 591051d
(backwards incompatible) drop existing `load_trees`; rename `load_tre…
kmike 5bb3768
make it possible to use existing WebAnnotator colors
kmike 6cd6265
+100 annotated pages
kmike 2e746c4
annotation fixes
kmike 223d8f1
annotation fixes
kmike 8875d3c
more annotation fixes
kmike 146ad5e
+100 pages
kmike 448048e
annotation fixes
kmike 87279df
BUG fix an issue with WebAnnotatorLoader: it shouldn't add extra "Non…
kmike 2150bda
fix a test after annotation fix
kmike 79d81c5
easier Trainer customization for CRFsuiteCRF
kmike a98431e
X_dev and y_dev support for webstruct.crfsuite
kmike 1c47f9e
+100 pages
kmike e9ebeaa
doctests (failing) for some tokenization gotchas
kmike f80c382
expose LongestMatchGlobalFeature
kmike 1c17e7c
annotations fix
kmike 17a5d4e
one more failing tokenization example
kmike 9d8fcdc
webstruct.gazetteers.geonames.read_geonames_zipped: try to handle geo…
kmike ce775e6
DAWG gazetteers support (they are much faster than MARISA-based, but …
kmike 6ee718f
more annotated data
kmike ed40e3e
CRFsuiteFeatureEncoder is not needed with python-crfsuite==0.6
kmike b2cb0e7
Undocumented HtmlFeatureExtractor post-processing step is removed to …
kmike 649c814
bias feature
kmike 12be72e
tiny speedup for BestMatch._find_matches
kmike 727f61b
NER.extract_groups_from_url
kmike cd1860d
export webstruct.smart_join
kmike 56cd57e
annotation fixes (more locations for about 70 pages)
kmike 4019595
DOC suggest to use "Save as" in WebAnnotator
kmike c0448c9
get rid of seqlearn dependency
tpeng 3dc2024
fix document
tpeng 6e36995
Merge pull request #23 from tpeng/remove-seqlearn-deps
kmike 9cfe657
Update requirements so that they will work automatically
Suor 1c4c378
Set up tox to test py27, py33, py34 and docs
Suor 3060950
Add Travis CI config
Suor ee25440
Use miniconda to test on Travis CI
Suor 225cc76
Merge pull request #28 from Suor/travis
kmike c7c79b5
Migrate code to support Python 3
Suor 4de7573
Rename cross module to compat
Suor b5d19c8
Get rid of bprint()/bformat()
Suor 0e66518
Return to more natural doctest in HtmlTokenizer.tokenize_single()
Suor d2d3d5c
Set ELLIPSIS and IGNORE_UNICODE as default doctest options
Suor 22c27c4
Add Python 3 version modifiers to setup.py
Suor 86d44e6
Update python version requirements in installation docs
Suor d7e2fae
Merge pull request #29 from Suor/py3-clean
kmike b3be38c
add Travis badge to readme
kmike eba8084
fix requirements.txt: cython is no longer needed; bump python-crfsuit…
kmike a54dae3
Fix setup.py requires
Suor d21a83f
fixing typo: toolikit -> toolkit
carlosp420 8133674
Merge pull request #31 from carlosp420/patch-0
kmike 06be1b4
declare Python 3.5 support
kmike d8f1d0a
bump version to 0.3
kmike 6d3d109
Merge pull request #30 from Suor/master
kmike 005c88b
fixed compatibility with recent scikit-learn
kmike f8fa440
TST simplify travis.yml. See GH-33.
kmike d043435
TST don’t test with Python 3.3
kmike 0dfc6ac
TST don’t run tests twice for pull requests
kmike e7d552e
Merge pull request #34 from scrapinghub/fix-ci
kmike 920df38
(backwards incompatible) remove custom CRFsuite wrapper, use sklearn-…
kmike c49301f
Merge pull request #35 from scrapinghub/sklearn-crfsuite
kmike 93fc8c2
DOC more documentation for webstruct_data datasets
kmike db287d2
annotation fixes: emails, org names
kmike 2c611c4
preserve comments in loaded trees
kmike 03a82b4
annotations: remove problematic js code
kmike c51140a
DOC clarify known_entities of GateLoader
kmike 1d0f4ac
add country names gazetteer
kmike 9000067
TST switch to pytest, check that docs are building without warnings
kmike 71f1e34
gitignore more files
kmike 54b61a6
TST revert strict doc check
kmike a509bcb
Update codecov.yml
kmike e0fde7e
add codecov badge
kmike 51684c0
DOC whoops, fix whitespaces in README
kmike 3e05642
fixed NER.extract_groups_from_url `dont_penalize` argument
kmike d44d6f4
extract_entity_groups utility function
kmike 6000221
move HtmlTokenizer to its own module
kmike 8e5d98c
DOC trying to fix readthedocs build
kmike 786a1f0
DOC try to fix readthedocs, again..
kmike 9b7986b
bump version to 0.4; add changelog
kmike 784fd3a
DOC typo fixes
kmike 071bc78
fixed NER.extract bug
kmike 628c8c2
bump version
kmike b63b9bb
webstruct.infer_domain
kmike 5c33d14
TST create html coverage report locally by default
kmike 1856b46
style fix: proper blank lines in imports
kmike 97d6d37
Merge pull request #38 from scrapinghub/infer-domain
kmike c4786cb
preserve URL in <base> tag
kmike 1499ad0
Merge pull request #39 from scrapinghub/wa-baseurl
kmike dfe77c2
switch to requests
kmike 13d4437
a few countries.txt gazetter improvements
kmike 0d6eaf7
fixed warning when reading geonames
kmike 8c43f41
ignore more files in gitignore
kmike a5282a7
DOC more badges in README
kmike 5a3f39e
v0.5
kmike 8fb60d3
A complete example (contact extraction). See GH-24.
kmike 9949492
DOC fix example README
kmike f656552
DOC mention requirements.txt in the example's README
kmike 56913e2
hand made annotation
3aeb2d6
fix annotations
c016135
Merge pull request #41 from whalebot-helmsman/master
kmike 7a42a23
add description for punctuation removing (#42)
whalebot-helmsman 210f81a
more annotations
whalebot-helmsman e4fac51
more annotations
whalebot-helmsman 2df7c24
more annotations
whalebot-helmsman b3956cc
correct ids
whalebot-helmsman 51ef652
correct ids
whalebot-helmsman f1e002c
does not copy wa-title attributes
whalebot-helmsman 807f3a6
verify conversion
whalebot-helmsman 57bc016
convert annotation
whalebot-helmsman 02aad41
write as html
whalebot-helmsman b7e1e17
move gate annotations to webannotator
whalebot-helmsman eb97aa5
tests for html tools
whalebot-helmsman e541806
pep8 style
whalebot-helmsman d42dcea
add program description
whalebot-helmsman 37b8728
pep8 style
whalebot-helmsman 80a6fb8
pep8 style
whalebot-helmsman dca5dd3
add program description
whalebot-helmsman 9e480d4
pep8 style
whalebot-helmsman c1a1175
ability to pass entities list to verify
whalebot-helmsman 5feb78a
look for annotations in WebAnnotator folder
whalebot-helmsman 6d59b83
pep8
whalebot-helmsman 2a7b013
test attribute removal for wa-title
whalebot-helmsman 3556a57
Merge pull request #47 from whalebot-helmsman/master
kmike bc26275
mess is gone
whalebot-helmsman 4130d78
no need for gate loader
whalebot-helmsman c2af278
Merge pull request #48 from whalebot-helmsman/master
kmike 36d56f2
text tokenizer return postions of token
whalebot-helmsman 2d4d2ef
update tests
whalebot-helmsman 80658ca
separate statement for every action
whalebot-helmsman c52e449
comma preserving test
whalebot-helmsman 8178776
too much tokens around
whalebot-helmsman 51c0932
encode in indices instead of entities
whalebot-helmsman 1a667ec
handle empty lists
whalebot-helmsman 24465b1
pass token length and position from TextToken to HtmlToken
whalebot-helmsman 06befbb
letter perfect detokenization
whalebot-helmsman e5730b2
do not cleanup tokenized tree by default, separate method for tree cl…
e340444
update tests for separate tree cleaning
89673c1
update tests for correct punctuation positions
7c45984
correct length for replaced quotes
46fc4df
pep8
90bdefd
new html tree based to webannotator transformer
1fb67a0
ignore scripts and styles
3117640
ignore elements with non-text tokens
084fb33
as we search use our regexp for text and tail in same moment, our sta…
43449a1
pep8
388170e
comma at line end, not start
71caf61
one join instead of many additions, dont be Schleimel
37d7470
correct formatting
e93c6dc
add clarification
e02c275
fix typo
f26569f
pep8
d1aecbb
preserve tokenize method for compatibility
35a9d88
function to reduce code in tests
9033188
remove test for nltk tokenizer
c14f363
test our behaviour, which difers from original treebank tokenizer
a071cd4
remove useless conversion
a33f564
rename method to avoid confusion with nltk tokenize_span method
75a9698
remove brittle tests
4729323
small benchmark for html tokenizer
943a44e
Revert "remove brittle tests"
whalebot-helmsman ba7d6fe
move brittle tests to pytest xfail
whalebot-helmsman b72bcc1
expect behaviour of nltk tokenizer
whalebot-helmsman f9190c3
Merge pull request #49 from whalebot-helmsman/master
kmike 09f1699
Merge branch 'master' into webannotator-html
whalebot-helmsman 281d4a5
rename variable
whalebot-helmsman a0d2519
make TagPosition private
whalebot-helmsman caa76cc
make translate_to_dfs private
whalebot-helmsman 500ccf4
make fabricate_start/end private
whalebot-helmsman a743aed
make enclosure private
whalebot-helmsman f7e7a86
move enclosure deciding to separate function
whalebot-helmsman 91c3962
rename generic tasks to concrete enclosures
whalebot-helmsman 9e3b49a
move dfs order numbering to separate function
whalebot-helmsman 3266427
move start/end tag locating in separate function
whalebot-helmsman 7d56973
pep8
whalebot-helmsman 1dc3f28
high level explanation of whats heppening here
whalebot-helmsman a92a339
no unicode tags, so string_types is enough
whalebot-helmsman 833603b
reduce code
whalebot-helmsman 4f22537
Merge pull request #50 from whalebot-helmsman/master
kmike ced2fd8
tutorial rewritten with usage of crfsuite
sibiryakov 67763e6
wapiti link restored
sibiryakov 770d777
Merge pull request #52 from scrapinghub/crfsuite-tutorial
kmike 0bb8fd7
wapiti return bytes, not str
whalebot-helmsman 2d92efb
collect all top N results but return only first of them
whalebot-helmsman b801d7a
merge top N chains for better recall
whalebot-helmsman 739e269
benchmark script for model prediction
whalebot-helmsman d8afda6
we need newer wapiti version for python3 support
whalebot-helmsman 0d92091
add various overlapping schemes for chains
whalebot-helmsman 3842740
add description of merging method
whalebot-helmsman 83b5327
Merge pull request #55 from whalebot-helmsman/master
kmike 1713694
there are various types of unusual tags, not only comments
whalebot-helmsman 7a68569
Merge pull request #56 from whalebot-helmsman/master
kmike 0176cdb
non-recursive implementation of algorithm
whalebot-helmsman f4a1896
add description of WordTokenizer improvements
whalebot-helmsman 3e09c9f
changd comment as code structure changed
whalebot-helmsman bff4c3e
Merge pull request #57 from whalebot-helmsman/master
kmike d8b1984
don't declare Python 3.3 support
kmike d5a7fcf
v0.6
kmike e7a9716
Add date features
Kebniss 539b20c
Remove XX\XX\XXXX from looks_like_date_pattern because regex was not …
Kebniss 9cc8b80
Add todo list for solving small bugs
Kebniss f838fcc
put test_pattern_features.py back
Kebniss 2e77ec2
Remove duplicate code from tests
Kebniss d5f7deb
looks_like_day_ordinal True only for numbers between 0 and 32
Kebniss 2910304
add looks_like_ordinal and remove looks_like_ordinal_day + modify tests
Kebniss c848bfe
force all values to be string in order to join them
Kebniss 3aef718
Fix looks_like_date and tests
Kebniss f68431f
Cast values to string using py2 and py3 compatible method
Kebniss f8fa819
Update .travis.yml
Kebniss 4d0fb06
swicth re.fullmatch to anchors for compatibility with py2
Kebniss cd82fc7
Merge branch 'date-features' of github.com:scrapinghub/webstruct into…
Kebniss 09c4e0a
speed looks_like_date + rename looks_like_ordinal_en + fix tests
Kebniss c1b71f6
Copy w3lib/to_native_str in utils + remove w3lib dependency
Kebniss e906ccd
Remove todo
Kebniss 2aced50
remove to_native_str
Kebniss 4646095
fix list comprehension
Kebniss
This is not correct in Python 2, as you'll be casting unicode features to str (i.e. to bytes).
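One way to avoid the problem is to convert every value to the text type rather than to the native `str`, so that unicode features stay unicode on Python 2. The following is only a minimal sketch: `to_text` and `join_features` are hypothetical helper names used for illustration, not functions from this pull request or from webstruct, and the UTF-8 default for decoding bytes is an assumption.

```python
import sys

PY2 = sys.version_info[0] == 2
# On Python 2 the text type is `unicode`; on Python 3 it is `str`.
# The name `unicode` is only evaluated when running on Python 2.
text_type = unicode if PY2 else str  # noqa: F821


def to_text(value, encoding="utf-8"):
    """Hypothetical helper: return `value` as text, never as bytes."""
    if isinstance(value, text_type):
        return value
    if isinstance(value, bytes):
        # On Python 2 this branch also covers the native `str` type.
        return value.decode(encoding)
    # Numbers, booleans, None and other objects.
    return text_type(value)


def join_features(values, sep=u" "):
    """Hypothetical helper: join mixed feature values into one text string."""
    return sep.join(to_text(v) for v in values)
```

For example, `join_features([u"März", 12, b"2015"])` gives a text string on both Python 2 and Python 3, whereas calling `str()` on `u"März"` under Python 2 would attempt an ASCII encode and fail.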