0.8.1
added 2 new properties to a computed MetadataParser object:
    is_redirect = None
    is_redirect_same_host = None
in the case of redirects, only the peername of the final URL (not the source) is available.
if a response is a redirect, it may not be for the same host -- the peername corresponds to the destination URL, not the origin.
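A minimal sketch of checking these properties after a fetch (the URL is a placeholder):

    import metadata_parser
    page = metadata_parser.MetadataParser(url="https://example.com/short-link")
    if page.is_redirect:
        # the peername reflects the destination host, which may differ from the origin
        print(page.is_redirect_same_host)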
0.8.0
this bump introduces 2 new arguments and some changed behavior (see the sketch below):
- `search_head_only=None`. previously, meta/og/etc data was only searched for in the document head (where the HTML specs expect it).
  after indexing millions of pages, many appeared to implement this incorrectly or have HTML so far off specification that
  parsing libraries can't correctly read it (for example, Twitter.com).
  This currently defaults from None to True, but is marked for a future default of `search_head_only=False`.
- `raise_on_invalid`. default False. If True, this will raise a new exception, InvalidDocument, if the response
  does not look like a proper HTML document.
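A minimal sketch of the new arguments (the URL is a placeholder; exact signatures may differ from this changelog-era API):

    import metadata_parser
    page = metadata_parser.MetadataParser(
        url="https://example.com",
        search_head_only=False,  # search the whole document, not just <head>
        raise_on_invalid=True,   # raise InvalidDocument if the response does not look like HTML
    )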
0.7.4
- more aggressive attempts to get the peername.
0.7.3
- this will now try to cache the `peername` of the request (i.e., the remote server) onto the `peername` attribute
0.7.2
- applying a `strip()` to the "title"; bad authors/CMSes often leave stray whitespace.
0.7.1
- added kwargs to docstrings
- `get_metadata_link` behavior has been changed as follows (sketched below):
* if an encoded URI is present (starts with `data:image/`)
** this will return None by default
** if a kwarg of `allow_encoded_uri=True` is submitted, this will return the encoded URI (without a URL prefix)
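For example (a sketch; "image" stands in for whatever metadata field holds the link):

    import metadata_parser
    page = metadata_parser.MetadataParser(url="https://example.com")
    link = page.get_metadata_link("image")                          # None if the value is a data: URI
    link = page.get_metadata_link("image", allow_encoded_uri=True)  # returns the encoded URI as-is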
0.7.0
- merged https://github.com/jvanasco/metadata_parser/pull/9 from xethorn
- nested all calls to `log` under `__debug__` to avoid logging overhead in production when PYTHONOPTIMIZE is set
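The guard pattern looks roughly like this; under `python -O` (or PYTHONOPTIMIZE), `__debug__` is False and the block is compiled away:

    import logging
    log = logging.getLogger("metadata_parser")
    if __debug__:
        log.debug("fetching %s", "https://example.com")  # no cost when optimizations are on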
0.6.18
- migrated version string into __init__.py
0.6.17
- added a new `DummyResponse` class to mimic popular attributes of a `requests.Response` object when parsing from HTML files
0.6.16
- incorporated pull request #8 (https://github.com/jvanasco/metadata_parser/pull/8), which fixes issue #5 (https://github.com/jvanasco/metadata_parser/issues/5) concerning comments
0.6.15
- fixed README, which used the old API in the example
0.6.14
- fixed a typo and another bug in BeautifulSoup parsing that had slipped past some tests. TODO: migrate tests to the public repo.
0.6.13
- trying to integrate a "safe read"
0.6.12
- now passing `stream=True` to `requests.get`. this fetches the headers first, before iterating through the response body; many issues can be avoided with this approach (see the sketch below)
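This is the standard `requests` streaming pattern, roughly:

    import requests
    r = requests.get("https://example.com", stream=True)  # headers arrive first; body is not read yet
    r.raise_for_status()                                  # status/headers can be inspected before downloading
    body = b"".join(r.iter_content(chunk_size=8192))      # read the body only if it looks worth parsing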
0.6.11
- now correctly validating urls with ports. had to restructure a lot of the url validation
0.6.10
- changed how some nodes are inspected. this should lead to fewer errors
0.6.9
- added a new method `get_metadata_link()`, which applies link transformations to a metadata value in an attempt to ensure a valid link
0.6.8
- added a kwarg `requests_timeout` to proxy a timeout value to `requests.get()`
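For example (a sketch, assuming the kwarg is accepted at construction time):

    import metadata_parser
    page = metadata_parser.MetadataParser(url="https://example.com", requests_timeout=5)  # seconds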
0.6.7
- added a lockdown option to `is_parsed_valid_url` named `http_only` -- requires http/https for the scheme
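A sketch of the check (assuming `is_parsed_valid_url` accepts a `urlparse` result, as its name suggests):

    from urllib.parse import urlparse
    from metadata_parser import is_parsed_valid_url
    parsed = urlparse("ftp://example.com/file")
    assert not is_parsed_valid_url(parsed, http_only=True)  # rejected: scheme is not http/https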
0.6.6
- protecting against bad doctypes, like nasa.gov
-- added `force_doctype` to __init__; defaults to False. this will rewrite the doctype to get around bs4/lxml issues
0.6.5
- keeping the parsed BS4 document; a user may wish to perform further operations on it (see below).
-- the `MetadataParser.soup` attribute holds the BS4 document
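For example:

    import metadata_parser
    page = metadata_parser.MetadataParser(url="https://example.com")
    links = page.soup.find_all("a")  # page.soup is the parsed BeautifulSoup document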
0.6.4
- flake8 fixes. purely cosmetic.
0.6.3
- no changes. `sdist upload` was picking up a reference file that wasn't in github; that file killed the distribution install
0.6.2
- formatting fixes via flake8
0.6.1
- Lightweight, but functional, URL validation (see the sketch below)
-- new '__init__' argument (defaults to True): `require_public_netloc`
-- this will ensure a URL's hostname/netloc is either an IPv4 address or a "public DNS" name
-- if the netloc is entirely numeric, it must be a valid IPv4 address
-- if the netloc is alphanumeric, it must have a TLD + domain (exception: "localhost")
-- this is NOT RFC compliant, but designed for "Real Life" use cases.
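For example (a sketch; the intranet URL is a placeholder):

    import metadata_parser
    page = metadata_parser.MetadataParser(
        url="http://intranet/status",  # no TLD; rejected when require_public_netloc=True
        require_public_netloc=False,
    )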
0.6.0
- Several fixes to improve support for canonical and absolute URLs
-- replaced REGEX parsing of URLs with `urlparse` parsing and inspection; too many edge cases slipped through
-- refactored `MetadataParser.absolute_url`, which now proxies a call to the new function `url_to_absolute_url`
-- refactored `MetadataParser.get_discrete_url`, now cleaner and leaner.
-- refactored how some tests run, so there is cleaner output
0.5.8
- trying to fix some issues with distribution
0.5.7
- trying to parse unparsable pages was creating an error
-- `MetadataParser.__init__` now accepts `only_parse_file_extensions` -- a list of the only file extensions to parse (see below)
-- `MetadataParser.__init__` now accepts `force_parse_invalid_content_type` -- forces parsing of invalid content types
-- `MetadataParser.fetch_url` will only parse "text/html" content by default
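For example (a sketch; whether the extensions include a leading dot is an assumption here):

    import metadata_parser
    page = metadata_parser.MetadataParser(
        url="https://example.com/page.html",
        only_parse_file_extensions=[".html", ".htm"],  # assumed format; everything else is skipped
    )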
0.5.6
- trying to ensure we return a valid URL in `get_discrete_url()`
- adding in some proper unit tests; slowly migrating from the private demos (the private demos hit a lot of internal files and public URLs; it wouldn't be proper to make these public)
- setting `self.url_actual = url` on __init__. this will get overridden on a `fetch`, but allows for a fallback on html docs passed through
0.5.5
- Dropped BS3 support
- tested Python 3 support (added by Paul Bonser [ https://github.com/pib ])
0.5.4
- Pull Request - https://github.com/jvanasco/metadata_parser/pull/1
Credit to Paul Bonser [ https://github.com/pib ]
0.5.3
- added a few `.strip()` calls to clean up metadata values
0.5.2
- fixed an issue in HTML title parsing. the old method incorrectly ran a regex against a BS4 tag, not the tag contents, creating character encoding issues.
0.5.1
- missed the `ssl_verify` argument
0.5.0
- migrated to the requests library
0.4.13
- trapping all errors in httplib and urllib2; raising them as a NotParsable and sticking the original error into the `raised` attribute.
  this will allow for cleaner error handling
- we *really* need to move to requests.py
0.4.12
- created a workaround for ShareThis hashbang URLs, which urllib2 doesn't like
- we need to move to requests.py
0.4.11
- added more relaxed controls for parsing safe files
0.4.10
- fixed the `force_parse` arg on __init__
- added support for more filetypes
0.4.9
- support for gzip documents that pad with extra data (the spec allows this; the python gzip module doesn't)
- ensure proper document format
0.4.8
- added support for Twitter's own og-style markup
- cleaned up the BeautifulSoup finds for og data
- moved 'try' from encapsulating 'for' blocks to encapsulating the inner loop. this will pull more data out if an error occurs.
0.4.7
- cleaned up some code
0.4.6
- realized that some servers return gzip content despite this client not advertising that it accepts it; fixed by using some ideas from Mark Pilgrim's feedparser. metadata_parser now advertises gzip and zlib, and decompresses as needed
0.4.5
- fixed a bug that prevented top-level directories from being parsed
0.4.4
- made redirect/masked/shortened links have better dereferenced url support
0.4.2
- Wrapped title tag traversal with an AttributeError try block
- Wrapped canonical tag lookup with a KeyError try block, defaulting to 'href' then 'content'
- Added support for `url_actual` and `url_info`, which persist the data from the urllib2.urlopen object's `geturl()` and `info()`
- `get_discrete_url` and `absolute_url` use the underlying url_actual data
- added support for passing data and headers into urllib2 requests
0.4.1
Initial Release