Python interface seems to not work. #356
Comments
Ok, at least part of this is #343, and the fix suggested there does not work:
Anyways, the library path search is still completely broken. What system did a normally installed |
After jumping through the asinine build process for the google test framework, the tests are failing:
It looks like something, somewhere, is failing to link libpthread. I have no idea where to look with the build process, though. I'm only vaguely familiar with make, and have no idea about autoconf. |
Yes, python part doesn't work under py3.4. Multiple import errors, shitty formatting style, wrong c library loading code, multiple errors on trying to parse something using adapters. And these errors seem to be omni-versional, not related to py3. |
Sigil has a fixed version for python 3.4 along with a new beautiful soup 4 adapter that works with the version of the gumbo parser that has been specially modified for use inside Sigil. I am sure it could be easily fixed/adapted for the official gumbo parser. Let me know if you need or want a copy and I'll take a shot at adapting what we have to work here. |
Always wondered how can I find these specific about-demonic pictograms. |
That version is for specific use inside of Sigil's plugin python 2.7 and python3.4 environment. It is set up to work with BS4 (also used internally by Sigil) not html5lib but the bulk of it should be adaptable.
|
If it helps, the sigil-python code we actually deploy to interface with gumbo is here: https://github.com/Sigil-Ebook/Sigil/tree/master/src/Resource_Files/plugin_launchers/python See: sigil_gumboc.py
|
I'm rewriting the binding here: https://github.com/neumond/scutigera/tree/master/scutigera/gumbo Is it worth using CFFI or Cython instead of ctypes? PyPy recommends CFFI (it looks like they bet on the JIT to optimize the interaction), while Cython is advertised in multiple articles across the internet as the fastest possible solution after a pure CPython extension. |
Hmm. Now the question is what encoding gumbo uses internally. It accepts a buffer of bytes. Some of the tests fail at decoding the output as utf-8. Even if it treats input as ascii, which is enough to do HTML parsing, it does need some encoding choice to decipher html entities like |
gumbo only works with properly utf-8 encoded html files. If an html file has any other encoding, it must be converted to utf-8 before being parsed by gumbo. See the readme on this site for details. Also, the source being parsed must remain stored in memory, because the parsed tree holds pointers into the original source. |
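A minimal sketch of the conversion step described above, assuming the source encoding is already known (here latin-1, purely as an example). The helper name `to_utf8` is hypothetical and not part of any gumbo binding.

```python
def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Transcode raw HTML bytes to UTF-8 (hypothetical helper)."""
    return data.decode(source_encoding).encode("utf-8")


# Example input: the same markup stored as latin-1 bytes.
latin1_html = "<p>café</p>".encode("latin-1")
utf8_html = to_utf8(latin1_html, "latin-1")

# Keep a reference to utf8_html for as long as the parsed tree is in use,
# since the tree points back into this buffer.
print(utf8_html)
```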
Exactly as I supposed it to be. Well, for test
// line 195
for (int j = 0; j < 8; j++){
    std::cout << ((int) child->v.text.text[j]) << std::endl;
}
It's Magic. By the way, how can I check the html5lib test suite? Some tests look unreasonable to me, i.e. gumbo works properly and the html5lib test expects the wrong result, e.g. for the noscript tag test.
UPD. It is an interesting character (65533): http://www.fileformat.info/info/unicode/char/0fffd/index.htm
UPD2. Very interesting :)
b'FOO\xc7'.decode('utf-8', errors='replace').encode('utf-8')
b'FOO\xef\xbf\xbd'
UPD3. Now it's better.
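The UPD2 observation can be reproduced in isolation. This is only a sketch of the decoding behaviour on the Python side; it does not touch gumbo itself.

```python
raw = b"FOO\xc7"  # 0xC7 is a UTF-8 lead byte with no continuation byte after it

# Strict decoding rejects the truncated sequence.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# errors='replace' substitutes U+FFFD (decimal 65533) for the bad byte.
replaced = raw.decode("utf-8", errors="replace")

print(strict_ok)
print(ord(replaced[-1]))
print(replaced.encode("utf-8"))
```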
|
As far as I'm aware, Gumbo's output shouldn't ever be invalid UTF-8. Certainly, per spec, Gumbo should be outputting U+FFFD for that, and definitely shouldn't be outputting something broken!
A good starting point nowadays is to look at what your favourite browser does on the Live DOM Viewer, though that doesn't work in the case of |
Ok, I think it's time to dig into the gumbo code to repair this. Who knows, maybe in some cases gumbo will output valid utf8 where it must output replacements. Regarding noscript, that's one of the obscure things I didn't know about. Considering this example <p id="status"><noscript><strong>A</strong></noscript><span>B</span></p> If I inspect the DOM in firefox with javascript turned on I have |
So if you use a numeric entity like & # 111111111111 ; (which takes minimum 5 bytes to even represent as hex) or any other illegal unicode code point, the spec says to output U+FFFD? Is that right? I know gumbo does output proper utf-8 encoded values for legal numeric entities. For example: & # x F F F D ; results in the proper utf-8 byte string of 0xEF 0xBF 0xBD in the serialized output. |
The problem is overflow of an int type in src/char_ref.c
Before each iteration for adding the next char digit it needs to check and prevent overflow of the codepoint value (anything greater than 0x10ffff) while still continuing to consume the bad numeric entity until it gets to a non-digit. If you look at 111111111111 as hex (0x19debd01c7) it overflows an int type and the last byte value is the one you are seeing in the output (0xc7). The overflow prevents this snippet of code from working:
The problem is that char_ref.c is generated from char_ref.rl for speed, so once a proper fix is made, it has to go into char_ref.rl and char_ref.c will have to be regenerated. Hope this helps. |
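The arithmetic in this comment can be checked with a small simulation. This is only a sketch: Python integers do not overflow, so the 32-bit wraparound is imitated with a mask, and the function name `wrapped_accumulate` is made up for illustration. The real C code uses an int, where signed overflow is formally undefined, so the wraparound shown here is an assumption about typical behaviour.

```python
def wrapped_accumulate(digits: str, base: int = 10) -> int:
    """Accumulate digits the naive way, wrapping at 32 bits (simulation only)."""
    value = 0
    for ch in digits:
        value = (value * base + int(ch, base)) & 0xFFFFFFFF
    return value


full = int("111111111111")                 # the true value of the reference
wrapped = wrapped_accumulate("111111111111")

print(hex(full))            # the value without any overflow
print(hex(wrapped))         # what survives in 32 bits
print(hex(wrapped & 0xFF))  # the low byte that shows up in the output
```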
FWIW, since 0x10ffff * 16 easily fits inside an int, we do not need to catch int overflow; we just need to catch the first time the value exceeds 0x10ffff but keep parsing until a non-digit. The final snippet (see above) will take care of the rest. So this patch in char_ref.c did the trick for me:
|
That's right. FWIW, that bug looks almost identical to the Gecko bug that led to those tests being written; the fix LGTM.
I'd strongly encourage you to use
If there's a test that expects the script enabled parsing, it should have the |
A simpler patch might be to remove bad_value and simply test if codepoint <= 0x10ffff before scaling the codepoint and adding the digit. Either way once it exceeds 0x10ffff it will stop updating and prevent any overflow. |
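The simpler approach can be sketched in Python to show the control flow. This is not the actual gumbo patch: the function name `parse_numeric_reference` is hypothetical, and the final check covers only the out-of-range, surrogate, and zero cases, leaving out the spec's remapping table for the 0x80 to 0x9F range.

```python
MAX_CODEPOINT = 0x10FFFF
REPLACEMENT = 0xFFFD


def parse_numeric_reference(digits: str, base: int = 10) -> int:
    """Consume every digit, but stop growing the value once it is out of range."""
    codepoint = 0
    for ch in digits:
        digit = int(ch, base)
        # Only scale while still in range, so the value can never exceed
        # 0x10FFFF * 16 + 15, which fits comfortably in a 32-bit int.
        if codepoint <= MAX_CODEPOINT:
            codepoint = codepoint * base + digit
    if codepoint > MAX_CODEPOINT or 0xD800 <= codepoint <= 0xDFFF or codepoint == 0:
        return REPLACEMENT
    return codepoint


print(hex(parse_numeric_reference("111111111111")))
print(hex(parse_numeric_reference("FFFD", base=16)))
print(chr(parse_numeric_reference("111111111111")).encode("utf-8"))
```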
Trying to add a script parameter to html5lib. [SOLVED. Found the .pytest.expect file, https://github.com/gsnedders/pytest-expect] Somehow all tests with script-off are masked with xfail. I've commented out the DataLossWarning try-catcher and multiple tests started to fail, but the script-off ones are still xfail-masked. Even grepping for 'xfail' didn't help; there's no such text in the whole project. Command to run tests ignoring expect plugin: |
Looks like 48 tests are working correctly now. How far did I go in just using gumbo in my project.. |
Regarding a pull request for gumbo-parser: I don't know. I have a version that works well with py3 and html5lib only. At least now I can import and use it, and it can use a system-wide gumbo installation. I guess @kevinhendricks has a good implementation for beautiful soup, not sure whether it's py2 or py3. Mine has many changes, including testing through drop-in replacement of the native html5lib parser and removal of gumboc_tags.py. @nostrademons : what do you require for such a PR? Do you require py2 and the existing importing scheme? |
beautiful soup 4 adapter is python3. |
FWIW, to fix the issue of numeric overflow preventing invalid numeric entities (such as |
I fixed this problem by creating a symlink: ln -s /usr/local/lib/libgumbo.so.1.0.0 /usr/local/lib/python2.7/dist-packages/gumbo-0.10.1-py2.7.egg/gumbo/libgumbo.so |
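As an alternative to symlinking, a binding could search for the shared library at import time. The following is a hedged sketch using only the standard ctypes module. It is not the code shipped in the gumbo Python package, and the function name `load_gumbo`, the directory list, and the file names tried are all assumptions for illustration.

```python
import ctypes
import ctypes.util
import os


def load_gumbo(extra_dirs=("/usr/local/lib", "/usr/lib")):
    """Return a ctypes.CDLL for libgumbo, or None if nothing can be loaded."""
    candidates = []
    # Ask the platform's own lookup first (ldconfig and friends on Linux).
    found = ctypes.util.find_library("gumbo")
    if found:
        candidates.append(found)
    # Then fall back to a few explicit locations (assumed, adjust as needed).
    for directory in extra_dirs:
        for name in ("libgumbo.so", "libgumbo.so.1", "libgumbo.dylib"):
            candidates.append(os.path.join(directory, name))
    for candidate in candidates:
        try:
            return ctypes.CDLL(candidate)
        except OSError:
            continue  # not present or not loadable, try the next one
    return None


lib = load_gumbo()
print("libgumbo loaded" if lib is not None else "libgumbo not found")
```

Returning None rather than raising lets the caller decide how to report a missing library.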
Nick-Alam's solution worked for me on Ubuntu 14.04. pydoc3 gumbo lists the file in the gumbo package as: |
I'm trying to get gumbo to work with python on ubuntu 14.04, and not having much luck.
I built and installed gumbo by cloning the master branch:
And then the python extensions:
On python 3:
I patched the import of gumboc_tags by just copying the contents of that file (it's just a single big array) into gumboc.py, then fixed the library search path issue (I just hardcoded the library path to "/usr/local/lib/libgumbo.so.1.0.0"), and it then imports, but gumbo.soup_parse (which is what I want) doesn't seem to be present:
I also attempted to see if the version in PyPI would work, and it's non-functional after install for python3 (my app is python 3, I tested python 2 just to be thorough).