How would I parse character references as literal bytes and not codepoints? #667
Comments
I may be wrong, but it seems that you should use <element>😃</element> instead. Character references are supposed to refer to Unicode code points directly, not to bytes in some unspecified encoding. A non-normative confirmation of this can be found, for example, here (just the first site from Google): the HTML entity for U+1F603 is
Right, that is what I would do if given the opportunity. Unfortunately, the program that generates these files doesn't do it right, and I'm left trying to parse its output correctly. I'm filing a bug report with them, but it could take a long time to get fixed, if it ever is, and in the meantime I still have to parse their files.
Then it seems that it just writes UTF-8 encoded byte arrays for some characters, and that those byte arrays are encoded as lists of character references. You have to decode the string yourself. Get the raw data using
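A minimal sketch of that manual decoding step, using only the Rust standard library. It assumes the text has already been unescaped the normal way, so each character reference became one `char` in the range U+0000 to U+00FF. The function name `reinterpret_as_utf8` is hypothetical and is not part of any library API.

```rust
/// Treats every `char` in `text` as one raw byte and decodes the resulting
/// byte sequence as UTF-8.
///
/// Returns `None` if any char is above U+00FF (it cannot be a single byte)
/// or if the collected bytes are not valid UTF-8.
/// Assumption: the whole string was produced from byte-valued references.
fn reinterpret_as_utf8(text: &str) -> Option<String> {
    let bytes: Option<Vec<u8>> = text
        .chars()
        // char -> u32 code point -> u8, failing for anything above 0xFF
        .map(|c| u8::try_from(u32::from(c)).ok())
        .collect();
    String::from_utf8(bytes?).ok()
}
```

Note that this only works when the entire string consists of byte-valued characters. Text that mixes genuine code points above U+00FF with such byte sequences would need to be split up first.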
After merging #766 you will be able to resolve character references as you wish (but only in text, not in attribute values).
I have an element like this:
If those characters are literally interpreted, they should be the byte sequence f0 9f 98 83, which should be U+1F603, or 😃. Instead, it expands to c3 b0 c2 9f c2 98 c2 83 (this sequence is not printable, but you may inspect it here).
This is very much how this is meant to work, and I am aware of that. Unfortunately, this decision wasn't made by me, nor is it under my control. So I'd like to know if there's an obvious way to change how escapes are resolved, without having to iterate through the bytes returned by a Text event myself.
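The two byte sequences above can be reproduced with a small standard-library sketch. The helper name `utf8_hex` is made up for illustration. It shows that four separate code points U+00F0, U+009F, U+0098 and U+0083 each take two bytes in UTF-8, whereas the single code point U+1F603 takes four.

```rust
/// Formats the UTF-8 encoding of `text` as space-separated lowercase hex bytes.
fn utf8_hex(text: &str) -> String {
    text.bytes()
        .map(|b| format!("{:02x}", b))
        .collect::<Vec<_>>()
        .join(" ")
}
```

Calling `utf8_hex("\u{f0}\u{9f}\u{98}\u{83}")` yields `c3 b0 c2 9f c2 98 c2 83`, while `utf8_hex("\u{1F603}")` yields `f0 9f 98 83`.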