Correctly detect encoding even without BOM #465
Merged
+1,275
−258
11 commits
9b9e5fc - Use pretty_assertions for encoding tests (Mingun)
82572f1 - Use pretty_assertions when compare events in reader tests (Mingun)
a1b840e - Move fuzzing tests from encoding to a dedicated file (Mingun)
d2e5a6e - Specify required features for a test (Mingun)
f82e325 - Remove unused `decode_with_bom_removal` method and free function (Mingun)
ad77e3f - Merge Decoder methods to avoid wrong remark about necessarily of `enc… (Mingun)
bf2a360 - Remove excess test (Mingun)
41c36b5 - Move documents to test encodings to a sub-folder (Mingun)
813dd20 - Add tests for encoding detection (Mingun)
ba46694 - Correctly detect UTF-16 encoding even without BOM (Mingun)
6f303c6 - Add warning about unsupported encodings (Mingun)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,6 +1,9 @@ | ||
# Unit tests assume that all xml files have unix style line endings | ||
/tests/documents/* text eol=lf | ||
/tests/documents/encoding/* text eol=lf | ||
|
||
/tests/documents/utf16be.xml binary | ||
/tests/documents/utf16le.xml binary | ||
/tests/documents/encoding/utf16be.xml binary | ||
/tests/documents/encoding/utf16le.xml binary | ||
/tests/documents/encoding/utf16be-bom.xml binary | ||
/tests/documents/encoding/utf16le-bom.xml binary | ||
/tests/documents/sample_5_utf16bom.xml binary |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
[submodule "encoding"] | ||
path = test-gen/encoding | ||
url = https://github.com/whatwg/encoding.git | ||
shallow = true |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -28,7 +28,7 @@ pub(crate) const UTF16_BE_BOM: &[u8] = &[0xFE, 0xFF]; | |
/// key is not defined or contains unknown encoding. | ||
/// | ||
/// The library supports any UTF-8 compatible encodings that crate `encoding_rs` | ||
/// is supported. [*UTF-16 is not supported at the present*][utf16]. | ||
/// is supported. [*UTF-16 and ISO-2022-JP are not supported at the present*][utf16]. | ||
/// | ||
/// If feature `encoding` is disabled, the decoder is always UTF-8 decoder: | ||
/// any XML declarations are ignored. | ||
|
@@ -54,66 +54,38 @@ impl Decoder { | |
} | ||
} | ||
|
||
#[cfg(not(feature = "encoding"))] | ||
impl Decoder { | ||
/// Decodes a UTF8 slice regardless of XML declaration and ignoring BOM if | ||
/// it is present in the `bytes`. | ||
/// | ||
/// Returns an error in case of malformed sequences in the `bytes`. | ||
/// | ||
/// If you instead want to use XML declared encoding, use the `encoding` feature | ||
#[inline] | ||
pub fn decode<'b>(&self, bytes: &'b [u8]) -> Result<Cow<'b, str>> { | ||
Ok(Cow::Borrowed(std::str::from_utf8(bytes)?)) | ||
} | ||
|
||
/// Decodes a slice regardless of XML declaration with BOM removal if | ||
/// it is present in the `bytes`. | ||
/// | ||
/// Returns an error in case of malformed sequences in the `bytes`. | ||
/// | ||
/// If you instead want to use XML declared encoding, use the `encoding` feature | ||
pub fn decode_with_bom_removal<'b>(&self, bytes: &'b [u8]) -> Result<Cow<'b, str>> { | ||
let bytes = if bytes.starts_with(UTF8_BOM) { | ||
&bytes[3..] | ||
} else { | ||
bytes | ||
}; | ||
self.decode(bytes) | ||
} | ||
} | ||
|
||
#[cfg(feature = "encoding")] | ||
impl Decoder { | ||
/// Returns the `Reader`s encoding. | ||
/// | ||
/// This encoding will be used by [`decode`]. | ||
/// | ||
/// [`decode`]: Self::decode | ||
#[cfg(feature = "encoding")] | ||
pub const fn encoding(&self) -> &'static Encoding { | ||
self.encoding | ||
} | ||
|
||
/// ## Without `encoding` feature | ||
/// | ||
/// Decodes an UTF-8 slice regardless of XML declaration and ignoring BOM | ||
/// if it is present in the `bytes`. | ||
/// | ||
/// ## With `encoding` feature | ||
/// | ||
/// Decodes specified bytes using encoding, declared in the XML, if it was | ||
/// declared there, or UTF-8 otherwise, and ignoring BOM if it is present | ||
/// in the `bytes`. | ||
/// | ||
/// ---- | ||
/// Returns an error in case of malformed sequences in the `bytes`. | ||
pub fn decode<'b>(&self, bytes: &'b [u8]) -> Result<Cow<'b, str>> { | ||
decode(bytes, self.encoding) | ||
} | ||
#[cfg(not(feature = "encoding"))] | ||
let decoded = Ok(Cow::Borrowed(std::str::from_utf8(bytes)?)); | ||
|
||
/// Decodes a slice with BOM removal if it is present in the `bytes` using | ||
/// the reader encoding. | ||
/// | ||
/// If this method called after reading XML declaration with the `"encoding"` | ||
/// key, then this encoding is used, otherwise UTF-8 is used. | ||
/// | ||
/// If XML declaration is absent in the XML, UTF-8 is used. | ||
/// | ||
/// Returns an error in case of malformed sequences in the `bytes`. | ||
pub fn decode_with_bom_removal<'b>(&self, bytes: &'b [u8]) -> Result<Cow<'b, str>> { | ||
self.decode(remove_bom(bytes, self.encoding)) | ||
#[cfg(feature = "encoding")] | ||
let decoded = decode(bytes, self.encoding); | ||
|
||
decoded | ||
} | ||
} | ||
|
||
|
@@ -127,43 +99,14 @@ pub fn decode<'b>(bytes: &'b [u8], encoding: &'static Encoding) -> Result<Cow<'b | |
.ok_or(Error::NonDecodable(None)) | ||
} | ||
|
||
/// Decodes a slice with an unknown encoding, removing the BOM if it is present | ||
/// in the bytes. | ||
/// | ||
/// Returns an error in case of malformed or non-representable sequences in the `bytes`. | ||
#[cfg(feature = "encoding")] | ||
pub fn decode_with_bom_removal<'b>(bytes: &'b [u8]) -> Result<Cow<'b, str>> { | ||
if let Some(encoding) = detect_encoding(bytes) { | ||
let bytes = remove_bom(bytes, encoding); | ||
decode(bytes, encoding) | ||
} else { | ||
decode(bytes, UTF_8) | ||
} | ||
} | ||
|
||
#[cfg(feature = "encoding")] | ||
fn split_at_bom<'b>(bytes: &'b [u8], encoding: &'static Encoding) -> (&'b [u8], &'b [u8]) { | ||
if encoding == UTF_8 && bytes.starts_with(UTF8_BOM) { | ||
bytes.split_at(3) | ||
} else if encoding == UTF_16LE && bytes.starts_with(UTF16_LE_BOM) { | ||
bytes.split_at(2) | ||
} else if encoding == UTF_16BE && bytes.starts_with(UTF16_BE_BOM) { | ||
bytes.split_at(2) | ||
} else { | ||
(&[], bytes) | ||
} | ||
} | ||
|
||
#[cfg(feature = "encoding")] | ||
pub(crate) fn remove_bom<'b>(bytes: &'b [u8], encoding: &'static Encoding) -> &'b [u8] { | ||
let (_, bytes) = split_at_bom(bytes, encoding); | ||
bytes | ||
} | ||
|
||
/// Automatic encoding detection of XML files using the | ||
/// [recommended algorithm](https://www.w3.org/TR/xml11/#sec-guessing). | ||
/// | ||
/// If encoding is detected, `Some` is returned, otherwise `None` is returned. | ||
/// If encoding is detected, `Some` is returned with an encoding and size of BOM | ||
/// in bytes, if detection was performed using BOM, or zero, if detection was | ||
/// performed without BOM. | ||
/// | ||
/// If the encoding was not recognized, `None` is returned. | ||
/// | ||
/// Because the [`encoding_rs`] crate supports only subset of those encodings, only | ||
/// the supported subset are detected, which is UTF-8, UTF-16 BE and UTF-16 LE. | ||
|
@@ -173,25 +116,26 @@ pub(crate) fn remove_bom<'b>(bytes: &'b [u8], encoding: &'static Encoding) -> &' | |
/// | ||
/// | Bytes |Detected encoding | ||
/// |-------------|------------------------------------------ | ||
/// |`FE FF ## ##`|UTF-16, big-endian | ||
/// | **BOM** | ||
/// |`FE_FF_##_##`|UTF-16, big-endian | ||
Underscores were added because otherwise the long text in the lower cells shrinks the first column and causes the text to wrap, which does not look good. |
||
/// |`FF FE ## ##`|UTF-16, little-endian | ||
/// |`EF BB BF` |UTF-8 | ||
/// |-------------|------------------------------------------ | ||
/// | **No BOM** | ||
/// |`00 3C 00 3F`|UTF-16 BE or ISO-10646-UCS-2 BE or similar 16-bit BE (use declared encoding to find the exact one) | ||
/// |`3C 00 3F 00`|UTF-16 LE or ISO-10646-UCS-2 LE or similar 16-bit LE (use declared encoding to find the exact one) | ||
/// |`3C 3F 78 6D`|UTF-8, ISO 646, ASCII, some part of ISO 8859, Shift-JIS, EUC, or any other 7-bit, 8-bit, or mixed-width encoding which ensures that the characters of ASCII have their normal positions, width, and values; the actual encoding declaration must be read to detect which of these applies, but since all of these encodings use the same bit patterns for the relevant ASCII characters, the encoding declaration itself may be read reliably | ||
#[cfg(feature = "encoding")] | ||
pub fn detect_encoding(bytes: &[u8]) -> Option<&'static Encoding> { | ||
pub fn detect_encoding(bytes: &[u8]) -> Option<(&'static Encoding, usize)> { | ||
match bytes { | ||
// with BOM | ||
_ if bytes.starts_with(UTF16_BE_BOM) => Some(UTF_16BE), | ||
_ if bytes.starts_with(UTF16_LE_BOM) => Some(UTF_16LE), | ||
_ if bytes.starts_with(UTF8_BOM) => Some(UTF_8), | ||
_ if bytes.starts_with(UTF16_BE_BOM) => Some((UTF_16BE, 2)), | ||
_ if bytes.starts_with(UTF16_LE_BOM) => Some((UTF_16LE, 2)), | ||
_ if bytes.starts_with(UTF8_BOM) => Some((UTF_8, 3)), | ||
|
||
// without BOM | ||
_ if bytes.starts_with(&[0x00, b'<', 0x00, b'?']) => Some(UTF_16BE), // Some BE encoding, for example, UTF-16 or ISO-10646-UCS-2 | ||
_ if bytes.starts_with(&[b'<', 0x00, b'?', 0x00]) => Some(UTF_16LE), // Some LE encoding, for example, UTF-16 or ISO-10646-UCS-2 | ||
_ if bytes.starts_with(&[b'<', b'?', b'x', b'm']) => Some(UTF_8), // Some ASCII compatible | ||
_ if bytes.starts_with(&[0x00, b'<', 0x00, b'?']) => Some((UTF_16BE, 0)), // Some BE encoding, for example, UTF-16 or ISO-10646-UCS-2 | ||
_ if bytes.starts_with(&[b'<', 0x00, b'?', 0x00]) => Some((UTF_16LE, 0)), // Some LE encoding, for example, UTF-16 or ISO-10646-UCS-2 | ||
_ if bytes.starts_with(&[b'<', b'?', b'x', b'm']) => Some((UTF_8, 0)), // Some ASCII compatible | ||
|
||
_ => None, | ||
} | ||
|
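The new return shape of `detect_encoding` (an encoding plus the BOM size in bytes) can be illustrated with a small standalone sketch. This is not the crate's code: the `Detected` enum is a hypothetical stand-in for `&'static Encoding` from `encoding_rs`, and `detect` and `skip_bom` are illustrative names, chosen so that the example compiles with the standard library only.

```rust
/// Hypothetical stand-in for `&'static encoding_rs::Encoding`,
/// limited to the three encodings named in the diff above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Detected {
    Utf8,
    Utf16Be,
    Utf16Le,
}

/// Returns the detected encoding and the BOM length in bytes.
/// The length is zero when detection did not rely on a BOM.
pub fn detect(bytes: &[u8]) -> Option<(Detected, usize)> {
    match bytes {
        // With BOM
        [0xFE, 0xFF, ..] => Some((Detected::Utf16Be, 2)),
        [0xFF, 0xFE, ..] => Some((Detected::Utf16Le, 2)),
        [0xEF, 0xBB, 0xBF, ..] => Some((Detected::Utf8, 3)),

        // Without BOM: inspect how the start of `<?xml` is laid out
        [0x00, b'<', 0x00, b'?', ..] => Some((Detected::Utf16Be, 0)),
        [b'<', 0x00, b'?', 0x00, ..] => Some((Detected::Utf16Le, 0)),
        [b'<', b'?', b'x', b'm', ..] => Some((Detected::Utf8, 0)),

        _ => None,
    }
}

/// Returns the detected encoding (if any) and the input with the BOM skipped.
pub fn skip_bom(bytes: &[u8]) -> (Option<Detected>, &[u8]) {
    match detect(bytes) {
        Some((encoding, bom_len)) => (Some(encoding), &bytes[bom_len..]),
        None => (None, bytes),
    }
}
```

Returning the BOM length together with the encoding lets a caller skip the BOM without a second round of `starts_with` checks, which is presumably why the separate `remove_bom` helper could be dropped in this change.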
The suggestion makes sense in isolation, but I don't know that we want to contribute to even more churn than necessary? It's likely to either break or be unnecessary in the very next release.
Yes, I just want to explicitly warn users that UTF-16 and ISO-2022-JP are not supported for now. For example, parsing documents with some Chinese characters that are represented as `[ASCII byte, some byte]` in UTF-16BE or `[some byte, ASCII byte]` in UTF-16LE can confuse the parser. That can be avoided if you stop processing such documents at the very beginning.
I want to make this change because I want to cut a release this weekend, and it is still unclear when the correct solution will be ready, so it is best to avoid using problematic encodings for now.
I hope that this note will be removed in the next release. It should not break anything: once correct support is implemented, users will update and remove their guard code.
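A minimal sketch of the problem and of the kind of guard code mentioned above, using only the standard library. The helper names `utf16be_bytes` and `looks_like_utf16` are hypothetical and not part of the crate; the guard is only a heuristic based on the byte patterns from the detection table in this PR, so it will miss UTF-16 documents that have neither a BOM nor an XML declaration.

```rust
/// Encodes `text` as UTF-16 big-endian bytes, without a BOM.
pub fn utf16be_bytes(text: &str) -> Vec<u8> {
    text.encode_utf16()
        .flat_map(|unit| unit.to_be_bytes())
        .collect()
}

/// Hypothetical guard: returns `true` when the input starts like a UTF-16
/// document, either with a BOM or with a `<?` encoded in 16-bit units.
pub fn looks_like_utf16(bytes: &[u8]) -> bool {
    bytes.starts_with(&[0xFE, 0xFF])
        || bytes.starts_with(&[0xFF, 0xFE])
        || bytes.starts_with(&[0x00, b'<', 0x00, b'?'])
        || bytes.starts_with(&[b'<', 0x00, b'?', 0x00])
}
```

For instance, the CJK character U+3C00 encodes to the bytes `3C 00` in UTF-16BE, and `3C` is the ASCII code of `<`, so a parser that scans raw bytes for markup would see a tag opening in the middle of text. Calling `looks_like_utf16` on the first bytes of the input and refusing to continue when it returns `true` is one way to stop before that happens.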
I added a remark that the restriction is temporary and will be lifted once #158 is fixed.
Thanks. I do think finishing #158 this weekend is plausible but I won't promise it.
Well, an intermediate "good enough to release" stage of it, in any case.
Or it would be, except for async. Keep forgetting about that...