-
Notifications
You must be signed in to change notification settings - Fork 241
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use memchr to search for characters to escape #664
base: master
Are you sure you want to change the base?
Conversation
I tried this a while back with |
Yeah, I actually just posted a PR to jetscii, and found that even for cases that needed multiple memchr calls, memchr seemed to be faster, and based on the fact that the benchmarks were xml-related, came here to try. 😄 |
Yes, I've seen some interesting ideas about using lookup tables and/or avx256 to get really good performance but haven't really investigated further. |
I believe this is quite a bit faster, because rust only has to verify that each string slicing operation starts/ends at character boundaries, none of the inner bytes need to be checked for UTF8-ness since they're already from a `&str` Also, when initially creating the escaped string, preallocate a little extra room, since we know the string will grow.
56841c9
to
dcb2104
Compare
I was also playing with a bitset lookup table in master...Dr-Emann:quick-xml:bitset_escape |
Codecov Report
❗ Your organization needs to install the Codecov GitHub app to enable full functionality. @@ Coverage Diff @@
## master #664 +/- ##
==========================================
+ Coverage 64.63% 64.68% +0.05%
==========================================
Files 36 37 +1
Lines 17289 17618 +329
==========================================
+ Hits 11175 11397 +222
- Misses 6114 6221 +107
Flags with carried forward coverage won't be shown. Click here to find out more.
|
How did that go? Do you have results for that? |
It looks like using a bitmap is better than master, but this PR is better than both nearly across the board, at least on my M1 mac. Note, the results for this PR are a little better than what is in the original PR description, thanks to dcb2104. master vs bitmap on M1 mac
bitmap vs this PR, M1 mac
Combined results
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some benchmarks shows regression about 2% in speed on two of my machines (Ubuntu 22.04 and Windows 7), mostly *_no_copy
benchmarks, but other shows improvements about 2-7%. In summary, there are more improvements than regressions, so I think we can merge this.
pub(crate) struct MergeIter<It1, It2> | ||
where | ||
It1: Iterator, | ||
It2: Iterator<Item = It1::Item>, | ||
{ | ||
it1: Peekable<It1>, | ||
it2: Peekable<It2>, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be great if you add several tests to check that iterator work as expected
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some unit tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! @dralley , feel free to merge when you ready.
e6889ff
to
c6ec23a
Compare
My results are quite a bit more mixed, on an R5 3600 CPU. I'm actually seeing a few more whole-document regressions than improvements (but mostly it's just a wash). Generally speaking I'm seeing more of a penalty than you are on short strings and less of a benefit on long strings. I'm going to keep looking at this tomorrow, do a bit more testing, and paste my results. |
Note, things might get even muddier: There are likely improvements to |
Great to hear! Thanks for investigating this I'll try out my PR with your jetscii changes tomorrow and see if it changes things |
R5 3600
i7-8665U
This is a bit of a complex decision. I don't think jetscii does anything on ARM, so the mergeiter approach is probably still a winner there. And in many cases it's the winner against jetscii on x86, but it might be completely document dependent. Short strings with no escaping is a very very common case in a lot of documents and maybe enough so that jetscii would come out ahead. mergeiter actually regresses from baseline there on x86, possibly by enough to end up worse overall? We should probably test this somehow, but the macrobenchmarks that currently exist, aren't doing any "escaping" at all, just unescaping. I assume that is why those benchmarks show practically zero difference. We should probably create a benchmark that parses those documents into an event stream (outside the benchmark) and then writes them back to a buffer, in such a way that the escaping / construction cost is captured. |
Some ideas:
I think that's probably all I've got. Daniel Lemire's blog might be worth checking out. |
Per suggestion from @BurntSushi [here](tafia/quick-xml#664 (comment)) On my M1, tt appears to be slower but competitive with memchr up to memchr3, then start being the from 5-16
The Teddy results are pretty promising looking in the jetscii benchmarks for me (faster than merging memchrs), although it does seem a fair bit slower for small haystacks, so we might need to do one algorithm for a small haystacks (or CPUs teddy doesn't support), and use teddy for longer ones |
Use memchr iterators to search for characters to escape.
In some (most) cases, we need to combine multiple memchr searches (since memchr allows searching for up to 3 chars at the same time), so introduce a
MergeIter
type, which takes two iterators, and combines them in order.This appears to be better for performance in almost all cases,
except it's very slightly slower for escaping small strings with escapes presentthanks to dcb2104, it's actually faster even in that case. However, it's MUCH faster when there are no characters to escape, or when escaping a long string.Only results shown with >1% change, reported by
critcmp
Results for M1 Pro Macbook
Results for x64 i5-6600K Windows