Limit number of attributes per element to prevent DoS #143
Conversation
Hi, thanks for this PR. Like the max tree depth option, I think this is a fine thing to add. Looking at the code, checking for duplicate attribute names is clearly O(n^2) in the number of names. (This could be amortized constant time using a hash set. It may be worth measuring how much of a slowdown that is for the normal case at some point in the future; I'll make a note about it.) I think hitting this limit should be reported to the user in some fashion, as silently dropping attributes seems bad. For hitting the tree depth limit, we raise an argument error. Should we do the same thing here or something else? |
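For intuition, here is a minimal Ruby sketch of the two duplicate-detection strategies being contrasted; the real check lives in the C parser, so everything below is illustrative only:

```ruby
require "set"
require "benchmark"

names = (1..5_000).map { |i| "attr#{i}" }

Benchmark.bm(10) do |bm|
  # Linear scan per insertion: O(n) each time, O(n^2) overall.
  bm.report("array") do
    seen = []
    names.each { |name| seen << name unless seen.include?(name) }
  end

  # Hash set: amortized O(1) per insertion, O(n) overall.
  bm.report("hash set") do
    seen = Set.new
    names.each { |name| seen.add(name) }
  end
end
```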
Thanks for the follow-up. Now that you mention it, it’s true that silently ignoring attributes vs. raising for exceeded tree depth isn’t consistent. In our application code base, we actually rescue the `ArgumentError` raised when the tree depth limit is exceeded and fall back to plain Nokogiri. To be consistent, I can tweak this new code to raise an `ArgumentError` too. That said, I’m curious to see whether the O(n^2) issue you noticed will make my PR here irrelevant. Let me know how you see it. |
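A sketch of that rescue-and-fallback pattern (method names assumed; Nokogumbo exposes the gumbo parser through `Nokogiri::HTML5`):

```ruby
require "nokogumbo"

# Parse with Nokogumbo, falling back to Nokogiri's libxml2-based parser
# when one of Nokogumbo's limits (e.g. max tree depth) raises.
def parse_html(html)
  Nokogiri::HTML5.parse(html)
rescue ArgumentError
  Nokogiri::HTML.parse(html)
end
```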
I just tested the script with plain Nokogiri and it suffers the same issue, even worse:
So falling back to Nokogiri is not an option here. It would be really ideal to silently ignore attributes past X in our app’s case, and I wonder what a proper design for that would be: some opt-in ignore mode, perhaps. Or perhaps if we just heavily increase the default limit, raising would be acceptable. Here’s what I get with the patch applied:
|
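A sketch of the heavily-increase-the-limit idea using the option this PR adds (the exact value to pick is the open question; `10_000` below is arbitrary):

```ruby
require "nokogumbo"

html = "<div #{(1..1_000).map { |i| "data-a#{i}" }.join(" ")}></div>"

# Raising the limit well above the default trades some DoS protection
# for tolerance of attribute-heavy documents.
doc = Nokogiri::HTML5.parse(html, max_attributes: 10_000)
puts doc.at("div").attributes.size
```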
Having to fall back to Nokogiri seems like a Nokogumbo design flaw. I can't recall what the maximum tree depth was designed to prevent from happening. Maybe there's some other superlinear behavior I'm not remembering. Without committing to any particular API, I could envision several behaviors being reasonable and desirable in some situation: raising an error when the limit is hit, silently ignoring attributes past the limit, or ignoring them while signaling to the user that some were dropped.

I think any of them other than raising need to be opt-in. And depending on the signaling mechanism, the last two could be the same thing. E.g., after opting in, the user would have to check some status. Would any of those options work for you?

Just as an aside since computer security is my focus, I can't help but think of an attack where a maximum attribute count could be abused to show different content to your users vs. other software without such a limit. If the attribute limit is n, an attacker can put n throwaway attributes on an element, followed by one that suppresses content (say, `hidden`).
The conditional content would be displayed with an attribute limit but would not be displayed without it. |
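A sketch of that payload shape, assuming a hypothetical mode that silently drops attributes past the limit (the behavior ultimately merged raises instead):

```ruby
# n throwaway attributes followed by one attribute that changes rendering.
n = 400
filler = (1..n).map { |i| "data-x#{i}" }.join(" ")
payload = %(<div #{filler} hidden>conditional content</div>)

# A parser that drops attributes past n loses `hidden`, so the content is
# shown to your users; software without the limit keeps `hidden` and
# suppresses the content.
puts payload
```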
Of course. Everything you said makes sense. Some of those variants may not be crucial though, and after some thinking, raising an error alone would be good enough in our app’s context. I will change the current PR to raise an `ArgumentError`. |
TL;DR: The patch seems reasonable to me (although I haven't tested). @rubys, thoughts?

I was so sure that the O(n^2) behavior around checking for duplicate attributes was the problem that I decided to fix the issue. I realized that if you're concerned about malicious input, then a hash set is unlikely to be helpful, because it's quite tricky to produce one with guaranteed worst-case behavior (an attacker can try to force collisions). So I implemented a simple red-black tree to use as an attribute cache while parsing, and ran your script expecting significantly better behavior.
So obviously that's not the bottleneck. I ran the script under a time profiler.
But that's not helpful on its own, because the hot functions are inside libxml2 rather than in Nokogumbo's code.
The profile output is in a slightly different format, but it points at the same place. So let's take a look at the code:

```c
/*
 * Add it at the end to preserve parsing order ...
 */
if (node != NULL) {
    if (node->properties == NULL) {
        node->properties = cur;
    } else {
        xmlAttrPtr prev = node->properties;

        while (prev->next != NULL)
            prev = prev->next;
        prev->next = cur;
        cur->prev = prev;
    }
}
```

And there it is. That `while` loop walks to the end of the element's attribute list on every insertion, so appending n attributes is O(n^2) inside libxml2 itself.
Finally, just to compare with the current implementation (without your patch), here are the top functions. Timing script:
Top functions in the time profile.
In contrast, with the red-black tree, the duplicate check no longer dominates the profile. Of course, the overall speed-up is about 2x, which seems reasonable if I've removed one of two O(n^2) loops. Unfortunately, I cannot do anything about the other loop, since it lives in libxml2. |
Thanks a lot for this awesome investigation. If you guys want to go forward and merge this, let me know what your common practice is regarding README updates (documenting master or waiting for a new release); I could update it as well as the changelog in this PR. |
Please update the README.md as well to document the new option, and add a brief mention to the CHANGELOG.md in the [Unreleased] section under Added. Something like:

- Added option `:max_attributes` to control the maximum number of supported attributes.
I took a look, and other than a few errant spaces and an update to the README and the CHANGELOG, this looks good. I'm not sure we have a policy for when the README gets updated, so we may as well do it now.

One thing that might be unclear: should the limit be on the number of unique attributes or on the number of attributes to parse? Duplicate attributes should be fast to parse when there are only a small number of unique attributes. I think the current implementation is almost a limit on the number of unique attributes. The one edge case is if the limit is n, there are n unique attributes, and the next one to be parsed is a duplicate: then this will raise an error. I'm not sure I really care about that edge case, but it could probably be made consistent by deferring the check on the count of attributes until after the code checks if it's a duplicate, as sketched below. |
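A small Ruby sketch of the two check orderings under discussion (the real logic is C inside the parser; these names are illustrative only):

```ruby
MAX_ATTRIBUTES = 400

# Current ordering: count check first. With n unique attributes already
# stored, a duplicate arriving as the (n+1)-th parsed attribute raises.
def add_attribute(attrs, name, value)
  raise ArgumentError, "limit exceeded" if attrs.size >= MAX_ATTRIBUTES
  return if attrs.key?(name) # HTML5: later duplicates are ignored
  attrs[name] = value
end

# Alternative ordering: duplicate check first, so the limit counts only
# unique attributes and the edge case above disappears.
def add_unique_attribute(attrs, name, value)
  return if attrs.key?(name)
  raise ArgumentError, "limit exceeded" if attrs.size >= MAX_ATTRIBUTES
  attrs[name] = value
end
```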
Sorry for the nasty spaces. I updated the README and CHANGELOG as well.

Regarding the duplicate attribute edge case, here’s my take. Although Nokogumbo may be used in scenarios where the HTML payload is always trusted (HTML validation, etc.), when we’re talking about 400+ attributes on a given element (with or without duplicates), we are much more likely to be facing an attack than an error. And if we face an attack, we likely have not just a few hundred attributes but tens of thousands of them (like what we get when parsing spam emails received by our users, the very case which led me to investigate and fix this). In such cases I want to save as much CPU as possible, which I believe is better achieved by avoiding the loop over the attributes we have already stored. I’m saying this with the eventual option in mind that we could introduce to ignore attributes and continue parsing instead of raising; when raising, my argument isn’t relevant because you’ll only go through the duplicate check once before the error. You’re the boss though; I can make it like you suggest if that makes more sense to you. :) |
What you say makes sense. There's no legitimate use case for 400 attributes. I was thinking more about when users pick a smaller limit. But like I said, I don't really care about that edge case so I'm happy to go with what you have. I'm not sure what's going on with the CI there, but it's not related to this PR. Thanks so much for working on this! |
And let me say I especially appreciate the script demonstrating the problem. It was really helpful for tracking down what was happening in libxml2. |
Thanks so much for merging! A pleasure working with you. :) |
FWIW, this was done in one of the commits you pulled from lua-gumbo, in order to "fix" google/gumbo-parser#393. |
Was forgotten in rubys#143 README changes.
Prior to this commit, the time taken to parse elements with an unreasonable number of attributes grows non-linearly and can lead to excessive memory usage, process hangs, and crashes.
As a consumer of unsafe HTML, you must enforce some limits on payloads prior to parsing. One easy example is truncating the input string to some maximum length. However, in the case of an extreme number of attributes, there’s no practical way to protect yourself while preserving a decent parsed HTML output (the HTML isn’t that long by itself). Thus, I believe Nokogumbo should internally enforce a limit after which the attributes of a given element are ignored.
Here’s a script to illustrate the current non-linear work time:
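The original script isn’t reproduced here; a minimal sketch of the same idea might look like this (attribute counts arbitrary, run against a build without the limit):

```ruby
require "nokogumbo"
require "benchmark"

# Parse a single element carrying n unique attributes and report the time.
# If parsing is quadratic, doubling n should roughly quadruple the time.
[1_000, 2_000, 4_000, 8_000].each do |n|
  attrs = (1..n).map { |i| "data-a#{i}" }.join(" ")
  time = Benchmark.realtime { Nokogiri::HTML5.parse("<div #{attrs}></div>") }
  puts format("%6d attributes: %.3fs", n, time)
end
```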
With the proposed fix, that script becomes:
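Presumably something along these lines, with the parse now failing fast once the limit is exceeded (a sketch; the modified script isn’t shown here):

```ruby
require "nokogumbo"

n = 1_000 # well above the proposed default limit of 400
attrs = (1..n).map { |i| "data-a#{i}" }.join(" ")

begin
  Nokogiri::HTML5.parse("<div #{attrs}></div>")
rescue ArgumentError => e
  # The parser now raises immediately instead of grinding through
  # quadratic work on thousands of attributes.
  puts e.message
end
```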
To determine what the value of `Nokogumbo::DEFAULT_MAX_ATTRIBUTES` should be, I looked at lists of attributes to get an idea of how many attributes you could technically use on a given element:

- https://github.com/wooorm/html-element-attributes/blob/master/index.json
- https://github.com/wooorm/svg-element-attributes/blob/master/index.json
- https://github.com/wooorm/aria-attributes/blob/master/index.json
(script used)
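The linked script isn’t included in this thread; a rough Ruby equivalent that tallies the combined attribute names could look like the following (the raw URLs and JSON shapes are assumptions based on those repositories):

```ruby
require "json"
require "open-uri"
require "set"

URLS = %w[
  https://raw.githubusercontent.com/wooorm/html-element-attributes/master/index.json
  https://raw.githubusercontent.com/wooorm/svg-element-attributes/master/index.json
  https://raw.githubusercontent.com/wooorm/aria-attributes/master/index.json
]

names = Set.new
URLS.each do |url|
  data = JSON.parse(URI.open(url).read)
  if data.is_a?(Hash)
    # The html/svg lists map element names to arrays of attribute names.
    data.each_value { |attrs| names.merge(attrs) }
  else
    # The aria list is a flat array of attribute names.
    names.merge(data)
  end
end

puts "combined unique attribute names: #{names.size}"
```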
I believe all of these can be combined on the same SVG element. This obviously doesn’t consider `data-*` attributes, but those are unlimited in theory, so the value we pick will be arbitrary no matter what. I ended up using `400`, just because we have `Nokogumbo::DEFAULT_MAX_TREE_DEPTH = 400` already; the same number seemed to make some sense.

I assumed we’d want an argument to customize that limit. I went with `:max_attributes`, although `:max_attributes_per_element` could be clearer; I don’t think the added length is desirable or necessary.