-
Notifications
You must be signed in to change notification settings - Fork 663
Quadratic time and memory use on very deep trees #393
Comments
Running |
Yes, is_open_element simply walks a vector (ordered) of open elements (effectively a stack of open element nodes) to see if a specific GumboNode is present. To change its O(n) behaviour, you would need to use a much more complex data structure than a gumbo_vector that either keeps an associated hash table of node contents or at the very least keeps a sorted node order variant in parallel to allow binary search which simply increases the memory footprint. Google spent a large amount of time benchmarking this parser against actual code on the web (millions of pages) and very very deep nesting is just not that common in real html usage. In other words the number of open elements at any point in the parsing is typically quite small making linear search over a handful of nodes in a vector the common use case. I remember an early discussion of those performance tuning stats in either one of the pull requests or issues, but I can't remember which. IMO, handling your "crafted" webpages using high levels of nesting, gumbo-parser should never allow for stack overrun but at the same time, these types of pages should never be used to determine typical performance/design tradeoffs. Using a "smarter" data structure to search what might be 3 to 10 elements in the most common case, will simply expand the memory footprint and code complexity without much actual gain in typical use cases. Since memory is not an issue here, neither out-of-memory nor stack overrun (security) are issues. |
I agree with your assessment. The problem is that Gumbo is used in
server-side tools such as HTML sanitizers. For such tools, this is a
denial of service security vulnerability.
One option would be to impose an arbitrary nesting limit, and fail the
parse if the nesting level is too deep.
…On Jul 25, 2017 9:53 AM, "Kevin Hendricks" ***@***.***> wrote:
Yes, is_open_element simply walks a vector (ordered) of open elements
(effectively a stack of open element nodes) to see if a specific GumboNode
is present. To change its O(n) behaviour, you would need to use a much more
complex data structure than a gumbo_vector that either keeps an associated
hash table of node contents or at the very least keeps a sorted node order
variant in parallel to allow binary search which simply increases the
memory footprint.
Google spent a large amount of time benchmarking this parser against
actual code on the web (millions of pages) and very very deep nesting is
just not that common in real html usage. In other words the number of open
elements at any point in the parsing is typically quite small making linear
search over a handful of nodes in a vector the common use case. I remember
an early discussion of those performance tuning stats in either one of the
pull requests or issues, but I can't remember which.
IMO, handling your "crafted" webpages using high levels of nesting,
gumbo-parser should never allow for stack overrun but at the same time,
these types of pages should never be used to determine typical
performance/design tradeoffs. Using a "smarter" data structure to search
what might be 3 to 10 elements in the most common case, will simply expand
the memory footprint and code complexity without much actual gain in
typical use cases. Since memory is not an issue here, neither out-of-memory
nor stack overrun (security) are issues.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#393 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AGGWB4rTCIpLW8k4DtlR1v5jQNqZsnaiks5sRfNggaJpZM4OiCSw>
.
|
Yes a nesting limit makes sense for your use case of a sanitizer. Do you then serialize the sanitized html page and feed it upstream or do you work with the created gumbo node-tree directly? If you serialize, then your non-recurisve tree traversal PR addition would be useful for serialization as if it matches the normal recursive order used by most serializers. If this is also something you are going to address with a PR, please think about making that limit be controllable via the Gumbo Parser Options and work similar to how max_errors works (-1 for unlimited nesting, nesting limit otherwise). |
On further thought, I don’t know if a nesting limit is a viable option here. The problem is this: suppose the nesting limit is exceeded. Now what? Most code that calls Gumbo (that I have observed) doesn’t check the return value. Returning NULL is thus not an option, except perhaps if Gumbo runs out of memory, since it will probably result in a segfault (fortunately not exploitable). So Gumbo must return something no matter what. Why is |
The spec requires keeping a list of all open elements. A Gumbo element can only be on the list if it has a start tag and no end tag (or implied end tag) has happened yet during parsing. This info is used for the routine that walks the list of open elements to try to find an appropriate parent when tag nesting order is mismatched (<i><b>blah</i></b>) ie. used by adoption agency (among others).
The parse tree does not have to abort and return null, it can simply stop adding to the tree when the nesting limit is reached, similar to how it can stop parsing once a set number of errors have happened and return the partial tree as it stands up to the point the nesting limit was reached.
… On Jul 25, 2017, at 5:55 PM, Demi Marie Obenour ***@***.***> wrote:
On further thought, I don’t know if a nesting limit is a viable option here. The problem is this: suppose the nesting limit is exceeded. Now what? Most code that calls Gumbo (that I have observed) doesn’t check the return value. Returning NULL is thus not an option, except perhaps if Gumbo runs out of memory, since it will probably result in a segfault (fortunately not exploitable). So Gumbo must return something no matter what.
Why is is_open_element even a function? It seems to me that this should be a field in GumboElement.
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub, or mute the thread.
|
See option->stop_on_error (or something along those lines) in the main parsing loop near the very end of parser.c and how it stops early after a parsing error has been detected if enabled. Something similar could be easily done to test current nesting limits and aborting further parsing. |
Also is_open_element is a function of the parser state when parsing, not a gumbo node element state (ie. by the time parsing is complete, no elements would be on the is_open_element list and thus not something that shoud be stored inside an element itself. |
That said ... I see now there already is a parser flags field in a GumboNode. You could "extend/abuse" the parser flags enum to add a "currently_in_open_elements" (private) flag bit and set and unset it as needed when adding/popping a node from open_elements without changing the abi. Then you could use it to implement your idea of replacing the O(n) test is_open_element with a quick node parser flag bit test. |
@DemiMarie If running a sanitizer you might also want to look at PR 384 to detect and prevent overflow with long numeric entities: #384 |
During testing on OpenBSD 6.0, the test case with an input of 500 nested div elements actually caused a stack overflow with Lua 5.3 (but not 5.2 or 5.1). Lowering the depth limit to 400 seems to solve the issue. Allowing nesting of 800 levels deep was absurd anyway. Even WebKit's default limit is only 512. Documents with more than a few hundred levels of nesting are either badly authored, buggy or malicious.
The following program takes O(n^2) time (but not memory) for very large arguments. I suspect one needs #392 applied to notice (otherwise, one hits a stack overflow first).
perf
reveals that the time is spent inreconstruct_active_formatting_elements
.The text was updated successfully, but these errors were encountered: