Skip to content
This repository has been archived by the owner on Feb 15, 2023. It is now read-only.

Quadratic time and memory use on very deep trees #393

Closed
DemiMarie opened this issue Jul 25, 2017 · 10 comments
Closed

Quadratic time and memory use on very deep trees #393

DemiMarie opened this issue Jul 25, 2017 · 10 comments

Comments

@DemiMarie
Copy link

The following program takes O(n^2) time (but not memory) for very large arguments. I suspect one needs #392 applied to notice (otherwise, one hits a stack overflow first).

#include <ctype.h>
#include <gumbo.h>
#include <stdint.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv) {
  if (argc != 2) {
    fputs("Must have exactly one argument: depth of tree to make\n", stderr);
    return 1;
  }
  char *endp;
  unsigned long long q = strtoull(argv[1], &endp, 0);
  const size_t maxarg = (SIZE_MAX - 9) / sizeof "<i>";
  if (q > maxarg) {
    fprintf(stderr, "Argument too large!  Must be at or below %zu\n", maxarg);
    exit(1);
  }
  if (*endp) {
    fprintf(stderr, "Trailing junk at end of argument!\n");
    exit(1);
  }
  char *buf = malloc(q * (2*sizeof "<i>" - 1) + sizeof "<i>\n");
  if (!buf) {
    fputs("Out of mem allocating buffer!\n", stderr);
    return 1;
  }
  char *bufptr = buf;
  memcpy(bufptr, "<i>\n", 4);
  bufptr += 4;
  for (size_t i = 0; i < q; ++i) {
    memcpy(bufptr, "<i>", 3);
    bufptr += 3;
  }
  for (size_t i = 0; i < q; ++i) {
    memcpy(bufptr, "</i>", 3);
    bufptr += 4;
  }
  GumboOptions options;
  memcpy(&options, &kGumboDefaultOptions, sizeof options);
  options.max_errors = 50;
  GumboOutput *output = gumbo_parse_with_options(&options, buf, strlen(buf));
  if (!output) {
    fputs("Out of mem while parsing!\n", stderr);
    free(buf);
    return 1;
  }
  gumbo_destroy_output(&kGumboDefaultOptions, output);
  free(buf);
  return 0;
}

perf reveals that the time is spent in reconstruct_active_formatting_elements.

@DemiMarie
Copy link
Author

Running perf on a debug build reveals that the problem is that is_open_element is O(n) and called O(n) times.

@kevinhendricks
Copy link
Contributor

Yes, is_open_element simply walks a vector (ordered) of open elements (effectively a stack of open element nodes) to see if a specific GumboNode is present. To change its O(n) behaviour, you would need to use a much more complex data structure than a gumbo_vector that either keeps an associated hash table of node contents or at the very least keeps a sorted node order variant in parallel to allow binary search which simply increases the memory footprint.

Google spent a large amount of time benchmarking this parser against actual code on the web (millions of pages) and very very deep nesting is just not that common in real html usage. In other words the number of open elements at any point in the parsing is typically quite small making linear search over a handful of nodes in a vector the common use case. I remember an early discussion of those performance tuning stats in either one of the pull requests or issues, but I can't remember which.

IMO, handling your "crafted" webpages using high levels of nesting, gumbo-parser should never allow for stack overrun but at the same time, these types of pages should never be used to determine typical performance/design tradeoffs. Using a "smarter" data structure to search what might be 3 to 10 elements in the most common case, will simply expand the memory footprint and code complexity without much actual gain in typical use cases. Since memory is not an issue here, neither out-of-memory nor stack overrun (security) are issues.

@DemiMarie
Copy link
Author

DemiMarie commented Jul 25, 2017 via email

@kevinhendricks
Copy link
Contributor

Yes a nesting limit makes sense for your use case of a sanitizer. Do you then serialize the sanitized html page and feed it upstream or do you work with the created gumbo node-tree directly? If you serialize, then your non-recurisve tree traversal PR addition would be useful for serialization as if it matches the normal recursive order used by most serializers.

If this is also something you are going to address with a PR, please think about making that limit be controllable via the Gumbo Parser Options and work similar to how max_errors works (-1 for unlimited nesting, nesting limit otherwise).

@DemiMarie
Copy link
Author

On further thought, I don’t know if a nesting limit is a viable option here. The problem is this: suppose the nesting limit is exceeded. Now what? Most code that calls Gumbo (that I have observed) doesn’t check the return value. Returning NULL is thus not an option, except perhaps if Gumbo runs out of memory, since it will probably result in a segfault (fortunately not exploitable). So Gumbo must return something no matter what.

Why is is_open_element even a function? It seems to me that this should be a field in GumboElement.

@kevinhendricks
Copy link
Contributor

kevinhendricks commented Jul 25, 2017 via email

@kevinhendricks
Copy link
Contributor

See option->stop_on_error (or something along those lines) in the main parsing loop near the very end of parser.c and how it stops early after a parsing error has been detected if enabled. Something similar could be easily done to test current nesting limits and aborting further parsing.

@kevinhendricks
Copy link
Contributor

Also is_open_element is a function of the parser state when parsing, not a gumbo node element state (ie. by the time parsing is complete, no elements would be on the is_open_element list and thus not something that shoud be stored inside an element itself.

@kevinhendricks
Copy link
Contributor

kevinhendricks commented Jul 27, 2017

That said ... I see now there already is a parser flags field in a GumboNode. You could "extend/abuse" the parser flags enum to add a "currently_in_open_elements" (private) flag bit and set and unset it as needed when adding/popping a node from open_elements without changing the abi. Then you could use it to implement your idea of replacing the O(n) test is_open_element with a quick node parser flag bit test.

@kevinhendricks
Copy link
Contributor

@DemiMarie If running a sanitizer you might also want to look at PR 384 to detect and prevent overflow with long numeric entities: #384

craigbarnes referenced this issue in craigbarnes/lua-gumbo Mar 2, 2018
During testing on OpenBSD 6.0, the test case with an input of 500 nested
div elements actually caused a stack overflow with Lua 5.3 (but not 5.2
or 5.1). Lowering the depth limit to 400 seems to solve the issue.

Allowing nesting of 800 levels deep was absurd anyway. Even WebKit's
default limit is only 512. Documents with more than a few hundred levels
of nesting are either badly authored, buggy or malicious.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants