Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix O(n^2) behavior when checking for duplicate attributes (libgumbo) #2568

Closed
flavorjones opened this issue Jun 5, 2022 · 2 comments · Fixed by #3393
Closed

fix O(n^2) behavior when checking for duplicate attributes (libgumbo) #2568

flavorjones opened this issue Jun 5, 2022 · 2 comments · Fixed by #3393
Labels
Milestone

Comments

@flavorjones
Copy link
Member

Please describe the bug

Originally rubys/nokogumbo#144

Code was added to limit the number of attributes supported per element to prevent DoS attacks: rubys/nokogumbo#143

That safety limit is here: https://github.com/sparklemotion/nokogiri/blob/main/gumbo-parser/src/tokenizer.c#L792

It would be great to support more attributes by addressing performance concerns in the implementation.

@flavorjones
Copy link
Member Author

Benchmark from rubys/nokogumbo#143 recorded with v1.18.0:

#!/usr/bin/env ruby

require "bundler/inline"

gemfile do
  source "https://rubygems.org"
  gem "nokogiri"
  gem "benchmark-ips"
end

Benchmark.ips do |x|
  x.warmup = 0

  [
    1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192,
  ].each do |attribute_count|
    html = <<~HTML
      <div
        #{attribute_count.times.map { |x| "fake-attr-#{x}" }.join("\n")}
      >
    HTML

    x.report "#{attribute_count.to_s.rjust(7)} attributes" do
      Nokogiri::HTML5(html, max_attributes: 100_000)
    end
  end
end
$ ./issues/2568-html5-attr-perf.rb
Calculating -------------------------------------
        1 attributes    247.327k (±10.3%) i/s    (4.04 μs/i) -    947.757k in   4.842134s
        2 attributes    226.253k (± 9.8%) i/s    (4.42 μs/i) -    867.564k in   4.856891s
        4 attributes    198.142k (±10.1%) i/s    (5.05 μs/i) -    749.855k in   4.874406s
        8 attributes    155.885k (± 9.6%) i/s    (6.42 μs/i) -    580.246k in   4.906291s
       16 attributes     95.502k (± 9.3%) i/s   (10.47 μs/i) -    350.276k in   4.943592s
       32 attributes     59.020k (± 8.6%) i/s   (16.94 μs/i) -    208.877k in   4.965003s
       64 attributes     28.047k (± 9.3%) i/s   (35.65 μs/i) -    105.076k in   4.979541s
      128 attributes     12.694k (± 9.6%) i/s   (78.78 μs/i) -     50.218k in   4.989599s
      256 attributes      4.975k (± 9.1%) i/s  (201.01 μs/i) -     20.663k in   5.000307s
      512 attributes      1.540k (± 8.0%) i/s  (649.43 μs/i) -      6.851k in   4.998259s
     1024 attributes    384.302 (±11.2%) i/s    (2.60 ms/i) -      1.825k in   4.999100s
     2048 attributes    127.054 (±11.0%) i/s    (7.87 ms/i) -    618.000 in   5.004923s
     4096 attributes     30.207 (± 9.9%) i/s   (33.11 ms/i) -    150.000 in   5.025409s
     8192 attributes      6.699 (± 0.0%) i/s  (149.27 ms/i) -     34.000 in   5.101572s

@flavorjones
Copy link
Member Author

flavorjones commented Dec 26, 2024

There are two primary hotspots:

  • libxml2's xmlNewPropInternal traversing the properties list to append
  • libgumbo's tokenizer.c:finish_attribute_name which checks for duplicates in the attribute list with this code:

for (unsigned int i = 0; i < attributes->length; ++i) {
GumboAttribute* attr = attributes->data[i];
if (
strlen(attr->name) == tag_state->_buffer.length
&& 0 == memcmp (
attr->name,
tag_state->_buffer.data,
tag_state->_buffer.length
)
) {
// Identical attribute; bail.
add_duplicate_attr_error(parser);
reinitialize_tag_buffer(parser);
tag_state->_drop_next_attr_value = true;
return;
}
}

flavorjones added a commit that referenced this issue Dec 26, 2024
If there are more than 16 attributes, shift from doing strcmp to using
a hashmap for duplicate detection. The number 16 was chosen based on
the benchmark in #2568

I've introduced @tidwall's hashmap.c (MIT licensed and the copyright
appropriately copied in the LICENSE-DEPENDENCIES file) to have
something self-contained within the libgumbo codebase, rather than
using libxml2's xmlHash or ruby's st.c.
flavorjones added a commit that referenced this issue Dec 26, 2024
If there are more than 16 attributes, shift from doing strcmp to using
a hashmap for duplicate detection. The number 16 was chosen based on
the benchmark in #2568

I've introduced @tidwall's hashmap.c (MIT licensed and the copyright
appropriately copied in the LICENSE-DEPENDENCIES file) to have
something self-contained within the libgumbo codebase, rather than
using libxml2's xmlHash or ruby's st.c.
flavorjones added a commit that referenced this issue Dec 26, 2024
If there are more than 16 attributes, shift from doing strcmp to using
a hashmap for duplicate detection. The number 16 was chosen based on
the benchmark in #2568

I've introduced @tidwall's hashmap.c (MIT licensed and the copyright
appropriately copied in the LICENSE-DEPENDENCIES file) to have
something self-contained within the libgumbo codebase, rather than
using libxml2's xmlHash or ruby's st.c.
flavorjones added a commit that referenced this issue Dec 27, 2024
If there are more than 16 attributes, shift from doing strcmp to using
a hashmap for duplicate detection. The number 16 was chosen based on
the benchmark in #2568

I've introduced @tidwall's hashmap.c (MIT licensed and the copyright
appropriately copied in the LICENSE-DEPENDENCIES file) to have
something self-contained within the libgumbo codebase, rather than
using libxml2's xmlHash or ruby's st.c.
flavorjones added a commit that referenced this issue Dec 28, 2024
If there are more than 16 attributes, shift from doing strcmp to using
a hashmap for duplicate detection. The number 16 was chosen based on
the benchmark in #2568

I've introduced @tidwall's hashmap.c (MIT licensed and the copyright
appropriately copied in the LICENSE-DEPENDENCIES file) to have
something self-contained within the libgumbo codebase, rather than
using libxml2's xmlHash or ruby's st.c.
flavorjones added a commit that referenced this issue Dec 28, 2024
If there are more than 16 attributes, shift from doing strcmp to using
a hashmap for duplicate detection. The number 16 was chosen based on
the benchmark in #2568

I've introduced @tidwall's hashmap.c (MIT licensed and the copyright
appropriately copied in the LICENSE-DEPENDENCIES file) to have
something self-contained within the libgumbo codebase, rather than
using libxml2's xmlHash or ruby's st.c.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant