fix O(n^2) behavior when checking for duplicate attributes (libgumbo) #2568

flavorjones · 2022-06-05T20:53:04Z

Please describe the bug

Code was added to limit the number of attributes supported per element to prevent DoS attacks: rubys/nokogumbo#143

That safety limit is here: https://github.com/sparklemotion/nokogiri/blob/main/gumbo-parser/src/tokenizer.c#L792

It would be great to support more attributes by addressing performance concerns in the implementation.

flavorjones · 2024-12-26T14:49:15Z

Benchmark from rubys/nokogumbo#143 recorded with v1.18.0:

#!/usr/bin/env ruby

require "bundler/inline"

gemfile do
  source "https://rubygems.org"
  gem "nokogiri"
  gem "benchmark-ips"
end

Benchmark.ips do |x|
  x.warmup = 0

  [
    1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192,
  ].each do |attribute_count|
    html = <<~HTML
      <div
        #{attribute_count.times.map { |i| "fake-attr-#{i}" }.join("\n")}
      >
    HTML

    x.report "#{attribute_count.to_s.rjust(7)} attributes" do
      Nokogiri::HTML5(html, max_attributes: 100_000)
    end
  end
end

$ ./issues/2568-html5-attr-perf.rb
Calculating -------------------------------------
        1 attributes    247.327k (±10.3%) i/s    (4.04 μs/i) -    947.757k in   4.842134s
        2 attributes    226.253k (± 9.8%) i/s    (4.42 μs/i) -    867.564k in   4.856891s
        4 attributes    198.142k (±10.1%) i/s    (5.05 μs/i) -    749.855k in   4.874406s
        8 attributes    155.885k (± 9.6%) i/s    (6.42 μs/i) -    580.246k in   4.906291s
       16 attributes     95.502k (± 9.3%) i/s   (10.47 μs/i) -    350.276k in   4.943592s
       32 attributes     59.020k (± 8.6%) i/s   (16.94 μs/i) -    208.877k in   4.965003s
       64 attributes     28.047k (± 9.3%) i/s   (35.65 μs/i) -    105.076k in   4.979541s
      128 attributes     12.694k (± 9.6%) i/s   (78.78 μs/i) -     50.218k in   4.989599s
      256 attributes      4.975k (± 9.1%) i/s  (201.01 μs/i) -     20.663k in   5.000307s
      512 attributes      1.540k (± 8.0%) i/s  (649.43 μs/i) -      6.851k in   4.998259s
     1024 attributes    384.302 (±11.2%) i/s    (2.60 ms/i) -      1.825k in   4.999100s
     2048 attributes    127.054 (±11.0%) i/s    (7.87 ms/i) -    618.000 in   5.004923s
     4096 attributes     30.207 (± 9.9%) i/s   (33.11 ms/i) -    150.000 in   5.025409s
     8192 attributes      6.699 (± 0.0%) i/s  (149.27 ms/i) -     34.000 in   5.101572s
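
One way to read these numbers: if parse time is quadratic in the attribute count, doubling the count should roughly quadruple the time per iteration. A minimal sketch that computes the doubling ratios from the timings reported above (values copied from the benchmark output and converted to microseconds; `TIMINGS_US` and `doubling_ratios` are illustrative names, not part of the benchmark script):

```ruby
# Time per iteration in microseconds, copied from the benchmark output above.
TIMINGS_US = {
  1 => 4.04, 2 => 4.42, 4 => 5.05, 8 => 6.42, 16 => 10.47, 32 => 16.94,
  64 => 35.65, 128 => 78.78, 256 => 201.01, 512 => 649.43,
  1024 => 2_600.0, 2048 => 7_870.0, 4096 => 33_110.0, 8192 => 149_270.0,
}.freeze

# For each doubling of the attribute count, return [count, ratio], where
# ratio is time(count) / time(count / 2). Linear growth gives about 2,
# quadratic growth gives about 4.
def doubling_ratios(timings)
  timings.keys.sort.each_cons(2).map do |smaller, larger|
    [larger, timings[larger] / timings[smaller]]
  end
end

doubling_ratios(TIMINGS_US).each do |count, ratio|
  puts format("%5d attributes: %.2fx slower than %d", count, ratio, count / 2)
end
```

At small counts the ratio stays close to 1 because fixed overhead dominates. From 512 attributes upward it ranges between roughly 3x and 4.5x, which is about what quadratic growth would predict given the ±10% noise in the measurements.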

flavorjones · 2024-12-26T15:59:54Z

There are two primary hotspots:

1. libxml2's xmlNewPropInternal, which traverses the properties list to append
2. libgumbo's tokenizer.c:finish_attribute_name, which checks for duplicates in the attribute list with this code:

nokogiri/gumbo-parser/src/tokenizer.c

Lines 804 to 820 in 729c96c

    
    for (unsigned int i = 0; i < attributes->length; ++i) {
      GumboAttribute* attr = attributes->data[i];
      if (
        strlen(attr->name) == tag_state->_buffer.length
        && 0 == memcmp (
          attr->name,
          tag_state->_buffer.data,
          tag_state->_buffer.length
        )
      ) {
        // Identical attribute; bail.
        add_duplicate_attr_error(parser);
        reinitialize_tag_buffer(parser);
        tag_state->_drop_next_attr_value = true;
        return;
      }
    }
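
To see why this loop is quadratic: each new attribute name is compared against every attribute already on the element, so an element with n distinct attributes costs 0 + 1 + ... + (n - 1) = n(n - 1)/2 comparisons in total. A rough Ruby model of the scan makes the growth visible. It is not the libgumbo code itself, and `count_comparisons` is a hypothetical helper for illustration:

```ruby
# Model of the linear duplicate scan: for each incoming name, walk the
# list of names seen so far and count the string comparisons performed.
def count_comparisons(names)
  seen = []
  comparisons = 0
  names.each do |name|
    duplicate = false
    seen.each do |existing|
      comparisons += 1
      if existing == name
        duplicate = true
        break # identical attribute; the tokenizer bails out at this point
      end
    end
    seen << name unless duplicate
  end
  comparisons
end

[16, 128, 1024].each do |n|
  names = Array.new(n) { |i| "fake-attr-#{i}" }
  puts "#{n} attributes: #{count_comparisons(names)} comparisons"
end
```

Going from 16 to 1024 attributes (64 times as many) raises the count from 120 to 523,776 comparisons, which matches n(n - 1)/2.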

If there are more than 16 attributes, shift from doing strcmp to using a hashmap for duplicate detection. The number 16 was chosen based on the benchmark in #2568. I've introduced @tidwall's hashmap.c (MIT licensed, with the copyright notice copied into the LICENSE-DEPENDENCIES file) to have something self-contained within the libgumbo codebase, rather than using libxml2's xmlHash or Ruby's st.c.
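
The strategy described here can be sketched as follows. This is a Ruby illustration under stated assumptions, not the C change itself: Ruby's `Set` stands in for hashmap.c, and the class name `AttributeNames`, its methods, and the constant `HASH_THRESHOLD` are hypothetical. Only the threshold of 16 comes from the description above.

```ruby
require "set"

# Tracks the attribute names seen on one element. Small elements use a
# plain linear scan; once more than HASH_THRESHOLD names are stored, a
# hash-based set is built and used for lookups instead.
class AttributeNames
  HASH_THRESHOLD = 16 # threshold taken from the description above

  def initialize
    @names = []
    @index = nil # built lazily when the threshold is crossed
  end

  # Returns true if the name was added, false if it is a duplicate.
  def add?(name)
    return false if include?(name)

    @names << name
    if @index
      @index << name
    elsif @names.length > HASH_THRESHOLD
      @index = Set.new(@names)
    end
    true
  end

  def include?(name)
    @index ? @index.include?(name) : @names.include?(name)
  end

  # True once lookups have switched from the linear scan to the set.
  def hashed?
    !@index.nil?
  end
end

names = AttributeNames.new
20.times { |i| names.add?("fake-attr-#{i}") }
puts names.add?("fake-attr-3") # duplicate, so this prints false
puts names.hashed?             # 20 > 16, so this prints true
```

Keeping the linear scan for small elements presumably avoids hashing overhead in the common case, where an element carries only a handful of attributes.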


flavorjones added the topic/performance label Jun 5, 2022

flavorjones mentioned this issue Jun 5, 2022

Fix O(n^2) behavior when checking for duplicate attributes rubys/nokogumbo#144

Closed

flavorjones added the topic/gumbo Gumbo HTML5 parser label Jun 5, 2022

flavorjones added this to the v1.18.0 milestone Jul 3, 2024

flavorjones modified the milestones: v1.18.0, v1.19.0 Dec 16, 2024

flavorjones mentioned this issue Dec 26, 2024

perf: html5 attribute parsing #3393

Merged

flavorjones closed this as completed in #3393 Dec 29, 2024

flavorjones closed this as completed in 1d83082 Dec 29, 2024
