fix O(n^2) behavior when checking for duplicate attributes (libgumbo) #2568
Benchmark from rubys/nokogumbo#143 recorded with v1.18.0:

    #!/usr/bin/env ruby
    require "bundler/inline"

    gemfile do
      source "https://rubygems.org"
      gem "nokogiri"
      gem "benchmark-ips"
    end

    Benchmark.ips do |x|
      x.warmup = 0
      [
        1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192,
      ].each do |attribute_count|
        html = <<~HTML
          <div
          #{attribute_count.times.map { |x| "fake-attr-#{x}" }.join("\n")}
          >
        HTML
        x.report "#{attribute_count.to_s.rjust(7)} attributes" do
          Nokogiri::HTML5(html, max_attributes: 100_000)
        end
      end
    end

There are two primary hotspots:
nokogiri/gumbo-parser/src/tokenizer.c Lines 804 to 820 in 729c96c
flavorjones added six commits that referenced this issue between Dec 26 and Dec 28, 2024, each with the same message:

If there are more than 16 attributes, shift from doing strcmp to using a hashmap for duplicate detection. The number 16 was chosen based on the benchmark in #2568.

I've introduced @tidwall's hashmap.c (MIT licensed and the copyright appropriately copied in the LICENSE-DEPENDENCIES file) to have something self-contained within the libgumbo codebase, rather than using libxml2's xmlHash or ruby's st.c.
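
To make the approach in the commit message concrete, here is a minimal Ruby sketch of threshold-based duplicate detection: a linear scan while the attribute list is short, and a hash-backed set once it grows past a limit. It only illustrates the idea. The real change lives in libgumbo's C tokenizer and uses hashmap.c, and the names `LINEAR_SCAN_LIMIT` and `dedupe_attribute_names` are hypothetical, not part of Nokogiri or libgumbo.

```ruby
require "set"

# Hypothetical constant mirroring the threshold of 16 from the commit message.
LINEAR_SCAN_LIMIT = 16

# Hypothetical helper: returns the names with later duplicates dropped,
# keeping the first occurrence of each name.
def dedupe_attribute_names(names)
  kept = []
  seen = nil # built lazily, only once the list reaches the limit

  names.each do |name|
    # Switch strategies when the next attribute would exceed the limit.
    seen = Set.new(kept) if seen.nil? && kept.length >= LINEAR_SCAN_LIMIT

    duplicate =
      if seen
        seen.include?(name)         # expected O(1) per lookup
      else
        kept.any? { |k| k == name } # O(n) scan, cheap for short lists
      end
    next if duplicate

    kept << name
    seen << name if seen
  end

  kept
end

dedupe_attribute_names(%w[id class id]) # => ["id", "class"]
```

With only the linear scan, checking n attributes costs on the order of n^2 string comparisons in total. The set lookup keeps the total roughly linear for large n, while small elements avoid the overhead of building a set at all.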
Please describe the bug
Originally rubys/nokogumbo#144
Code was added to limit the number of attributes supported per element to prevent DoS attacks: rubys/nokogumbo#143
That safety limit is here: https://github.com/sparklemotion/nokogiri/blob/main/gumbo-parser/src/tokenizer.c#L792
It would be great to support more attributes by addressing performance concerns in the implementation.