
http module does not enforce spec #18405

Closed
JoshuaWise opened this issue Jan 27, 2018 · 6 comments
Labels
duplicate Issues and PRs that are duplicates of other issues or PRs. http Issues or PRs related to the http subsystem.

Comments

@JoshuaWise

JoshuaWise commented Jan 27, 2018

  • Version: all
  • Platform: all
  • Subsystem: all

The HTTP spec (RFC 7230) has the following definition for the request-target (URI) of a request:

 request-target = origin-form
                / absolute-form
                / authority-form
                / asterisk-form

We can follow this further to find that each form has similar requirements, all defined in RFC 7230 §5.3, which in turn references RFC 3986. There we find the definitions of authority (hostname and port) and path-abempty (pathname):

authority     = [ userinfo "@" ] host [ ":" port ]
userinfo      = *( unreserved / pct-encoded / sub-delims / ":" )
host          = IP-literal / IPv4address / reg-name
port          = *DIGIT
reg-name      = *( unreserved / pct-encoded / sub-delims )

path-abempty  = *( "/" segment )
segment       = *pchar
pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"

unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded   = "%" HEXDIG HEXDIG
sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
              / "*" / "+" / "," / ";" / "="

The specification restricts these values to specific sets of characters, and for good reason! However, the http module in Node.js simply allows any characters to appear here. Even the url module does not enforce the specification. Is it expected that anyone wishing to build an http server in Node.js should painstakingly examine these RFCs and do the protocol-level validation themselves?
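For illustration, here is a minimal sketch (not from the original report) of the behavior described above: Node's http server hands req.url through verbatim, even when it contains characters the ABNF forbids in a request-target. The port number is arbitrary, and the exact set of accepted characters may vary across Node.js versions and parser builds.

const http = require('http');
const net = require('net');

const server = http.createServer((req, res) => {
  // `|` and `^` are not legal request-target characters under the
  // ABNF above, yet they arrive in req.url unmodified.
  console.log('received url:', req.url); // e.g. "/a|b^c"
  res.end();
});

server.listen(8080, () => {
  // Send a raw request with an invalid request-target.
  const client = net.connect(8080, () => {
    client.end('GET /a|b^c HTTP/1.1\r\nHost: localhost\r\nConnection: close\r\n\r\n');
  });
  client.on('close', () => server.close());
});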

@bnoordhuis bnoordhuis added duplicate Issues and PRs that are duplicates of other issues or PRs. http Issues or PRs related to the http subsystem. labels Jan 27, 2018
@bnoordhuis
Member

Thanks for the report but since there are several open issues and pull requests about this particular topic already, I'm going to close out this one.

@TimothyGu
Member

@bnoordhuis Can you link to some of them?

@bnoordhuis
Member

I could more easily if GH's search was less terrible... at any rate, #13296 is one.

@TimothyGu
Member

TimothyGu commented Jan 28, 2018

I've done some more investigation into this. It seems there are multiple reasons for our current behavior.

  1. Performance. Validating the request-target using the ABNF basically requires parsing the URL for every request. This is a task we have traditionally delegated to userland libraries, and in smaller applications that use frameworks, URL parsing is in fact a bottleneck.
  2. Robustness. Postel's law of "Be liberal in what you accept, and conservative in what you send" still applies. It seems unwise for Node.js, as a server platform, to artificially enforce a validation that not only has nontrivial performance costs but would presumably also raise significant backward-compatibility concerns.
  3. Up-to-dateness of RFC 3986 itself. The WHATWG URL Standard is the modern standard for URL parsing and interpretation that aims to replace RFC 3986/3987. However, it is still somewhat unclear how to apply this Standard to HTTP targets (see Relative URLs in WHATWG URL API #12682). We do implement URL parsing according to the URL Standard through the WHATWG URL API; a sketch of applying it to a request-target follows this list.
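To make point 3 concrete, here is a minimal sketch (assumptions: the request is in origin-form and the Host header is usable as a base; the port is arbitrary) of applying the WHATWG URL API to an incoming request-target:

const http = require('http');
const { URL } = require('url');

http.createServer((req, res) => {
  // req.url is normally origin-form ("/path?query"), so the WHATWG
  // parser needs a base URL; here we derive one from the Host header.
  const base = 'http://' + (req.headers.host || 'localhost');
  let target;
  try {
    // Throws a TypeError only if the input cannot be parsed at all;
    // many RFC-invalid characters are percent-encoded rather than rejected.
    target = new URL(req.url, base);
  } catch (err) {
    res.statusCode = 400;
    return res.end('invalid request-target');
  }
  res.end(target.pathname);
}).listen(8080);

Note that this illustrates exactly the ambiguity discussed above: the WHATWG parser normalizes rather than rejects most odd input, so it is not a drop-in validator for the RFC 7230 grammar.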

@JoshuaWise
Author

JoshuaWise commented Jan 28, 2018

Thanks for the explanation. All of the above points make sense, but I think it's worth stating their counterarguments:

  1. Performance. Perhaps I'm naive, but I believe I've written a regular expression to parse a random URL according to the HTTP spec in an average of 500 nanoseconds. This is about 10x faster than the legacy url module (which does almost no validation), and 50x faster than the new URL constructor, both of which are too general for HTTP URLs, which have stricter requirements than general-case URLs. The regular expression below can be tweaked to be more or less forgiving of the input.
const requestTarget = /^(?:(?:(https?):\/\/|(?=[^/?*]+$))((?:(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]\d|\d)\.){3}(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]\d|\d)|(?:[a-z\d](?:[a-z\d-]{0,61}[a-z\d])?\.)*[a-z](?:[a-z\d-]{0,61}[a-z\d])?\.?)(?::([1-9]\d*|0)?)?(?!\*)|(?=[/*]))((?:\/(?:[\w.~:@!$&'()*+,;=-]|%[\da-f]{2})+)+\/?|\/|\*$)?(?:\?((?:[\w.~:@!$&'()*+,;=/?-]|%[\da-f]{2})*))?$/i;

// An example request-target to validate:
const url = 'https://example.com:8080/path/to/resource?key=value';

const match = url.match(requestTarget);
// If `match` is null, the url is not a valid `request-target`.
// Otherwise, we can split it into its parts:
const [, scheme, host, port, path, query] = match;
// If `scheme` is truthy, the url is in `absolute-form`.
// If `scheme` is falsy and `host` is truthy, the url is in `authority-form`.
// If both `scheme` and `host` are falsy, the url is in `origin-form` or `asterisk-form`.
  2. Robustness. This is a valid point, but at the same time, I think it violates the "rule of least surprise" for a module called http not to actually enforce the HTTP specification. It can also be considered a security concern if a developer assumes only certain data can come in. I'm definitely guilty of that: I've been using Node.js for years, and I'm only now discovering that every server I've ever built in Node.js didn't follow my assumptions. Also, as far as I'm aware, none of the popular http server libraries (express, restify, etc.) do any extra validation on the URL (a sketch of such a validation middleware follows this list). So if it's userland's responsibility, userland has been failing for years. To solve the backward-compatibility problem, we could add this validation only to the http2 module, which has low stability anyway.

  3. Up-to-dateness. The WHATWG states on their website:

The WHATWG’s focus is on standards implementable in web browsers

So it doesn't make sense to me to take their standards too seriously in other contexts, such as back-end applications. Implementation of their APIs should be considered optional for non-web-browser environments.
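As referenced in point 2 above, a hypothetical userland middleware, sketched here reusing the requestTarget regex defined earlier (the function name and Express-style signature are illustrative assumptions, not an existing API):

function enforceRequestTarget(req, res, next) {
  // `requestTarget` is the regex from the previous comment.
  if (!requestTarget.test(req.url)) {
    res.statusCode = 400;
    return res.end('Bad Request: invalid request-target');
  }
  next();
}

// Hypothetical usage with Express: app.use(enforceRequestTarget);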

@TimothyGu
Member

Perhaps I'm naive, but I believe I've written a regular expression to parse a random URL according to the HTTP spec in an average of 500 nanoseconds.

There are many differences between your regular expression and real-world URL parsers:

  1. It does too much validation. Weird (and invalid, according to the RFC) URLs like https://:@test can be seen in the wild. In fact, I encourage you to look at urltestdata.json for a whole set of weird URLs modern implementations have to deal with.
  2. It does not do normalization. Take https://abc, for example. It is normalized to https://abc/ by most implementations of URL that I can find (from web browsers to Node.js' url.parse to cURL; Python's urllib.parse is a notable exception, but it claims to follow the very old and obsoleted RFC 1808, RFC 3986's predecessor); a quick demonstration follows this list. Also, the set of characters to percent-encode differs significantly between the RFC and actual implementations.
  3. It does too little validation. Modern URL implementations have to deal with things like IDNA. This means that URLs like https://xn--abc should be seen as invalid, even though some major browsers (like Chrome) don't.
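The normalization behavior in point 2 is easy to observe with Node's two parsers (a quick sketch; the output comments reflect typical behavior and may vary by Node.js version):

const { parse, URL } = require('url');

console.log(parse('https://abc').href);   // 'https://abc/' -- trailing slash added
console.log(new URL('https://abc').href); // 'https://abc/' -- same normalization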

This is about 10x faster than the legacy url module (which does almost no validation), and 50x faster than the new URL constructor

Aside from regex vs. state machine, I'm interested in investigating the 5x performance difference you are observing between url.parse and URL. From our existing benchmark, they should be roughly even.
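For anyone wanting to reproduce the comparison locally, here is a rough micro-benchmark sketch (not Node's official benchmark suite; absolute numbers are machine- and version-dependent, and console.time is a coarse instrument):

const { parse, URL } = require('url');
const sample = 'https://example.com:8080/a/b?c=d';

console.time('url.parse');
for (let i = 0; i < 1e6; i++) parse(sample);
console.timeEnd('url.parse');

console.time('new URL');
for (let i = 0; i < 1e6; i++) new URL(sample);
console.timeEnd('new URL');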

So it doesn't make sense to me to take their standards too seriously in other contexts, such as back-end applications. Implementation of their APIs should be considered optional for non-web-browser environments.

There is an important distinction between APIs and URL semantics. While implementing the exact APIs is certainly optional (though helpful for folks writing isomorphic applications), the WHATWG, as effectively the only organization still evolving the URL specification, should certainly be seen as an (if not the) authoritative source.
