
http module does not enforce spec #18405

Closed
JoshuaWise opened this issue Jan 27, 2018 · 6 comments
Labels
duplicate Issues and PRs that are duplicates of other issues or PRs. http Issues or PRs related to the http subsystem.

Comments

@JoshuaWise

JoshuaWise commented Jan 27, 2018

  • Version: all
  • Platform: all
  • Subsystem: all

The HTTP spec (RFC 7230) has the following definition for the request-target (URI) of a request:

 request-target = origin-form
                / absolute-form
                / authority-form
                / asterisk-form

We can follow this further to find that each form has similar requirements, all defined in RFC 7230 §5.3, which in turn references RFC 3986. There we find the definitions of authority (hostname and port) and path-abempty (pathname):

authority     = [ userinfo "@" ] host [ ":" port ]
userinfo      = *( unreserved / pct-encoded / sub-delims / ":" )
host          = IP-literal / IPv4address / reg-name
port          = *DIGIT
reg-name      = *( unreserved / pct-encoded / sub-delims )

path-abempty  = *( "/" segment )
segment       = *pchar
pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"

unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded   = "%" HEXDIG HEXDIG
sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
              / "*" / "+" / "," / ";" / "="

The specification restricts these values to specific sets of characters, and for good reason! However, the http module in Node.js simply allows any characters to appear here. Even the url module does not enforce the specification. Is it expected that anyone wishing to build an http server in Node.js should painstakingly examine these RFCs and do the protocol-level validation themselves?
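For illustration, here is a minimal sketch (not from the original report) of the behavior described above: Node's http server hands req.url through verbatim, even when it contains characters the ABNF forbids in a request-target. The port number is arbitrary, and the exact set of accepted characters may vary across Node.js versions and parser builds.

const http = require('http');
const net = require('net');

const server = http.createServer((req, res) => {
  // `|` and `^` are not legal request-target characters under the
  // ABNF above, yet they arrive in req.url unmodified.
  console.log('received url:', req.url); // e.g. "/a|b^c"
  res.end();
});

server.listen(8080, () => {
  // Send a raw request with an invalid request-target.
  const client = net.connect(8080, () => {
    client.end('GET /a|b^c HTTP/1.1\r\nHost: localhost\r\nConnection: close\r\n\r\n');
  });
  client.on('close', () => server.close());
});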

@bnoordhuis bnoordhuis added duplicate Issues and PRs that are duplicates of other issues or PRs. http Issues or PRs related to the http subsystem. labels Jan 27, 2018
@bnoordhuis
Member

Thanks for the report but since there are several open issues and pull requests about this particular topic already, I'm going to close out this one.

@TimothyGu
Member

@bnoordhuis Can you link to some of them?

@bnoordhuis
Member

I could more easily if GH's search was less terrible... at any rate, #13296 is one.

@TimothyGu
Member

TimothyGu commented Jan 28, 2018

I've done some more investigation into this. It seems there are multiple reasons for our current behavior.

  1. Performance. Validating the request-target using the ABNF basically requires parsing the URL for every request. This is a task we have traditionally delegated to userland libraries, and in smaller applications that use frameworks, URL parsing is in fact a bottleneck.
  2. Robustness. Postel's law of "Be liberal in what you accept, and conservative in what you send" still applies. It seems unwise for Node.js, as a server platform, to artificially enforce a validation that not only has nontrivial performance costs but would presumably also raise significant backward-compatibility concerns.
  3. Up-to-dateness of RFC 3986 itself. The WHATWG URL Standard is the modern standard for URL parsing and interpretation that aims to replace RFC 3986/3987. However, it is still somewhat unclear how to apply this Standard to HTTP targets (see Relative URLs in WHATWG URL API #12682). We do implement URL parsing according to the URL Standard through the WHATWG URL API; a sketch of applying it to a request-target follows this list.
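To make point 3 concrete, here is a minimal sketch (assumptions: the request is in origin-form and the Host header is usable as a base; the port is arbitrary) of applying the WHATWG URL API to an incoming request-target:

const http = require('http');
const { URL } = require('url');

http.createServer((req, res) => {
  // req.url is normally origin-form ("/path?query"), so the WHATWG
  // parser needs a base URL; here we derive one from the Host header.
  const base = 'http://' + (req.headers.host || 'localhost');
  let target;
  try {
    // Throws a TypeError only if the input cannot be parsed at all;
    // many RFC-invalid characters are percent-encoded rather than rejected.
    target = new URL(req.url, base);
  } catch (err) {
    res.statusCode = 400;
    return res.end('invalid request-target');
  }
  res.end(target.pathname);
}).listen(8080);

Note that this illustrates exactly the ambiguity discussed above: the WHATWG parser normalizes rather than rejects most odd input, so it is not a drop-in validator for the RFC 7230 grammar.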

@JoshuaWise
Author

JoshuaWise commented Jan 28, 2018

Thanks for the explanation. All of the above points make sense, but I think it's worth stating their counterarguments:

  1. Performance. Perhaps I'm naive, but I believe I've written a regular expression to parse a random URL according to the HTTP spec in an average of 500 nanoseconds. This is about 10x faster than the legacy url module (which does almost no validation), and 50x faster than the new URL constructor, both of which are too general for HTTP URLs, which have stricter requirements than general-case URLs. The regular expression below can be tweaked to be more or less forgiving of the input.
const requestTarget = /^(?:(?:(https?):\/\/|(?=[^/?*]+$))((?:(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]\d|\d)\.){3}(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]\d|\d)|(?:[a-z\d](?:[a-z\d-]{0,61}[a-z\d])?\.)*[a-z](?:[a-z\d-]{0,61}[a-z\d])?\.?)(?::([1-9]\d*|0)?)?(?!\*)|(?=[/*]))((?:\/(?:[\w.~:@!$&'()*+,;=-]|%[\da-f]{2})+)+\/?|\/|\*$)?(?:\?((?:[\w.~:@!$&'()*+,;=/?-]|%[\da-f]{2})*))?$/i;

// An example request-target to validate:
const url = 'https://example.com:8080/path/to/resource?key=value';

const match = url.match(requestTarget);
// If `match` is null, the url is not a valid `request-target`.
// Otherwise, we can split it into its parts:
const [, scheme, host, port, path, query] = match;
// If `scheme` is truthy, the url is in `absolute-form`.
// If `scheme` is falsy and `host` is truthy, the url is in `authority-form`.
// If both `scheme` and `host` are falsy, the url is in `origin-form` or `asterisk-form`.
  2. Robustness. This is a valid point, but at the same time, I think it violates the "rule of least surprise" for a module called http not to actually enforce the HTTP specification. It can also be considered a security concern if a developer assumes only certain data can come in. I'm definitely guilty of that: I've been using Node.js for years, and I'm only now discovering that every server I've ever built in Node.js didn't follow my assumptions. Also, as far as I'm aware, none of the popular http server libraries (express, restify, etc.) do any extra validation on the URL (a sketch of such a validation middleware follows this list). So if it's userland's responsibility, userland has been failing for years. To solve the backward-compatibility problem, we could add this validation only to the http2 module, which has low stability anyway.

  3. Up-to-dateness. The WHATWG states on their website:

The WHATWG’s focus is on standards implementable in web browsers

So it doesn't make sense to me to take their standards too seriously in other contexts, such as back-end applications. Implementation of their APIs should be considered optional for non-web-browser environments.
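As referenced in point 2 above, a hypothetical userland middleware, sketched here reusing the requestTarget regex defined earlier (the function name and Express-style signature are illustrative assumptions, not an existing API):

function enforceRequestTarget(req, res, next) {
  // `requestTarget` is the regex from the previous comment.
  if (!requestTarget.test(req.url)) {
    res.statusCode = 400;
    return res.end('Bad Request: invalid request-target');
  }
  next();
}

// Hypothetical usage with Express: app.use(enforceRequestTarget);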

@TimothyGu
Member

Perhaps I'm naive, but I believe I've written a regular expression to parse a random URL according to the HTTP spec in an average of 500 nanoseconds.

There are many differences between your regular expression and real-world URL parsers:

  1. It does too much validation. Weird (and invalid, according to the RFC) URLs like https://:@test can be seen in the wild. In fact, I encourage you to look at urltestdata.json for a whole set of weird URLs modern implementations have to deal with.
  2. It does not do normalization. Take https://abc, for example. It is normalized to https://abc/ by most implementations of URL that I can find (from web browsers to Node.js' url.parse to cURL; Python's urllib.parse is a notable exception, but it claims to follow the very old and obsoleted RFC 1808, RFC 3986's predecessor); a quick demonstration follows this list. Also, the set of characters to percent-encode differs significantly between the RFC and actual implementations.
  3. It does too little validation. Modern URL implementations have to deal with things like IDNA. This means that URLs like https://xn--abc should be seen as invalid, even though some major browsers (like Chrome) don't.
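The normalization behavior in point 2 is easy to observe with Node's two parsers (a quick sketch; the output comments reflect typical behavior and may vary by Node.js version):

const { parse, URL } = require('url');

console.log(parse('https://abc').href);   // 'https://abc/' -- trailing slash added
console.log(new URL('https://abc').href); // 'https://abc/' -- same normalization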

This is about 10x faster than the legacy url module (which does almost no validation), and 50x faster than the new URL constructor

Aside from regex vs. state machine, I'm interested in investigating the 5x performance difference you are observing between url.parse and URL. From our existing benchmark, they should be roughly even.
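For anyone wanting to reproduce the comparison locally, here is a rough micro-benchmark sketch (not Node's official benchmark suite; absolute numbers are machine- and version-dependent, and console.time is a coarse instrument):

const { parse, URL } = require('url');
const sample = 'https://example.com:8080/a/b?c=d';

console.time('url.parse');
for (let i = 0; i < 1e6; i++) parse(sample);
console.timeEnd('url.parse');

console.time('new URL');
for (let i = 0; i < 1e6; i++) new URL(sample);
console.timeEnd('new URL');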

So it doesn't make sense to me to take their standards too seriously in other contexts, such as back-end applications. Implementation of their APIs should be considered optional for non-web-browser environments.

There is an important distinction between APIs and URL semantics. While implementing the exact APIs is certainly optional (though helpful for folks writing isomorphic applications), the WHATWG, as effectively the only organization still evolving the URL specification, should certainly be seen as an (if not the) authoritative source.
