WaybackURLKeyMaker mangles URLs with IPv4-mapped IPv6 addresses #104

Open
sebastian-nagel opened this issue Dec 17, 2024 · 0 comments

URLs containing IPv4-mapped or IPv4-compatible IPv6 addresses (e.g., ::ffff:192.0.2.128) are mangled by WaybackURLKeyMaker: the enclosing square brackets are not removed but stay attached to the first and last parts of the host when it is split at dots and the parts are reversed:

jshell> import org.archive.url.WaybackURLKeyMaker;
jshell> var km = new WaybackURLKeyMaker();
jshell> km.makeKey("http://[::ffff:123.123.87.87]:8080/index.html")
$3 ==> "87],87,123,[::ffff:123:8080)/index.html"

For comparison, the Python surt module removes the square brackets before splitting at dots and reversing the parts:

$> pip3 show surt
Name: surt
Version: 0.3.1
Summary: Sort-friendly URI Reordering Transform (SURT) python package.

$> python3
Python 3.12.3 (main, Nov  6 2024, 18:32:19) [GCC 13.2.0] on linux
>>> from surt import surt
>>> surt("http://[::ffff:123.123.87.87]:8080/index.html")
'87,87,123,::ffff:123:8080)/index.html'
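
The difference between the two outputs can be reproduced with a minimal sketch. The `reverse_host` helper below is hypothetical and is not the code of either library; it only assumes that the host is split at dots and reversed, and that the port is appended separately:

```python
def reverse_host(host: str, strip_brackets: bool) -> str:
    """Hypothetical helper: reverse the dot-separated parts of a host."""
    if strip_brackets:
        # Drop the square brackets that enclose an IPv6 literal.
        host = host.strip("[]")
    return ",".join(reversed(host.split(".")))


host, port = "[::ffff:123.123.87.87]", "8080"

# Brackets kept: they travel with the first and last parts.
kept = reverse_host(host, strip_brackets=False) + ":" + port
# Brackets removed before splitting.
stripped = reverse_host(host, strip_brackets=True) + ":" + port

print(kept)      # 87],87,123,[::ffff:123:8080
print(stripped)  # 87,87,123,::ffff:123:8080
```

Under these assumptions, stripping the brackets before the split is enough to turn the first result into the second.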

I'm not sure what the best representation is:

  • normalize the IPv4-mapped representation - ::ffff:123.123.87.87 becomes
    • ::ffff:7b7b:5757
    • or 123.123.87.87
  • the double use of the colon, both inside IPv6 addresses and as the port separator, is troublesome, but maybe not an issue, because SURT keys are recall-oriented and some ambiguity is acceptable. It would also be a separate issue.
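
To illustrate the two normalization options, here is a rough sketch using Python's standard `ipaddress` module. The function name `normalize_mapped_host` and its `to_ipv4` flag are made up for this example and are not part of surt or WaybackURLKeyMaker; the sketch only handles IPv4-mapped addresses (`::ffff:a.b.c.d`), not IPv4-compatible ones:

```python
import ipaddress


def normalize_mapped_host(host: str, to_ipv4: bool = False) -> str:
    """Hypothetical helper: normalize an IPv4-mapped IPv6 host literal.

    Hosts that are not IPv4-mapped IPv6 addresses are returned unchanged.
    """
    bare = host.strip("[]")
    try:
        addr = ipaddress.IPv6Address(bare)
    except ValueError:
        return host  # not an IPv6 literal, e.g. a domain name
    mapped = addr.ipv4_mapped
    if mapped is None:
        return host  # a regular IPv6 address, left as-is
    if to_ipv4:
        # Option 2: plain dotted-quad IPv4 form.
        return str(mapped)
    # Option 1: pure hexadecimal IPv6 form, built from the 32-bit value.
    value = int(mapped)
    return f"::ffff:{value >> 16:x}:{value & 0xFFFF:x}"


print(normalize_mapped_host("[::ffff:123.123.87.87]"))                # ::ffff:7b7b:5757
print(normalize_mapped_host("[::ffff:123.123.87.87]", to_ipv4=True))  # 123.123.87.87
```

The hexadecimal form contains no dots, so it would pass through the split-and-reverse step as a single part, while the dotted-quad form would be keyed like any other IPv4 host.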