Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Signed WARC URL generation #79

Open
ato opened this issue Mar 9, 2020 · 0 comments
Open

Signed WARC URL generation #79

ato opened this issue Mar 9, 2020 · 0 comments

Comments

@ato
Copy link
Member

ato commented Mar 9, 2020

@ikreymer has proposed a web archive architecture with replay capability purely client-side served by static instance of wabac.js, WARC files server by a simple static file server (nginx, S3) and OutbackCDX as the only dynamic server-side component. While technically this obviously already is totally doable it does mean making the full raw WARC files available for download which is likely unacceptable for many institutions who have a requirement to implement some level of restrictions or access controls.

Ilya suggested one solution to this problem would be for the index server to generated signed URLs which include a signature (or some other form of access token) which provides temporary access to specific records.

nginx

There are a lot of different nginx modules that can handle URLs with some kind of signature, HMAC or auth token. The stock secure link module would technically work but is probably best avoided as it uses MD5.

A simple example using https://github.com/nginx-modules/ngx_http_hmac_secure_link_module might be:

location /warcs {
    secure_link_hmac  $arg_token,$arg_timestamp,$arg_expiry;
    secure_link_hmac_secret my_secret_key;
    secure_link_hmac_message $uri|$arg_timestamp|$arg_expiry|$http_range;
    secure_link_hmac_algorithm sha256;
    if ($secure_link_hmac != "1") { return 404; }
}

With a URL that looks like:

https://warcstore/something.warc.gz?timestamp=2020-03-09T09:55:46Z&expiry=900&token=98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4

Note how the HMAC is configured to include $http_range which ensures the request is only valid for a single specific byte range.

S3

S3 has signed URLs which works rather similarly:

https://my-warc-store.s3-eu-west-1.amazonaws.com/something.warc.gz?X-Amz-Algorithm=AWS4-HMAC-SHA256
&X-Amz-Credential=AKIAIOSFODNN7EXAMPLE/20130721/us-east-1/s3/aws4_request
&X-Amz-Date=20200409T096646Z
&X-Amz-Expires=900
&X-Amz-Signature=13550350a8681c84c861aac2e5b440161c2b33a3e4f302ac680ca5b686de48de
&X-Amz-SignedHeaders=host;range
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant