Update usage docs section on creating web archives #899

Merged
merged 4 commits into from
Apr 15, 2024
20 changes: 14 additions & 6 deletions docs/manual/usage.rst
@@ -154,20 +154,20 @@ To enable auto-indexing, run with ``wayback -a`` or ``wayback -a --auto-interval
Creating a Web Archive
----------------------

Using Webrecorder
^^^^^^^^^^^^^^^^^
Using ArchiveWeb.page
^^^^^^^^^^^^^^^^^^^^^

If you do not have a web archive to test, one easy way to create one is to use `Webrecorder <https://webrecorder.io>`_
If you do not have a web archive to test, one easy way to create one is to use the `ArchiveWeb.page <https://archiveweb.page>`_ browser extension for Chrome and other Chromium-based browsers such as Brave Browser. ArchiveWeb.page records pages visited during an archiving session in the browser, and provides means of both replaying and downloading the archived items created.

After recording, you can click **Stop** and then click `Download Collection` to receive a WARC (`.warc.gz`) file.
Follow the instructions in `How To Create Web Archives with ArchiveWeb.page <https://archiveweb.page/en/usage/>`_. After recording, press **Stop** and then `download your collection <https://archiveweb.page/en/download/>`_ to receive a WARC (`.warc.gz`) file. If you choose to download your collection in the WACZ format, the WARC files can be found inside the zipped WACZ in the ``archive/`` directory.

You can then use this with work with pywb.
You can then use your WARCs to work with pywb.
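
If the collection was downloaded as WACZ, the WARC files first have to be pulled out of the ``archive/`` directory inside the zip. The following is a minimal sketch of one way to do that with Python's standard library; the helper name ``extract_warcs`` and the example paths are hypothetical and are not part of pywb or ArchiveWeb.page.

```python
import shutil
import zipfile
from pathlib import Path


def extract_warcs(wacz_path, dest_dir):
    """Copy WARC files from the archive/ directory of a WACZ into dest_dir.

    Hypothetical helper for illustration only. It assumes the WACZ is an
    ordinary zip file whose WARCs live under archive/.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(wacz_path) as wacz:
        for info in wacz.infolist():
            name = info.filename
            # Skip directories and anything outside archive/
            if info.is_dir() or not name.startswith("archive/"):
                continue
            # Keep only WARC files, compressed or not
            if not name.endswith((".warc", ".warc.gz")):
                continue
            # Write using the bare file name so no zip-internal path is reused
            target = dest / Path(name).name
            with wacz.open(info) as src, open(target, "wb") as out:
                shutil.copyfileobj(src, out)
            extracted.append(target)
    return extracted


if __name__ == "__main__":
    # Example paths are assumptions; adjust to your own download location.
    for warc in extract_warcs("my-collection.wacz", "warcs"):
        print(warc)
```

Each extracted file can then be added to an existing collection with ``wb-manager add my-web-archive <path/to/file.warc.gz>``.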


Using pywb Recorder
^^^^^^^^^^^^^^^^^^^

The core recording functionality in Webrecorder is also part of :mod:`pywb`. If you want to create a WARC locally, this can be
Recording functionality is also part of :mod:`pywb`. If you want to create a WARC locally, this can be
done by directly recording into your pywb collection:

1. Create a collection: ``wb-manager init my-web-archive`` (if you haven't already created a web archive collection)
@@ -180,6 +180,14 @@ In this configuration, the indexing happens every 10 seconds.. After 10 seconds,
``http://localhost:8080/my-web-archive/http://example.com/``
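
The replay URL above is the pywb host, followed by the collection name, followed by the original URL. As a small sketch of that pattern (the function name ``replay_url`` is hypothetical and the default host is taken from the example above):

```python
def replay_url(collection, url, base="http://localhost:8080"):
    """Build a replay URL of the form <base>/<collection>/<original url>.

    Hypothetical helper for illustration; it only joins strings and does not
    check that the collection exists or that pywb is running.
    """
    return f"{base.rstrip('/')}/{collection}/{url}"


if __name__ == "__main__":
    print(replay_url("my-web-archive", "http://example.com/"))
```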


Using Browsertrix
^^^^^^^^^^^^^^^^^

For a more automated browser-based web archiving experience, `Browsertrix <https://browsertrix.com/>`_ provides a web interface for configuring, scheduling, running, reviewing, and curating crawls of web content. Crawl activity is shown in a live screencast of the browsers used for crawling and all web archives created in Browsertrix can be easily downloaded from the application in the WACZ format.

`Browsertrix Crawler <https://crawler.docs.browsertrix.com/>`_, which provides the underlying crawling functionality of Browsertrix, can also be run standalone in a Docker container on your local computer.


HTTP/S Proxy Mode Access
------------------------
