WIP: Start adding documentation
tw4l committed Sep 26, 2024
1 parent 4dc72a9 commit 4fd3631
Showing 4 changed files with 35 additions and 2 deletions.
2 changes: 1 addition & 1 deletion docs/deploy/customization.md
@@ -149,4 +149,4 @@ Browsertrix has the ability to cryptographically sign WACZ files with [Authsign]

## Enable Open Registration

You can enable sign-ups by setting `registration_enabled` to `"1"`. Once enabled, your users can register by visiting `/sign-up`.
You can enable sign-ups by setting `registration_enabled` to `"1"`. Once enabled, your users can register by visiting `/sign-up`.
2 changes: 1 addition & 1 deletion docs/deploy/index.md
@@ -13,6 +13,6 @@ The main requirements for Browsertrix are:
- A Kubernetes Cluster
- [Helm 3](https://helm.sh/) (package manager for Kubernetes)

We have prepared a [Local Deployment Guide](local.md) which covers several options for testing Browsertrix locally on a single machine, as well as a [Production (Self-Hosted and Cloud) Deployment](remote.md) guide to help with setting up Browsertrix in different production scenarios. Information about configuring storage, crawler channels, and other details in local or production deployments is in the [Customizing Browsertrix Deployment Guide](customization.md).
We have prepared a [Local Deployment Guide](local.md) which covers several options for testing Browsertrix locally on a single machine, as well as a [Production (Self-Hosted and Cloud) Deployment](remote.md) guide to help with setting up Browsertrix in different production scenarios. Information about configuring storage, crawler channels, and other details in local or production deployments is in the [Customizing Browsertrix Deployment Guide](customization.md). Information about configuring proxies to use with Browsertrix can be found in the [Configuring Proxies](proxies.md) guide.

Details on managing org export and import for existing clusters can be found in the [Org Import & Export](admin/org-import-export.md) guide.
27 changes: 27 additions & 0 deletions docs/deploy/proxies.md
@@ -0,0 +1,27 @@
# Configuring Proxies

Browsertrix can be configured to direct crawling traffic through dedicated proxy servers, so that websites can be crawled from a specific geographic location regardless of where Browsertrix itself is deployed.

This guide covers how to set up proxy servers for use with Browsertrix, as well as how to configure Browsertrix to make those proxies available.

## Proxy Configuration

Browsertrix supports crawling through HTTP and SOCKS5 proxies, including through a SOCKS5 proxy over an SSH tunnel. For more information on what is supported in the underlying Browsertrix Crawler, see the [Browsertrix Crawler documentation](https://crawler.docs.browsertrix.com/user-guide/proxies/).

Many commercial proxy services exist. If you plan to use a commercial proxy service, skip ahead to [Browsertrix Configuration](#browsertrix-configuration) below.

To set up your own proxy server for use with Browsertrix as SOCKS5 over SSH, you first need a physical or virtual server to act as the proxy. Once you have access to this remote machine, add the public key of a public/private key pair (we recommend generating a new ECDSA key pair) to the machine's authorized keys so that it accepts SSH connections made with that key. You will supply the corresponding private key to Browsertrix in [Browsertrix Configuration](#browsertrix-configuration) below.
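
As a rough sketch of the key setup described above, the following generates a new ECDSA key pair with `ssh-keygen` and prints the public key. The key file name, the key comment, the choice of an empty passphrase, and the `proxy-user@proxy.example.com` account are all assumptions made for illustration; adapt them to your own environment and security requirements.

```shell
# Hypothetical file name for the new key pair; choose your own.
KEY_FILE="./browsertrix_proxy_ecdsa"

# Generate a new ECDSA key pair. The empty passphrase (-N "") is an
# assumption, made so the key can be used without interactive prompts.
ssh-keygen -q -t ecdsa -b 256 -N "" -C "browsertrix-proxy" -f "$KEY_FILE"

# Print the public key. This is the line to append to
# ~/.ssh/authorized_keys for the proxy user on the remote machine.
cat "$KEY_FILE.pub"

# Not run here because they need a reachable host (hypothetical address):
#   ssh-copy-id -i "$KEY_FILE.pub" proxy-user@proxy.example.com
#   ssh-keyscan proxy.example.com    # prints the host's public keys
```

The private key stays in `$KEY_FILE` and is the value later supplied to Browsertrix. The output of `ssh-keyscan` is in the known hosts file format.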

(TODO: More technical setup details as needed)

## Browsertrix Configuration

Proxies are configured in Browsertrix through a separate deployment and subchart. This enables easier updates to available proxy servers without needing to redeploy the entire Browsertrix application.

To add or update proxies in your Browsertrix deployment, modify the `btrix-proxies` section of the main Helm chart or your local override.

First, set `enabled` to `true`, which will enable deploying proxy servers.

Next, provide the details of each proxy server that you want available within Browsertrix in the `proxies` list. At minimum, each proxy needs an id, a connection string URL, a label, and a two-letter country code. If you want a particular proxy to be shared and potentially available to all organizations on a Browsertrix deployment, set `shared` to `true`. For SSH proxy servers, an `ssh_private_key` is required, and the contents of a known hosts file can additionally be provided to help secure the connection.
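
For illustration only, the snippet below writes a local override file containing one proxy entry of the kind described above. Only `btrix-proxies`, `enabled`, `proxies`, `shared`, and `ssh_private_key` are names taken from this guide. The file name, the remaining key names (`id`, `label`, `country_code`, `url`, `ssh_host_public_key`), the `ssh://` URL form, and all values are assumptions; check the chart's own `values.yaml` for the exact field names before using it.

```shell
# Write a hypothetical local override file for the btrix-proxies section.
cat > proxies-override.yaml <<'EOF'
btrix-proxies:
  enabled: true
  proxies:
    - id: example-ssh-proxy              # assumed key name
      label: Example SSH Proxy           # assumed key name
      country_code: US                   # assumed key name
      url: ssh://proxy-user@proxy.example.com   # assumed key name and URL form
      shared: true
      ssh_private_key: |
        -----BEGIN OPENSSH PRIVATE KEY-----
        (contents of the private key file)
        -----END OPENSSH PRIVATE KEY-----
      # Assumed key name for the known hosts contents:
      ssh_host_public_key: |
        (contents of a known hosts file for the proxy host)
EOF
```

Keeping proxy details in a separate override file means the private key does not have to be committed alongside the main chart values.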

Once all proxy details are set, deploy the proxies by (TODO: add these details)
6 changes: 6 additions & 0 deletions docs/user-guide/workflow-setup.md
@@ -213,6 +213,12 @@ Sets the browser's [user agent](https://developer.mozilla.org/en-US/docs/Web/HTT

Sets the browser's language setting. Useful for crawling websites that detect the browser's language setting and serve content accordingly.

### Proxy

Sets the proxy server that [Browsertrix Crawler](https://github.com/webrecorder/browsertrix-crawler) will direct traffic through while crawling. When a proxy is selected, crawled websites will see traffic as coming from the IP address of the proxy rather than from the node where Browsertrix Crawler is deployed.

This setting will only be shown if proxies are available for use.

## Scheduling

Automatically start crawls periodically on a daily, weekly, or monthly schedule.