Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
feat: Merge workflow job types (#2068)
Resolves #2073

### Changes

- Removes "URL List" and "Seeded Crawl" job type distinction and adds as additional crawl scope types instead.
- 'New Workflow' button defaults to Single Page
- 'New Workflow' dropdown includes Page Crawl (Single Page, Page List, In-Page Links) and Site Crawl (Page in Same Directory, Page on Same Domain, + Subdomains and Custom Page Prefix)
- Enables specifying `DOCS_URL` in `.env`
- Additional follow-ups in #2090, #2091
Showing 28 changed files with 911 additions and 908 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -6,83 +6,114 @@ Changes to a setting will only apply to subsequent crawls. | |
|
||
Crawl settings are shown in the crawl workflow detail **Settings** tab and in the archived item **Crawl Settings** tab. | ||
|
||
## Crawl Scope | ||
## Scope | ||
|
||
Specify the range and depth of your crawl. Different settings will be shown depending on whether you chose _URL List_ or _Site Crawl_ when creating a new workflow. | ||
Specify the range and depth of your crawl. | ||
|
||
??? example "Crawling with HTTP basic auth" | ||
|
||
Both Page List and Site Crawls support [HTTP Basic Auth](https://developer.mozilla.org/en-US/docs/Web/HTTP/Authentication) which can be provided as part of the URL, for example: `https://username:password@example.com`. | ||
|
||
**These credentials WILL BE WRITTEN into the archive.** We recommend exercising caution and only archiving with dedicated archival accounts, changing your password or deleting the account when finished. | ||
Crawl scopes are categorized as a **Page Crawl** or **Site Crawl**: | ||
|
||
### Crawl Type: Page List | ||
_Page Crawl_ | ||
: Choose one of these crawl scopes if you know the URL of every page you'd like to crawl and don't need to include any additional pages beyond one hop out. | ||
|
||
#### Page URL(s) | ||
A Page Crawl workflow can be simpler to configure, since you don't need to worry about configuring the workflow to exclude parts of the website that you may not want to archive. | ||
|
||
A list of one or more URLs that the crawler should visit and capture. | ||
??? info "Page Crawl Use Cases" | ||
- You want to archive a social media post (`Single Page`) | ||
- You have a list of URLs that you can copy-and-paste (`List of Pages`) | ||
- You want to include URLs with different domain names in the same crawl (`List of Pages`) | ||
|
||
#### Include Any Linked Page | ||
_Site Crawl_ | ||
: Choose one of these crawl scopes to have the crawler automatically find pages based on a domain name, start page URL, or directory on a website. | ||
|
||
When enabled, the crawler will visit all the links it finds within each page defined in the _Crawl URL(s)_ field. | ||
Site Crawl workflows are great for advanced use cases where you don't need (or want) to know every single URL of the website that you're archiving. | ||
|
||
??? example "Crawling tags & search queries with Page List crawls" | ||
This setting can be useful for crawling the content of specific tags or search queries. Specify the tag or search query URL(s) in the _Crawl URL(s)_ field, e.g: `https://example.com/search?q=tag`, and enable _Include Any Linked Page_ to crawl all the content present on that search query page. | ||
??? info "Site Crawl Use Cases" | ||
- You're archiving a subset of a website, like everything under _website.com/your-username_ (`Pages in Same Directory`) | ||
- You're archiving an entire website _and_ external pages linked to from the website (`Pages on Same Domain` + _Include Any Linked Page_ checked) | ||
|
||
#### Fail Crawl on Failed URL | ||
### Crawl Scope Options | ||
|
||
When enabled, the crawler will fail the entire crawl if any of the provided URLs are invalid or unsuccessfully crawled. The resulting archived item will have a status of "Failed". | ||
#### Page Crawl | ||
|
||
### Crawl Type: Site Crawl | ||
`Single Page` | ||
: Crawls a single URL and does not include any linked pages. | ||
|
||
#### Crawl Start URL | ||
`List of Pages` | ||
: Crawls only specified URLs and does not include any linked pages. | ||
|
||
This is the first page that the crawler will visit. It's important to set _Crawl Start URL_ that accurately represents the scope of the pages you wish to crawl as the _Start URL Scope_ selection will depend on this field's contents. | ||
`In-Page Links` | ||
: Crawls only the specified URL and treats linked sections of the page as distinct pages. | ||
|
||
You must specify the protocol (likely `http://` or `https://`) as a part of the URL entered into this field. | ||
Any link that begins with the _Crawl Start URL_ followed by a hashtag symbol (`#`) and then a string is considered an in-page link. This is commonly used to link to a section of a page. For example, because the "Scope" section of this guide is linked by its heading as `/user-guide/workflow-setup/#scope` it would be treated as a separate page under the _In-Page Links_ scope. | ||
|
||
#### Start URL Scope | ||
This scope can also be useful for crawling websites that are single-page applications where each page has its own hash, such as `example.com/#/blog` and `example.com/#/about`. | ||
|
||
`Hashtag Links Only` | ||
: This scope will ignore links that lead to other addresses such as `example.com/path` and will instead instruct the crawler to visit hashtag links such as `example.com/#linkedsection`. | ||
#### Site Crawl | ||
|
||
This scope can be useful for crawling certain web apps that may not use unique URLs for their pages. | ||
|
||
`Pages in the Same Directory` | ||
`Pages in Same Directory` | ||
: This scope will only crawl pages in the same directory as the _Crawl Start URL_. If `example.com/path` is set as the _Crawl Start URL_, `example.com/path/path2` will be crawled but `example.com/path3` will not. | ||
|
||
`Pages on This Domain` | ||
`Pages on Same Domain` | ||
: This scope will crawl all pages on the domain entered as the _Crawl Start URL_; however, it will ignore subdomains such as `subdomain.example.com`. | ||
|
||
`Pages on This Domain and Subdomains` | ||
`Pages on Same Domain + Subdomains` | ||
: This scope will crawl all pages on the domain and any subdomains found. If `example.com` is set as the _Crawl Start URL_, both pages on `example.com` and `subdomain.example.com` will be crawled. | ||
|
||
`Custom Page Prefix` | ||
: This scope will crawl all pages that begin with the _Crawl Start URL_ as well as pages from any URL that begins with one of the URLs listed in `Extra URL Prefixes in Scope`. | ||
|
||
#### Max Depth | ||
### Page URL(s) | ||
|
||
Only shown with a _Start URL Scope_ of `Pages on This Domain` and above, the _Max Depth_ setting instructs the crawler to stop visiting new links past a specified depth. | ||
One or more URLs of the pages to crawl. URLs must follow [valid URL syntax](https://www.w3.org/Addressing/URL/url-spec.html). For example, if you're crawling a page that can be accessed on the public internet, your URL should start with `http://` or `https://`. | ||
|
||
#### Extra URL Prefixes in Scope | ||
??? example "Crawling with HTTP basic auth" | ||
|
||
Only shown with a _Start URL Scope_ of `Custom Page Prefix`, this field accepts additional URLs or domains that will be crawled if URLs that lead to them are found. | ||
All crawl scopes support [HTTP Basic Auth](https://developer.mozilla.org/en-US/docs/Web/HTTP/Authentication) which can be provided as part of the URL, for example: `https://username:password@example.com`. | ||
|
||
**These credentials WILL BE WRITTEN into the archive.** We recommend exercising caution and only archiving with dedicated archival accounts, changing your password or deleting the account when finished. | ||
|
||
This can be useful for crawling websites that span multiple domains such as `example.org` and `example.net` | ||
### Crawl Start URL | ||
|
||
#### Include Any Linked Page ("one hop out") | ||
This is the first page that the crawler will visit. _Site Crawl_ scopes are based on this URL. | ||
|
||
When enabled, the crawler will visit all the links it finds within each page, regardless of the _Start URL Scope_ setting. | ||
### Include Any Linked Page | ||
|
||
When enabled, the crawler will visit all the links it finds within each page defined in the _Crawl URL(s)_ field. | ||
|
||
??? example "Crawling tags & search queries with Page List crawls" | ||
This setting can be useful for crawling the content of specific tags or search queries. Specify the tag or search query URL(s) in the _Crawl URL(s)_ field, e.g: `https://example.com/search?q=tag`, and enable _Include Any Linked Page_ to crawl all the content present on that search query page. | ||
|
||
### Fail Crawl on Failed URL | ||
|
||
When enabled, the crawler will fail the entire crawl if any of the provided URLs are invalid or unsuccessfully crawled. The resulting archived item will have a status of "Failed". | ||
|
||
### Max Depth in Scope | ||
|
||
Instructs the crawler to stop visiting new links past a specified depth. | ||
|
||
### Extra URL Prefixes in Scope | ||
|
||
This field accepts additional URLs or domains that will be crawled if URLs that lead to them are found. | ||
|
||
This can be useful for crawling websites that span multiple domains such as `example.org` and `example.net`. | ||
|
||
### Include Any Linked Page ("one hop out") | ||
|
||
When enabled, the crawler bypasses the _Crawl Scope_ setting to visit links it finds in each page within scope. The crawler will not visit links it finds in the pages found outside of scope (hence only "one hop out".) | ||
|
||
This can be useful for capturing links on a page that lead outside the website that is being crawled but should still be included in the archive for context. | ||
|
||
#### Check For Sitemap | ||
### Check For Sitemap | ||
|
||
When enabled, the crawler will check for a sitemap at /sitemap.xml and use it to discover pages to crawl if found. It will not crawl pages found in the sitemap that do not meet the crawl's scope settings or limits. | ||
|
||
This can be useful for discovering and capturing pages on a website that aren't linked to from the seed and which might not otherwise be captured. | ||
|
||
### Exclusions | ||
### Additional Pages | ||
|
||
A list of page URLs outside of the _Crawl Scope_ to include in the crawl. | ||
|
||
### Exclude Pages | ||
|
||
The exclusions table will instruct the crawler to ignore links it finds on pages where all or part of the link matches an exclusion found in the table. The table is only available in Page List crawls when _Include Any Linked Page_ is enabled. | ||
|
||
|
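The scope definitions in the docs diff above can be read as simple URL-matching rules. The following is a minimal sketch of that reading, not the crawler's actual implementation: the function name `in_scope`, the scope identifiers (`"in-page-links"`, `"same-directory"`, and so on), and the decision to treat the start URL's path as a directory are all assumptions made for illustration, chosen so that the results agree with the examples given in the docs text.

```python
from urllib.parse import urlsplit


def in_scope(start_url: str, link: str, scope: str, extra_prefixes=()) -> bool:
    """Hypothetical helper: decide whether `link` falls inside `scope`.

    The scope names are illustrative labels for the options described in
    the docs; they are not identifiers taken from the crawler.
    """
    start, target = urlsplit(start_url), urlsplit(link)
    start_host = (start.hostname or "").lower()
    target_host = (target.hostname or "").lower()

    if scope == "in-page-links":
        # Start URL followed by '#' and then a non-empty string.
        prefix = start_url + "#"
        return link.startswith(prefix) and len(link) > len(prefix)

    if scope == "same-directory":
        # Assumption: the start URL's path is treated as a directory, so
        # /path matches /path/path2 but not /path3 (as in the docs example).
        directory = start.path.rstrip("/") + "/"
        same_host = target_host == start_host
        return same_host and (
            target.path == start.path or target.path.startswith(directory)
        )

    if scope == "same-domain":
        # Exact host match only; subdomains are ignored.
        return target_host == start_host

    if scope == "same-domain-and-subdomains":
        return target_host == start_host or target_host.endswith("." + start_host)

    if scope == "custom-prefix":
        # Start URL prefix, or any of the extra URL prefixes in scope.
        prefixes = (start_url, *extra_prefixes)
        return any(link.startswith(p) for p in prefixes)

    raise ValueError(f"unknown scope: {scope!r}")
```

For example, `in_scope("https://example.com/path", "https://example.com/path3", "same-directory")` is false under these assumptions, while the same call with `https://example.com/path/path2` is true, mirroring the `Pages in Same Directory` description.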
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,2 +1,3 @@ | ||
API_BASE_URL= | ||
GLITCHTIP_DSN= | ||
DOCS_URL=https://docs.browsertrix.com/ | ||
GLITCHTIP_DSN= |