add an extra --postLoadDelay param to specify how many seconds to wait after page-load (#520)

but before running link extraction, text extraction, screenshots and
behaviors.

Useful for sites that load quickly but perform async loading / init
afterwards, fixes #519

A simple workaround for when it's tricky to detect when a page has
actually fully loaded. Useful for sites such as Instagram.
ikreymer authored Mar 29, 2024
1 parent ea098b6 commit 2059f2b
Showing 3 changed files with 21 additions and 0 deletions.
7 changes: 7 additions & 0 deletions docs/docs/user-guide/common-options.md
@@ -10,6 +10,13 @@ See [page.goto waitUntil options](https://pptr.dev/api/puppeteer.page.goto#remar

The `--pageLoadTimeout`/`--timeout` option sets the timeout in seconds for page load, defaulting to 90 seconds. Behaviors will run on the page once either the page load condition or the page load timeout is met, whichever happens first.

### Additional Wait

Occasionally, a page may seem to have loaded but still performs dynamic initialization or additional loading. This can be hard to detect, so the `--postLoadDelay` flag
can be used to specify additional seconds to wait after the page appears to have loaded, before moving on to post-processing actions such as link extraction, screenshotting and text extraction (see below).

(In contrast, the `--pageExtraDelay`/`--delay` option adds an extra delay after all post-load actions have taken place, and can be useful for rate-limiting.)
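
As a rough illustration of where the two delays sit relative to each other, here is a minimal sketch. It is not the crawler's actual code: `DelayOptions`, `Step`, `planPageSteps`, `runPage` and the local `sleep` helper are hypothetical names introduced for this example, and only the ordering (load, optional post-load delay, post-load actions, optional extra delay) is taken from the text above.

```typescript
// Hypothetical option names mirroring the CLI flags; values are in seconds.
interface DelayOptions {
  postLoadDelay: number;
  pageExtraDelay: number;
}

type Step =
  | { kind: "action"; name: string }
  | { kind: "sleep"; seconds: number };

// Build the ordered list of steps for one page, skipping zero-length delays.
function planPageSteps(opts: DelayOptions): Step[] {
  const steps: Step[] = [{ kind: "action", name: "page load" }];
  if (opts.postLoadDelay > 0) {
    // --postLoadDelay: wait after load, before any post-load actions.
    steps.push({ kind: "sleep", seconds: opts.postLoadDelay });
  }
  steps.push({ kind: "action", name: "post-load actions" });
  if (opts.pageExtraDelay > 0) {
    // --pageExtraDelay / --delay: wait after all post-load actions.
    steps.push({ kind: "sleep", seconds: opts.pageExtraDelay });
  }
  return steps;
}

// Resolve after the given number of seconds (assumed helper, not the crawler's).
function sleep(seconds: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, seconds * 1000));
}

// Walk the plan in order, sleeping or logging each step.
async function runPage(opts: DelayOptions): Promise<void> {
  for (const step of planPageSteps(opts)) {
    if (step.kind === "sleep") {
      await sleep(step.seconds);
    } else {
      console.log(step.name);
    }
  }
}
```

Calling `runPage({ postLoadDelay: 5, pageExtraDelay: 2 })` would wait 5 seconds between the two logged actions and 2 more seconds at the end; with both values at 0 no waiting occurs.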

## Ad Blocking

Brave Browser, the browser used by Browsertrix Crawler for crawling, has some ad and tracker blocking features enabled by default. These [Shields](https://brave.com/shields/) can be disabled or customized using [Browser Profiles](browser-profiles.md).
7 changes: 7 additions & 0 deletions src/crawler.ts
@@ -1802,6 +1802,13 @@ self.__bx_behaviors.selectMainBehavior();

await this.netIdle(page, logDetails);

// optional fixed wait after load; the param name matches the argParser option
if (this.params.postLoadDelay) {
  logger.info("Awaiting post load delay", {
    seconds: this.params.postLoadDelay,
  });
  await sleep(this.params.postLoadDelay);
}

// skip extraction if at max depth
if (seed.isAtMaxDepth(depth) || !selectorOptsList) {
logger.debug("Skipping Link Extraction, At Max Depth");
7 changes: 7 additions & 0 deletions src/util/argParser.ts
@@ -317,6 +317,13 @@ class ArgParser {
type: "number",
},

postLoadDelay: {
describe:
"If >0, amount of time to sleep (in seconds) after page has loaded, before taking screenshots / getting text / running behaviors",
default: 0,
type: "number",
},

pageExtraDelay: {
alias: "delay",
describe:
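
The `default: 0` in the new option pairs with the truthiness check in `src/crawler.ts`: when the flag is not given, the value is 0 and the delay is skipped. The sketch below illustrates that interaction only. `Params`, `optionDefaults`, `resolveParams` and `shouldWait` are hypothetical names, and the merge step is an assumption standing in for the real argument parser.

```typescript
// Hypothetical stand-in for the parsed CLI params; only the field discussed here.
interface Params {
  postLoadDelay: number;
}

// Simplified defaults table, with the same default as the option in the diff.
const optionDefaults: Params = { postLoadDelay: 0 };

// Merge user-supplied values over the defaults (assumed parsing behavior).
function resolveParams(cli: Partial<Params>): Params {
  return { ...optionDefaults, ...cli };
}

// Mirrors the truthiness guard: 0 (the default) means "do not wait".
function shouldWait(params: Params): boolean {
  return Boolean(params.postLoadDelay);
}

console.log(shouldWait(resolveParams({}))); // flag omitted
console.log(shouldWait(resolveParams({ postLoadDelay: 3 }))); // flag set to 3
```

One consequence of a truthiness guard is that any non-zero number passes it, so the ">0" in the option description is a convention for users rather than something this check enforces.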