Releases: webrecorder/browsertrix-crawler

Browsertrix Crawler v1.1.4

14 Jun 02:16
9094a83

What's Changed

Full Changelog: v1.1.3...v1.1.4

Browsertrix Crawler v1.2.0-beta.0

10 Jun 20:19
e2b4cc1
Pre-release

What's Changed

  • Bump version to 1.2.0 Beta + make draft release for each commit by @ikreymer in #582
  • Always add warcinfo records to all WARCs by @ikreymer in #556
  • Load non-HTML resources directly whenever possible by @ikreymer in #583
  • base image version bump to brave 1.66.115 by @ikreymer in #592
  • Add group policies, limit browser access to container filesystem by @vnznznz in #579
  • cleanup dockerfile + fix test by @ikreymer in #595
  • Consider disk usage of collDir instead of default /crawls by @benoit74 in #586
  • add --dryRun flag and mode by @ikreymer in #594
  • proxy: support setting proxy via --proxyServer, PROXY_SERVER env var or PROXY_HOST + PROXY_PORT env vars by @ikreymer in #589
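
The proxy item above names three ways to supply a proxy. A minimal Python sketch of how such a lookup could be resolved follows. The function name `resolve_proxy`, the precedence order (flag, then `PROXY_SERVER`, then `PROXY_HOST` + `PROXY_PORT`), and the `http://` scheme used when joining host and port are all assumptions made for illustration. They are not taken from the crawler's source.

```python
from typing import Mapping, Optional


def resolve_proxy(flag_value: Optional[str], env: Mapping[str, str]) -> Optional[str]:
    """Return a proxy URL, or None if no proxy is configured.

    Assumed (hypothetical) precedence: the --proxyServer flag value,
    then PROXY_SERVER, then PROXY_HOST combined with PROXY_PORT.
    """
    if flag_value:
        return flag_value
    if env.get("PROXY_SERVER"):
        return env["PROXY_SERVER"]
    host, port = env.get("PROXY_HOST"), env.get("PROXY_PORT")
    if host and port:
        # The http:// scheme is an assumption made for this sketch.
        return f"http://{host}:{port}"
    return None
```

Passing the environment as a plain mapping (for example `os.environ`) keeps the function easy to test without touching the real process environment.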

Full Changelog: v1.1.3...v1.2.0-beta.0

Browsertrix Crawler v1.1.3

21 May 23:48

What's Changed

  • Mention command line options when restarting by @edsu in #577
  • save state: export pending list as array of json strings + fix importing save state to support pending by @ikreymer in #576
  • Sitemap Parsing Fixes by @ikreymer in #578
  • Fix failOnFailedLimit and add tests by @tw4l in #580

Full Changelog: v1.1.2...v1.1.3

Browsertrix Crawler v1.1.2

15 May 18:09
1735c3d

What's Changed

  • improved handling of requests from workers by @ikreymer in #562
  • Skip Checking Empty Frame + eval timeout by @ikreymer in #564
  • add STORE_REGION env var to be able to specify region by @ikreymer in #565
  • PDF loading status code fix by @ikreymer in #571
  • Fix regressions with failOnFailedSeed option by @tw4l in #572
  • headers: better filtering and encoding by @ikreymer in #573

Full Changelog: v1.1.1...v1.1.2

Browsertrix Crawler v1.1.1

02 May 16:01
22b2136

What's Changed

  • Avoid crashes when navigation is interrupted while creating or editing a profile
  • profiles: ensure all page.goto() promises have at least catch block/a… by @ikreymer in #559
  • profiles: ensure initial page.load() is awaited by @ikreymer in #561

Full Changelog: v1.1.0...v1.1.1

Browsertrix Crawler v1.1.0

19 Apr 04:57
15d2b09

Major Features

Support for QA Crawling (https://crawler.docs.browsertrix.com/user-guide/qa/)

What's Changed

  • QA Crawl Support (Beta) by @ikreymer in #469
  • Use RFC2606 invalid domain names by @vnznznz in #514
  • Fixes from 1.0.3 release -> main by @ikreymer in #517
  • Unify WARC writing + CDXJ indexing into single class by @ikreymer in #507
  • upgrade puppeteer-core to 22.6.1 by @ikreymer in #516
  • avoid cloudflare detection of puppeteer when using browser profiles by @ikreymer in #518
  • add an extra --postLoadDelay param to specify how many seconds to wait after page-load by @ikreymer in #520
  • Gracefully handle non-absolute path for create-login-profile --filename by @tw4l in #521
  • Make /app world-readable to better support non-root usage by @vnznznz in #523
  • merge V1.0.4 change -> main by @ikreymer in #527
  • Revert "Make /app world-readable to better support non-root usage" by @ikreymer in #529
  • ensure all warcwriter write operations go through a queue. by @ikreymer in #528
  • qa/replay crawl loading improvements by @ikreymer in #526
  • Separate writing pages to pages.jsonl + extraPages.jsonl to use with new py-wacz by @ikreymer in #535
  • Adblock support by @ikreymer in #534
  • Remove no longer needed invalid Brave update URLs by @tw4l in #539
  • Better logging of all queue WARCWriter operations by @ikreymer in #536
  • qa: filter out non-html pages by @ikreymer in #541
  • Fix for --rolloverSize for individual WARCs in 1.x by @ikreymer in #542
  • Set mime type for html pages by @tw4l in #545
  • allow minio to connect to other regions by @mguella in #543
  • replay counts: don't filter out URLs with __wb_method to avoid dispar… by @ikreymer in #552
  • Add crawler QA docs by @tw4l in #551
  • Support site-specific wait via browsertrix-behaviors by @ikreymer in #555
  • warcinfo: fix version to 1.1 to avoid confusion (part of #553) by @ikreymer in #557
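
The "Use RFC2606 invalid domain names" item in the list above refers to names that RFC 2606 reserves so they never point at a real site. A minimal Python sketch of a check for such names follows. `is_rfc2606_reserved` is a hypothetical helper written for illustration and is not part of the crawler.

```python
# Top-level and second-level names reserved by RFC 2606.
RESERVED_TLDS = {"test", "example", "invalid", "localhost"}
RESERVED_DOMAINS = {"example.com", "example.net", "example.org"}


def is_rfc2606_reserved(hostname: str) -> bool:
    """Return True if hostname falls under a name reserved by RFC 2606."""
    # Normalize case and drop a trailing root dot, e.g. "Example.COM.".
    labels = hostname.rstrip(".").lower().split(".")
    if labels[-1] in RESERVED_TLDS:
        return True
    # Compare the last two labels against the reserved second-level names.
    return ".".join(labels[-2:]) in RESERVED_DOMAINS
```

For example, `is_rfc2606_reserved("foo.invalid")` and `is_rfc2606_reserved("www.example.com")` both hold, while a registered name such as `webrecorder.net` does not match.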

Full Changelog: v1.0.4...v1.1.0

Browsertrix Crawler 1.1.0 Beta 5

15 Apr 21:53
efebc33
Pre-release

What's Changed

  • Separate writing pages to pages.jsonl + extraPages.jsonl to use with new py-wacz by @ikreymer in #535
  • Adblock support by @ikreymer in #534
  • Remove no longer needed invalid Brave update URLs by @tw4l in #539
  • Better logging of all queue WARCWriter operations by @ikreymer in #536
  • qa: filter out non-html pages by @ikreymer in #541
  • Fix for --rolloverSize for individual WARCs in 1.x by @ikreymer in #542
  • Set mime type for html pages by @tw4l in #545

Full Changelog: v1.1.0-beta.4...v1.1.0-beta.5

v1.1.0-beta.4

05 Apr 01:14
c247189
Pre-release

What's Changed

  • Gracefully handle non-absolute path for create-login-profile --filename by @tw4l in #521
  • refactor handling of max size for html/js/css by @ikreymer in #525
  • merge V1.0.4 change -> main by @ikreymer in #527
  • ensure all warcwriter write operations go through a queue. by @ikreymer in #528
  • qa/replay crawl loading improvements by @ikreymer in #526

Full Changelog: v1.1.0-beta.3...v1.1.0-beta.4

Browsertrix Crawler v1.0.4

03 Apr 22:23
a3f93ca

What's Changed

  • refactor handling of max size for html/js/css by @ikreymer in #525
    Fix for #522, issues loading pages with large streaming js/css

Full Changelog: v1.0.3...v1.0.4

Browsertrix Crawler 1.1.0 Beta 3 (QA Support)

29 Mar 00:21

What's Changed

  • Use RFC2606 invalid domain names by @vnznznz in #514
  • Fixes from 1.0.3 release -> main by @ikreymer in #517
  • Unify WARC writing + CDXJ indexing into single class by @ikreymer in #507
  • upgrade puppeteer-core to 22.6.1 by @ikreymer in #516
  • avoid cloudflare detection of puppeteer when using browser profiles by @ikreymer in #518
  • add an extra --postLoadDelay param to specify how many seconds to wait after page-load by @ikreymer in #520

Full Changelog: v1.1.0-beta.2...v1.1.0-beta.3