Releases: webrecorder/browsertrix-crawler
Browsertrix Crawler v1.1.4
What's Changed
- tests: fix blockrules tests by @ikreymer in #603
- recorder: add missing shouldSkip() to responseReceived callback by @ikreymer in #602
- Change some logged errors to warns by @tw4l in #600
- Fix synching extraSeeds state with multiple crawler instances by @ikreymer in #605
- dependency: update RWP to 2.0.1 by @ikreymer in #610
- add undici for 1.1.4 release, to fix #606 by @ikreymer in #608
- Fix header newline escape by @ikreymer in #609
Full Changelog: v1.1.3...v1.1.4
Browsertrix Crawler v1.2.0-beta.0
What's Changed
- Bump version to 1.2.0 Beta + make draft release for each commit by @ikreymer in #582
- Always add warcinfo records to all WARCs by @ikreymer in #556
- Load non-HTML resources directly whenever possible by @ikreymer in #583
- base image version bump to brave 1.66.115 by @ikreymer in #592
- Add group policies, limit browser access to container filesystem by @vnznznz in #579
- cleanup dockerfile + fix test by @ikreymer in #595
- Consider disk usage of collDir instead of default /crawls by @benoit74 in #586
- add --dryRun flag and mode by @ikreymer in #594
- proxy: support setting proxy via --proxyServer, PROXY_SERVER env var or PROXY_HOST + PROXY_PORT env vars by @ikreymer in #589
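The proxy entry above names three ways to supply a proxy: the --proxyServer flag, the PROXY_SERVER env var, or the PROXY_HOST + PROXY_PORT pair. A minimal sketch of how such sources could be combined follows. The function name resolve_proxy, the precedence order (flag first, then PROXY_SERVER, then host + port), and the http:// scheme for the host/port pair are all assumptions for illustration, not the crawler's confirmed behavior.

```python
import os
from typing import Mapping, Optional


def resolve_proxy(
    cli_value: Optional[str],
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Pick a proxy URL from the three sources named in #589.

    Hypothetical helper: the precedence and the http:// scheme used for
    the host/port pair are assumptions, not taken from the crawler source.
    """
    # 1. An explicit --proxyServer value is assumed to win.
    if cli_value:
        return cli_value

    # 2. Otherwise fall back to the PROXY_SERVER env var.
    if env.get("PROXY_SERVER"):
        return env["PROXY_SERVER"]

    # 3. Finally, build a URL from PROXY_HOST + PROXY_PORT when both are set.
    host, port = env.get("PROXY_HOST"), env.get("PROXY_PORT")
    if host and port:
        return f"http://{host}:{port}"

    # No proxy configured.
    return None
```

Passing the environment as a mapping keeps the sketch easy to exercise without touching the real process environment, for example resolve_proxy(None, {"PROXY_HOST": "proxy.example", "PROXY_PORT": "3128"}).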
Full Changelog: v1.1.3...v1.2.0-beta.0
Browsertrix Crawler v1.1.3
What's Changed
- Mention command line options when restarting by @edsu in #577
- save state: export pending list as array of json strings + fix importing save state to support pending by @ikreymer in #576
- Sitemap Parsing Fixes by @ikreymer in #578
- Fix failOnFailedLimit and add tests by @tw4l in #580
Full Changelog: v1.1.2...v1.1.3
Browsertrix Crawler v1.1.2
What's Changed
- improved handling of requests from workers: by @ikreymer in #562
- Skip Checking Empty Frame + eval timeout by @ikreymer in #564
- add STORE_REGION env var to be able to specify region by @ikreymer in #565
- PDF loading status code fix by @ikreymer in #571
- Fix regressions with failOnFailedSeed option by @tw4l in #572
- headers: better filtering and encoding by @ikreymer in #573
Full Changelog: v1.1.1...v1.1.2
Browsertrix Crawler v1.1.1
What's Changed
- Avoid crashes when editing / creating profile and navigation is interrupted
- profiles: ensure all page.goto() promises have at least catch block/a… by @ikreymer in #559
- profiles: ensure initial page.load() is awaited by @ikreymer in #561
Full Changelog: v1.1.0...v1.1.1
Browsertrix Crawler v1.1.0
Major Features
Support for QA Crawling (https://crawler.docs.browsertrix.com/user-guide/qa/)
What's Changed
- QA Crawl Support (Beta) by @ikreymer in #469
- Use RFC2606 invalid domain names by @vnznznz in #514
- Fixes from 1.0.3 release -> main by @ikreymer in #517
- Unify WARC writing + CDXJ indexing into single class by @ikreymer in #507
- upgrade puppeteer-core to 22.6.1 by @ikreymer in #516
- avoid cloudflare detection of puppeteer when using browser profiles: by @ikreymer in #518
- add an extra --postLoadDelay param to specify how many seconds to wait after page-load by @ikreymer in #520
- Gracefully handle non-absolute path for create-login-profile --filename by @tw4l in #521
- Make /app world-readable to better support non-root usage by @vnznznz in #523
- merge V1.0.4 change -> main: by @ikreymer in #527
- Revert "Make /app world-readable to better support non-root usage" by @ikreymer in #529
- ensure all warcwriter write operations go through a queue. by @ikreymer in #528
- qa/replay crawl loading improvements by @ikreymer in #526
- Separate writing pages to pages.jsonl + extraPages.jsonl to use with new py-wacz by @ikreymer in #535
- Adblock support by @ikreymer in #534
- Remove no longer needed invalid Brave update URLs by @tw4l in #539
- Better logging of all queue WARCWriter operations by @ikreymer in #536
- qa: filter out non-html pages by @ikreymer in #541
- Fix for --rolloverSize for individual WARCs in 1.x by @ikreymer in #542
- Set mime type for html pages by @tw4l in #545
- allow minio to connect to other regions by @mguella in #543
- replay counts: don't filter out URLs with __wb_method to avoid dispar… by @ikreymer in #552
- Add crawler QA docs by @tw4l in #551
- Support site-specific wait via browsertrix-behaviors by @ikreymer in #555
- warcinfo: fix version to 1.1 to avoid confusion (part of #553) by @ikreymer in #557
Full Changelog: v1.0.4...v1.1.0
Browsertrix Crawler 1.1.0 Beta 5
What's Changed
- Separate writing pages to pages.jsonl + extraPages.jsonl to use with new py-wacz by @ikreymer in #535
- Adblock support by @ikreymer in #534
- Remove no longer needed invalid Brave update URLs by @tw4l in #539
- Better logging of all queue WARCWriter operations by @ikreymer in #536
- qa: filter out non-html pages by @ikreymer in #541
- Fix for --rolloverSize for individual WARCs in 1.x by @ikreymer in #542
- Set mime type for html pages by @tw4l in #545
Full Changelog: v1.1.0-beta.4...v1.1.0-beta.5
v1.1.0-beta.4
What's Changed
- Gracefully handle non-absolute path for create-login-profile --filename by @tw4l in #521
- refactor handling of max size for html/js/css by @ikreymer in #525
- merge V1.0.4 change -> main: by @ikreymer in #527
- ensure all warcwriter write operations go through a queue. by @ikreymer in #528
- qa/replay crawl loading improvements by @ikreymer in #526
Full Changelog: v1.1.0-beta.3...v1.1.0-beta.4
Browsertrix Crawler v1.0.4
What's Changed
- refactor handling of max size for html/js/css by @ikreymer in #525
Fix for #522, issues loading pages with large streaming js/css
Full Changelog: v1.0.3...v1.0.4
Browsertrix Crawler 1.1.0 Beta 3 (QA Support)
What's Changed
- Use RFC2606 invalid domain names by @vnznznz in #514
- Fixes from 1.0.3 release -> main by @ikreymer in #517
- Unify WARC writing + CDXJ indexing into single class by @ikreymer in #507
- upgrade puppeteer-core to 22.6.1 by @ikreymer in #516
- avoid cloudflare detection of puppeteer when using browser profiles: by @ikreymer in #518
- add an extra --postLoadDelay param to specify how many seconds to wait after page-load by @ikreymer in #520
Full Changelog: v1.1.0-beta.2...v1.1.0-beta.3