Implement WARC spout #755

Closed
sebastian-nagel opened this issue Sep 23, 2019 · 2 comments · Fixed by #799
Comments

@sebastian-nagel
Contributor

A "WARC spout" could read records from WARC files and emit tuples <URL, content, metadata> into the topology, similar to what FetcherBolt does. This would make it possible

  • to reproducibly test and benchmark the topology (parser, indexer and status update bolts) with the same "frozen" data
  • to use parts of a crawl topology to process historic data from WARC files using the same parsers and indexers
  • to easily rewrite or (re)index WARC files, cf. wish: WARCHdfsBolt with CDX index #567
@kkrugler

Just FYI, there's a Common Crawl fetcher that's part of the flink-crawler project, which we used extensively during testing as a way of having a large-scale, reproducible crawl. See https://github.com/ScaleUnlimited/flink-crawler/tree/master/src/main/java/com/scaleunlimited/flinkcrawler/fetcher/commoncrawl

@sebastian-nagel
Contributor Author

Thanks, @kkrugler: a fetcher which gets the content from WARC files via a CDX index would be a further step. For now, my idea was simpler: just consume a list of WARC files and emit all records into the topology. It's just a "replay" of an already recorded crawl.
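
To make the "replay" idea concrete, here is a minimal sketch in plain Java. It is not the WARCSpout that was later merged, and it uses neither Storm nor jwarc. The class `WarcReplaySketch` and its `Capture` type are hypothetical names introduced only for illustration. The sketch assumes an uncompressed WARC stream, treats header names as case-sensitive, ignores header continuation lines, and hands over the whole record block (HTTP headers plus body) as the content.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: collects the "response" records of an uncompressed WARC stream. */
class WarcReplaySketch {

    /** Stand-in for the <URL, content, metadata> tuple; not a StormCrawler class. */
    static final class Capture {
        final String url;
        final byte[] content;
        final Map<String, String> metadata;

        Capture(String url, byte[] content, Map<String, String> metadata) {
            this.url = url;
            this.content = content;
            this.metadata = metadata;
        }
    }

    /** Reads one line (CR characters dropped); returns null at end of stream. */
    static String readLine(InputStream in) throws IOException {
        int b = in.read();
        if (b == -1) {
            return null;
        }
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        while (b != -1 && b != '\n') {
            if (b != '\r') {
                buf.write(b);
            }
            b = in.read();
        }
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Walks the stream record by record and keeps every response record that has a target URI. */
    static List<Capture> replay(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        List<Capture> captures = new ArrayList<>();
        String line;
        while ((line = readLine(data)) != null) {
            if (!line.startsWith("WARC/")) {
                continue; // blank separator lines between records
            }
            // WARC header fields, terminated by an empty line
            Map<String, String> headers = new LinkedHashMap<>();
            while ((line = readLine(data)) != null && !line.isEmpty()) {
                int colon = line.indexOf(':');
                if (colon > 0) {
                    headers.put(line.substring(0, colon).trim(), line.substring(colon + 1).trim());
                }
            }
            // The record block is exactly Content-Length bytes long
            byte[] block = new byte[Integer.parseInt(headers.getOrDefault("Content-Length", "0"))];
            data.readFully(block);
            String url = headers.get("WARC-Target-URI");
            if ("response".equals(headers.get("WARC-Type")) && url != null) {
                captures.add(new Capture(url, block, headers));
            }
        }
        return captures;
    }
}
```

A spout built along these lines would call something like `replay` for each file in its input list and emit one tuple per `Capture`, rather than collecting them in a list first.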

sebastian-nagel added a commit to sebastian-nagel/storm-crawler that referenced this issue May 27, 2020
sebastian-nagel added a commit to sebastian-nagel/storm-crawler that referenced this issue May 27, 2020
@jnioche jnioche added this to the 1.17 milestone Jun 15, 2020
sebastian-nagel added a commit to sebastian-nagel/storm-crawler that referenced this issue Jun 21, 2020
sebastian-nagel added a commit to sebastian-nagel/storm-crawler that referenced this issue Jun 21, 2020
- add WARCSpout to warc module README
- refactor WARCSpout
- add WARC record location to metadata:
  warc.file.name, warc.record.offset and warc.record.length
- upgrade jwarc dependency to 0.12.0
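
As an illustration of what the three location keys make possible, the sketch below reads a single record back from disk. It rests on assumptions the commit message does not confirm: that `warc.file.name` is a path that can be opened locally, and that `warc.record.offset` and `warc.record.length` count bytes as stored in the file (so, for a gzipped WARC, the compressed member). The class `WarcRecordLocator` is a hypothetical helper, not part of StormCrawler or jwarc.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.util.Map;
import java.util.zip.GZIPInputStream;

/** Hypothetical helper: fetches one record via warc.file.name / warc.record.offset / warc.record.length. */
class WarcRecordLocator {

    /** Seeks to the record offset and reads exactly the stored number of bytes. */
    static byte[] readRawRecord(Map<String, String> metadata) throws IOException {
        String file = metadata.get("warc.file.name"); // assumed to be a local path
        long offset = Long.parseLong(metadata.get("warc.record.offset"));
        int length = Integer.parseInt(metadata.get("warc.record.length"));
        byte[] raw = new byte[length];
        try (RandomAccessFile in = new RandomAccessFile(file, "r")) {
            in.seek(offset);
            in.readFully(raw);
        }
        return raw;
    }

    /** Decompresses the bytes if they start with the gzip magic number, else returns them unchanged. */
    static byte[] gunzipIfNeeded(byte[] raw) throws IOException {
        boolean gzipped = raw.length >= 2 && (raw[0] & 0xff) == 0x1f && (raw[1] & 0xff) == 0x8b;
        if (!gzipped) {
            return raw;
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(raw))) {
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        }
        return out.toByteArray();
    }
}
```

Under these assumptions, a downstream bolt could re-read the original capture with `gunzipIfNeeded(readRawRecord(metadata))` without scanning the whole WARC file.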
sebastian-nagel added a commit to sebastian-nagel/storm-crawler that referenced this issue Jun 21, 2020
- after emitting a fetched tuple: sleep to avoid overflowing the topology
  queues (configurable via `warc.spout.emit.fetched.sleep.ms`)
- sleep 1 microsec. after "failed" fetches (HTTP status != 200)
sebastian-nagel added a commit to sebastian-nagel/storm-crawler that referenced this issue Jun 23, 2020
- remove spout-internal sleep
- upgrade jwarc dependency to 0.13.0
sebastian-nagel added a commit to sebastian-nagel/storm-crawler that referenced this issue Jun 24, 2020
jnioche pushed a commit that referenced this issue Jun 25, 2020
* WARC spout to emit captures into topology (implements #755)

* WARC spout to emit captures into topology (implements #755)
- add WARCSpout to warc module README
- refactor WARCSpout
- add WARC record location to metadata:
  warc.file.name, warc.record.offset and warc.record.length
- upgrade jwarc dependency to 0.12.0

* WARC spout to emit captures into topology (implements #755)
- after emitting a fetched tuple: sleep to avoid overflowing the topology
  queues (configurable via `warc.spout.emit.fetched.sleep.ms`)
- sleep 1 microsec. after "failed" fetches (HTTP status != 200)

* WARC record types (request, response, etc.) as String constants

* WARC spout to emit captures into topology (implements #755)
- remove spout-internal sleep
- upgrade jwarc dependency to 0.13.0

* WARC spout to emit captures into topology (implements #755)
- emit content tuple with ID (URL)
@jnioche jnioche linked a pull request Jun 25, 2020 that will close this issue
@jnioche jnioche closed this as completed Jun 25, 2020
3 participants