Implement WARC spout #755
Comments
Just FYI, there's a Common Crawl fetcher that's part of the flink-crawler project, which we used extensively during testing as a way of having a large-scale, reproducible crawl. See https://github.com/ScaleUnlimited/flink-crawler/tree/master/src/main/java/com/scaleunlimited/flinkcrawler/fetcher/commoncrawl
Thanks, @kkrugler: a fetcher that reads content from WARC files via a CDX index would be a further step. For now, my idea is simpler: just consume a list of WARC files and emit all records into the topology. It's just a "replay" of an already recorded crawl.
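To illustrate the "replay" idea, here is a minimal sketch that reads an uncompressed WARC stream and collects one capture per `response` record. It is not the actual WARCSpout: the class `WarcReplaySketch` and its `Capture` type are hypothetical, it uses only the Java standard library instead of jwarc, and it assumes header names appear in their canonical case.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified replay of uncompressed WARC records (not the real WARCSpout). */
final class WarcReplaySketch {

    /** One replayed capture: target URL, raw record block, WARC headers as metadata. */
    static final class Capture {
        final String url;
        final byte[] content; // for "response" records: HTTP status line, headers and body
        final Map<String, String> metadata;

        Capture(String url, byte[] content, Map<String, String> metadata) {
            this.url = url;
            this.content = content;
            this.metadata = metadata;
        }
    }

    /** Reads all records from the stream and returns the "response" records as captures. */
    static List<Capture> replay(InputStream in) throws IOException {
        List<Capture> captures = new ArrayList<>();
        String line;
        while ((line = readLine(in)) != null) {
            if (!line.startsWith("WARC/")) {
                continue; // skip the blank lines that separate records
            }
            // Collect the WARC header fields up to the empty line that ends them.
            Map<String, String> headers = new LinkedHashMap<>();
            while ((line = readLine(in)) != null && !line.isEmpty()) {
                int colon = line.indexOf(':');
                if (colon > 0) {
                    headers.put(line.substring(0, colon).trim(), line.substring(colon + 1).trim());
                }
            }
            // The record block is exactly Content-Length bytes long.
            int length = Integer.parseInt(headers.getOrDefault("Content-Length", "0"));
            byte[] block = in.readNBytes(length);
            if ("response".equals(headers.get("WARC-Type"))) {
                captures.add(new Capture(headers.get("WARC-Target-URI"), block, headers));
            }
        }
        return captures;
    }

    /** Reads one line without its CRLF terminator; returns null at end of stream. */
    private static String readLine(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1 && b != '\n') {
            if (b != '\r') {
                buffer.write(b);
            }
        }
        if (b == -1 && buffer.size() == 0) {
            return null;
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

A real spout would additionally parse the HTTP status line and headers inside each block, handle gzip-compressed files, and emit the captures as tuples rather than collecting them in a list.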
- add WARCSpout to warc module README
- refactor WARCSpout
- add WARC record location to metadata: warc.file.name, warc.record.offset and warc.record.length
- upgrade jwarc dependency to 0.12.0
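As a sketch of why the record location is useful: with the file name, offset and length, a single record can be read again without scanning the whole file. The helper below is hypothetical (not part of StormCrawler or jwarc) and assumes a record-wise gzipped WARC file in which `warc.record.offset` and `warc.record.length` describe the compressed gzip member.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.zip.GZIPInputStream;

/** Hypothetical helper for re-reading a record by its location metadata. */
final class WarcRecordLocator {

    /**
     * Returns the uncompressed bytes of one record.
     * Assumption: offset and length refer to one gzip member in the compressed file.
     */
    static byte[] readRecord(String fileName, long offset, int length) throws IOException {
        byte[] compressed = new byte[length];
        try (RandomAccessFile file = new RandomAccessFile(fileName, "r")) {
            file.seek(offset);          // jump straight to the start of the record
            file.readFully(compressed); // read exactly the compressed record
        }
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return gzip.readAllBytes();
        }
    }
}
```

For example, `WarcRecordLocator.readRecord(name, offset, length)` could be called with the values stored under warc.file.name, warc.record.offset and warc.record.length, if they follow the assumption above.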
- after an emitted fetched tuple: sleep to avoid overflowing the topology queues (configurable via `warc.spout.emit.fetched.sleep.ms`)
- sleep 1 microsecond after "failed" fetches (HTTP status != 200)
- remove spout-internal sleep
- upgrade jwarc dependency to 0.13.0
- emit content tuple with ID (URL)
* WARC spout to emit captures into topology (implements #755)
* WARC spout to emit captures into topology (implements #755)
  - add WARCSpout to warc module README
  - refactor WARCSpout
  - add WARC record location to metadata: warc.file.name, warc.record.offset and warc.record.length
  - upgrade jwarc dependency to 0.12.0
* WARC spout to emit captures into topology (implements #755)
  - after an emitted fetched tuple: sleep to avoid overflowing the topology queues (configurable via `warc.spout.emit.fetched.sleep.ms`)
  - sleep 1 microsecond after "failed" fetches (HTTP status != 200)
* WARC record types (request, response, etc.) as String constants
* WARC spout to emit captures into topology (implements #755)
  - remove spout-internal sleep
  - upgrade jwarc dependency to 0.13.0
* WARC spout to emit captures into topology (implements #755)
  - emit content tuple with ID (URL)
A "WARC spout" could read records from WARC files and emit tuples <URL, content, metadata> into the topology, similar to what FetcherBolt does. This would allow