website crawler - "source" #5355

mishavay-aws · 2025-01-24T15:52:23Z

Is your feature request related to a problem? Please describe.
This is a new feature request. We'd like Data Prepper to have website crawling capabilities, so that it can crawl websites and facilitate the ingestion of web pages into OpenSearch.

Describe the solution you'd like
Introduce a "webcrawler source" that can crawl a public website either on demand or on a schedule. It should respect the website's configuration, rate-limit its requests, and support filtering (including and excluding pages). Pages that are acquired should then be stored in OpenSearch for search and discovery.
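
To make the request more concrete, here is a rough sketch of the per-URL decision such a source might make. It is only an illustration, not a proposed Data Prepper API: it assumes "the configuration of the website" means `robots.txt`, it uses Python's standard library rather than Data Prepper's Java plugin model, and the names `USER_AGENT`, `load_robots`, `should_fetch`, and `request_delay` are hypothetical.

```python
from fnmatch import fnmatchcase
from urllib.robotparser import RobotFileParser

# Hypothetical user agent string for the crawler (an assumption for this sketch).
USER_AGENT = "example-crawler"


def load_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def should_fetch(url: str, robots: RobotFileParser,
                 include: list[str], exclude: list[str]) -> bool:
    """Return True if robots.txt allows the URL and it passes the filters."""
    if not robots.can_fetch(USER_AGENT, url):
        return False
    # Exclude patterns take precedence over include patterns.
    if any(fnmatchcase(url, pattern) for pattern in exclude):
        return False
    return any(fnmatchcase(url, pattern) for pattern in include)


def request_delay(robots: RobotFileParser, default_seconds: float = 1.0) -> float:
    """Seconds to wait between requests: Crawl-delay if declared, else a default."""
    delay = robots.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default_seconds


if __name__ == "__main__":
    robots = load_robots("User-agent: *\nCrawl-delay: 2\nDisallow: /private/\n")
    include = ["https://example.com/docs/*"]
    exclude = ["*.pdf"]
    for url in ("https://example.com/docs/intro.html",
                "https://example.com/docs/manual.pdf",
                "https://example.com/private/notes.html"):
        print(url, should_fetch(url, robots, include, exclude))
    print("delay:", request_delay(robots))
```

A real source would also need scheduling, link discovery, and an OpenSearch sink, none of which are shown here. The glob-style include/exclude patterns are one possible filter syntax; regular expressions would work equally well.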

Describe alternatives you've considered (Optional)
Using a Selenium web crawler with a Chromium driver, then filtering and enriching the content before storing it in OpenSearch.

Additional context
N/A

mishavay-aws added the untriaged label Jan 24, 2025

github-project-automation bot added this to Data Prepper Tracking Board Jan 24, 2025

github-project-automation bot moved this to Unplanned in Data Prepper Tracking Board Jan 24, 2025

mishavay-aws changed the title ~~A website crawler source~~ website crawler - "source" Jan 24, 2025
