website crawler - "source" #5355

Open
mishavay-aws opened this issue Jan 24, 2025 · 0 comments


mishavay-aws commented Jan 24, 2025

Is your feature request related to a problem? Please describe.
This is a new feature request. We'd like Data Prepper to have website crawling capabilities so that it can crawl websites and ingest web pages into OpenSearch.

Describe the solution you'd like
Introduce a "webcrawler source" that can crawl a public website on demand or on a schedule, respecting the website's configuration, rate-limiting its requests, and filtering pages (include/exclude). Content from the crawled pages should then be stored in OpenSearch for search and discovery.
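
To make the requested behavior more concrete, here is a minimal sketch of the per-URL decision such a source might make. It is an illustration only, not a proposed Data Prepper API: `CrawlPolicy`, its parameters, and the example URLs are hypothetical, and it assumes that "respecting the configuration of the website" means honoring robots.txt. It uses Python's standard `urllib.robotparser` rather than any Data Prepper code.

```python
import re
import time
from urllib.robotparser import RobotFileParser


class CrawlPolicy:
    """Hypothetical helper: decides whether and when a URL may be fetched."""

    def __init__(self, robots_lines, include, exclude, min_delay_s, user_agent="*"):
        self.robots = RobotFileParser()
        # Parse robots.txt content that was already downloaded (no network here).
        self.robots.parse(robots_lines)
        self.include = [re.compile(p) for p in include]
        self.exclude = [re.compile(p) for p in exclude]
        self.min_delay_s = min_delay_s
        self.user_agent = user_agent
        self._last_fetch = None

    def allowed(self, url):
        # 1. The site's robots.txt rules take precedence.
        if not self.robots.can_fetch(self.user_agent, url):
            return False
        # 2. Any matching exclude pattern rejects the URL.
        if any(p.search(url) for p in self.exclude):
            return False
        # 3. If include patterns are given, at least one must match.
        return not self.include or any(p.search(url) for p in self.include)

    def wait_time(self, now=None):
        # Seconds to wait before the next request to stay under the rate limit.
        now = time.monotonic() if now is None else now
        if self._last_fetch is None:
            return 0.0
        return max(0.0, self.min_delay_s - (now - self._last_fetch))

    def record_fetch(self, now=None):
        self._last_fetch = time.monotonic() if now is None else now
```

A crawl loop would call `allowed(url)` before queuing a page, sleep for `wait_time()` before each request, and call `record_fetch()` afterwards. Scheduling and writing to OpenSearch are left out of this sketch.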

Describe alternatives you've considered (Optional)
Use a Selenium web crawler with a Chromium driver, then filter and enrich the content and store it in OpenSearch.

Additional context
N/A

@mishavay-aws mishavay-aws changed the title A website crawler source website crawler - "source" Jan 24, 2025