A web content scraper that uses Playwright and the Claude LLM through AWS Bedrock to extract and structure data from web pages.
Currently optimized for extracting League of Legends champion tactics articles.
- Intelligent web content extraction using Claude LLM
- Structured data output in JSONL format
- Configurable schema-based parsing
- AWS Bedrock integration
- Logging support
- Docker
- Python 3.12+
- AWS Account with Bedrock access
- AWS CLI configured
- Clone the repository:
git clone https://github.com/haandol/claude-web-scraper.git
cd claude-web-scraper
- Install dependencies:
pip install -r requirements.txt
- Install Playwright dependencies:
playwright install
- Configure environment variables:
Create a .env file in the project root with:
MODEL_ID=us.anthropic.claude-3-5-haiku-20241022-v1:0
AWS_PROFILE_NAME=your_profile_name
AWS_REGION=your_aws_region
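As an illustration of how these three variables might be consumed, here is a minimal standard-library sketch. The names `Settings`, `parse_env_file`, and `load_settings` are hypothetical and are not taken from the project's `app.py`, which may well load the file differently (for example with a dotenv library).

```python
import os
from dataclasses import dataclass

REQUIRED_KEYS = ("MODEL_ID", "AWS_PROFILE_NAME", "AWS_REGION")


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the values from the .env example."""
    model_id: str
    aws_profile_name: str
    aws_region: str


def parse_env_file(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def load_settings(env=None) -> Settings:
    """Build Settings from a mapping (defaults to os.environ); fail early if a key is missing."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return Settings(
        model_id=env["MODEL_ID"],
        aws_profile_name=env["AWS_PROFILE_NAME"],
        aws_region=env["AWS_REGION"],
    )
```

Failing early on a missing key tends to give a clearer error than a later authentication or model-lookup failure.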
- Open app.py and modify url, OutputSchema, and instruction.
- Run the scraper:
python app.py
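To give a rough idea of what the three values edited in app.py could look like, here is a hedged sketch. The URL, the field names, and the helpers `schema_field_names` and `build_prompt` are all placeholders invented for illustration; the actual `OutputSchema` in the repository may be defined differently (for instance as a Pydantic model) and with other fields.

```python
from dataclasses import dataclass, fields

# Placeholder URL, not the page the project actually targets.
url = "https://example.com/champion-tactics"


@dataclass
class OutputSchema:
    """Hypothetical schema: one record per extracted article."""
    title: str
    champion: str
    summary: str


# Hypothetical instruction text passed to the model along with the page content.
instruction = "Extract every champion tactics article on the page."


def schema_field_names(schema) -> list:
    """List the field names so they can be spelled out in the prompt."""
    return [f.name for f in fields(schema)]


def build_prompt(instruction: str, schema, page_text: str) -> str:
    """Combine the instruction, the expected fields, and the page text into one prompt."""
    field_list = ", ".join(schema_field_names(schema))
    return (
        f"{instruction}\n"
        f"Return one JSON object per article with these fields: {field_list}.\n"
        f"Page content:\n{page_text}"
    )
```

Keeping the schema separate from the instruction means the target fields can change without rewriting the prompt text.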
The script will:
- Crawl the specified webpage
- Extract article information using Claude LLM
- Save the results in output/output.jsonl, unless you specify a different path.
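The JSONL output step can be sketched as follows: one JSON object per line, written to output/output.jsonl by default. The function names `write_jsonl` and `read_jsonl` are assumptions for illustration, not the project's own helpers, and the record fields in the usage note are made up.

```python
import json
from pathlib import Path


def write_jsonl(records, path="output/output.jsonl") -> Path:
    """Write each record as one JSON object per line, creating the folder if needed."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("w", encoding="utf-8") as f:
        for record in records:
            # ensure_ascii=False keeps non-ASCII text readable in the file.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return out


def read_jsonl(path) -> list:
    """Read the file back into a list of dicts, ignoring blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

For example, `write_jsonl([{"title": "Example article"}])` would create the output folder if it does not exist and write a single line to the default path.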
This project is licensed under the MIT License - see the LICENSE file for details.