ClaudeCrawl is a clone of FireCrawl's web scraping feature

haandol/claudecrawl

ClaudeCrawl

A web content scraper that uses Playwright and the Claude LLM through AWS Bedrock to extract and structure data from web pages.

Currently optimized for extracting League of Legends champion tactics articles.

Simple Overview

Features

  • Intelligent web content extraction using Claude LLM
  • Structured data output in JSONL format
  • Configurable schema-based parsing
  • AWS Bedrock integration
  • Logging support

Prerequisites

  • Docker
  • Python 3.12+
  • AWS Account with Bedrock access
  • AWS CLI configured

Installation

  1. Clone the repository:
       git clone https://github.com/haandol/claude-web-scraper.git
       cd claude-web-scraper
  2. Install dependencies:
       pip install -r requirements.txt
  3. Install the Playwright browser binaries:
       playwright install
  4. Configure environment variables. Create a .env file in the project root with:
       MODEL_ID=us.anthropic.claude-3-5-haiku-20241022-v1:0
       AWS_PROFILE_NAME=your_profile_name
       AWS_REGION=your_aws_region
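
Before the first run, it can help to confirm that the .env file defines all three variables. The following is a minimal preflight sketch using only the standard library. The helper names (parse_env_text, missing_keys, check_env_file) are hypothetical and not part of this repository, and how app.py itself loads the file is not shown here.

```python
from pathlib import Path

# The three variables the README asks you to define in .env.
REQUIRED_KEYS = ("MODEL_ID", "AWS_PROFILE_NAME", "AWS_REGION")


def parse_env_text(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: dict[str, str]) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


def check_env_file(path: str = ".env") -> list[str]:
    """Return the required keys missing from the given .env file."""
    env_path = Path(path)
    if not env_path.exists():
        return list(REQUIRED_KEYS)
    return missing_keys(parse_env_text(env_path.read_text(encoding="utf-8")))


if __name__ == "__main__":
    missing = check_env_file()
    print("Missing keys:", ", ".join(missing) if missing else "none")
```

An empty result means all three variables are present. It does not verify that the AWS profile exists or that the account has Bedrock access to the model.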

Usage

  1. Open app.py and modify url, OutputSchema, and instruction to match the page you want to scrape.

  2. Run the scraper:

python app.py
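
As a rough illustration of step 1, the three values might look like the sketch below. Only the names url, OutputSchema, and instruction come from this README. The field names, the placeholder URL, the use of a dataclass, and the describe_schema helper are all assumptions made for illustration; the actual app.py may define its schema differently (for example with a validation library).

```python
from dataclasses import dataclass, fields

# Placeholder target page (assumption; replace with the page you want to scrape).
url = "https://example.com/articles/some-champion-guide"


@dataclass
class OutputSchema:
    """Hypothetical fields describing one extracted article."""
    title: str
    champion: str
    summary: str


# Natural-language guidance passed to the model alongside the page content.
instruction = (
    "Extract the article title, the champion it covers, "
    "and a one-paragraph summary of the tactics."
)


def describe_schema(schema: type) -> str:
    """Render the schema's field names and types as a single line of text."""
    return ", ".join(
        f"{f.name}: {getattr(f.type, '__name__', str(f.type))}"
        for f in fields(schema)
    )


if __name__ == "__main__":
    print(describe_schema(OutputSchema))
```

Keeping the schema small and the instruction specific tends to make the extracted records easier to validate.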

The script will:

  • Crawl the specified webpage
  • Extract article information using Claude LLM
  • Save the results in output/output.jsonl, unless you specify a different path.
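
JSONL stores one JSON object per line, so results can be written and read back incrementally. The sketch below shows one way to write and reload such a file with the standard library. The helper names are hypothetical, and whether the project appends to or overwrites an existing file is an assumption (this sketch appends).

```python
import json
from pathlib import Path


def append_jsonl(records, path="output/output.jsonl"):
    """Append each record as one JSON object per line, creating folders as needed."""
    out_path = Path(path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("a", encoding="utf-8") as fh:
        for record in records:
            # ensure_ascii=False keeps non-ASCII text readable in the file.
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return out_path


def read_jsonl(path="output/output.jsonl"):
    """Load every non-empty line of a JSONL file into a list of dicts."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

For example, calling read_jsonl("output/output.jsonl") after a run returns a list with one dict per extracted article.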

License

This project is licensed under the MIT License - see the LICENSE file for details.
