Multithreaded Web Scraper with Logging

A multithreaded web scraper. Every scraped website is recorded in a log. If an error occurs or the run is cancelled, scraping can resume from the last seen page. Keywords can be configured to search for while scraping; a page's content is only downloaded if it contains those keywords.
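
The behaviour described above (keyword filtering, a log of seen pages, resuming, and a thread pool) can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the class ScraperSketch and its methods are hypothetical names, an in-memory map stands in for the network fetch so the sketch runs without Jsoup, and it assumes that a page qualifies when it contains at least one keyword (case-insensitive).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ScraperSketch {

    // Assumption: a page qualifies if it contains at least one keyword,
    // compared case-insensitively.
    static boolean containsKeyword(String pageText, List<String> keywords) {
        String haystack = pageText.toLowerCase(Locale.ROOT);
        for (String keyword : keywords) {
            if (haystack.contains(keyword.toLowerCase(Locale.ROOT))) {
                return true;
            }
        }
        return false;
    }

    // Resume support: keep only the URLs that are not yet recorded in the log.
    static List<String> pending(List<String> urls, Set<String> logged) {
        List<String> result = new ArrayList<>();
        for (String url : urls) {
            if (!logged.contains(url)) {
                result.add(url);
            }
        }
        return result;
    }

    // Visits all pending pages on a fixed-size thread pool and returns the URLs
    // whose content would be saved. "pages" maps URL -> page text and stands in
    // for the real download. "logged" must be a thread-safe set.
    static Set<String> scrape(Map<String, String> pages, Set<String> logged,
                              List<String> keywords, int threads)
            throws InterruptedException {
        Set<String> saved = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String url : pending(new ArrayList<>(pages.keySet()), logged)) {
            pool.submit(() -> {
                String text = pages.get(url); // stand-in for the network fetch
                logged.add(url);              // record the page as seen
                if (containsKeyword(text, keywords)) {
                    saved.add(url);           // only matching pages are kept
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return saved;
    }

    public static void main(String[] args) throws InterruptedException {
        Set<String> logged = ConcurrentHashMap.newKeySet();
        logged.add("http://example.com/a"); // seen in an earlier, interrupted run
        Map<String, String> pages = Map.of(
                "http://example.com/a", "Java basics",
                "http://example.com/b", "Java concurrency tips",
                "http://example.com/c", "Cooking recipes");
        System.out.println(scrape(pages, logged, List.of("java"), 2));
    }
}
```

In this sketch the page at /a is skipped because it is already in the log, and /c is visited but not saved because it contains no keyword. A real implementation would persist the log to disk (for example to log.txt) instead of keeping it in memory.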

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Prerequisites

Java 11 (previous versions have not been tested)
Jsoup
Apache Maven

Installing

Download the source code
cd mwsl
Configure the settings in /src/main/resources/config.properties
Run mvn clean install in the project root
cd target
java -classpath jarName-jar-with-dependencies.jar lenngro.mwsl.WebScraper

Note: sudo rights might be required to create new directories when saving the scraped pages.
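
As an illustration of the configuration step, the sketch below reads settings with java.util.Properties. The keys threads and keywords, the default of 4 threads, and the class ConfigSketch are assumptions made for this example; the actual keys are whatever the project's config.properties defines. The properties are parsed from a string here so the example runs on its own; a real application would typically load the file from the classpath instead.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;
import java.util.stream.Collectors;

public class ConfigSketch {

    // Parses properties-formatted text (same syntax as a .properties file).
    static Properties parse(String text) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(text));
        return props;
    }

    // Hypothetical key "keywords": a comma-separated list; blanks are dropped.
    static List<String> keywords(Properties props) {
        String raw = props.getProperty("keywords", "");
        return Arrays.stream(raw.split(","))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }

    // Hypothetical key "threads": falls back to 4 if the key is absent.
    static int threads(Properties props) {
        return Integer.parseInt(props.getProperty("threads", "4").trim());
    }

    public static void main(String[] args) throws IOException {
        Properties props = parse("threads=8\nkeywords=java, maven\n");
        System.out.println(threads(props) + " threads, keywords " + keywords(props));
    }
}
```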
