A multi-threaded web scraper. Scraped pages are recorded in a log, so after an error or a cancellation the scraper can resume from the last seen page. Keywords can optionally be set; the scraper searches for them while scraping, and a page's content is downloaded only if it contains those keywords.
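The keyword filter and the resume-from-log behaviour described above can be illustrated with a minimal sketch. It is not the project's actual code: the class and method names are hypothetical, the log is assumed to hold one URL per line, and a page is assumed to qualify when it contains at least one keyword, ignoring case (the real scraper may require all of them).

```java
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, not part of the project's source.
class ScraperSketch {

    // Assumption: a page qualifies if its text contains at least one keyword,
    // compared case-insensitively. With no keywords set, every page qualifies.
    static boolean matchesKeywords(String pageText, List<String> keywords) {
        if (keywords.isEmpty()) {
            return true;
        }
        String haystack = pageText.toLowerCase(Locale.ROOT);
        for (String keyword : keywords) {
            if (haystack.contains(keyword.toLowerCase(Locale.ROOT))) {
                return true;
            }
        }
        return false;
    }

    // Assumption: the log holds one URL per line, oldest first, so the last
    // non-blank line is the page to resume from. Returns null for an empty log.
    static String lastSeenPage(List<String> logLines) {
        for (int i = logLines.size() - 1; i >= 0; i--) {
            String line = logLines.get(i).trim();
            if (!line.isEmpty()) {
                return line;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> keywords = List.of("scraper", "crawler");
        System.out.println(matchesKeywords("A simple web Scraper in Java", keywords));

        List<String> log = List.of("https://example.org/a", "https://example.org/b", "");
        System.out.println(lastSeenPage(log));
    }
}
```

Treating an empty keyword list as "download everything" is a design choice in this sketch, picked because the description presents keywords as optional.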
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
- Java 11 (previous versions have not been tested)
- Jsoup
- Apache Maven
- Download the source code
- cd mwsl
- Configure the settings in /src/main/resources/config.properties
- Run mvn clean install in the project root
- cd target
- java -classpath jarName-jar-with-dependencies.jar lenngro.mwsl.WebScraper
Note: sudo rights might be required to create new directories when saving the scraped pages.
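The settings file named in the steps above is presumably a standard Java properties file. As a rough sketch of how such settings could be read, the example below parses properties text with java.util.Properties. The key names threads and keywords, and the class ConfigSketch, are hypothetical and only serve the illustration; check config.properties itself for the keys the scraper actually expects.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;
import java.util.stream.Collectors;

// Hypothetical helper class, not part of the project's source.
class ConfigSketch {

    // Parses properties-format text, e.g. the contents of config.properties.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Assumed key "keywords": a comma-separated list; blank entries are dropped.
    static List<String> keywords(Properties props) {
        String raw = props.getProperty("keywords", "");
        return Arrays.stream(raw.split(","))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }

    // Assumed key "threads": number of worker threads, defaulting to 1.
    static int threadCount(Properties props) {
        return Integer.parseInt(props.getProperty("threads", "1").trim());
    }

    public static void main(String[] args) {
        Properties props = parse("threads=4\nkeywords=java, scraper\n");
        System.out.println(threadCount(props) + " threads, keywords " + keywords(props));
    }
}
```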