Google Images Scraper

Google Images Scraper is a Python tool designed to scrape high-resolution images from Google Images based on provided links. It now supports multi-threading for faster scraping. This tool overcomes the limitations of some browser extensions that only download image thumbnails.

Installation

  1. Clone the repository:

    git clone https://github.com/jwiedeman/google-images-scraper.git
  2. Navigate to the project directory:

    cd google-images-scraper
  3. Create the virtual environment:

    python -m venv .venv
  4. Activate the virtual environment:

    # For Linux/macOS
    source .venv/bin/activate

    # For Windows PowerShell
    .venv\Scripts\Activate.ps1

    # For Windows Command Prompt
    .venv\Scripts\activate.bat
  5. Install the required dependencies:

    pip install -r requirements.txt

Usage

  1. Run the scraper by executing the following command:

    python main.py

    This script will fetch high-resolution images from Google Images based on the provided links using multi-threading for faster scraping.


Configuration

You can customize the behavior of the scraper by modifying the config.yaml file.

Email Configuration

  • sender_email: The email address used for sending notifications.
  • receiver_email: The email address to receive notifications.
  • sender_email_password: The password for the sender's email account.
  • send_email: Set to True to send email notifications, or False to disable them.

Note: If you want to use the email notifications functionality with a Gmail account, it's recommended to generate an App Password instead of using your account password.
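As a rough illustration of what these settings feed into, the sketch below builds and sends a notification with Python's standard smtplib. It is not the code in email_service.py: the function names, the subject line, and the Gmail SMTP host and port defaults are assumptions made for this example.

```python
import smtplib
from email.message import EmailMessage


def build_notification(sender_email, receiver_email, subject, body):
    """Assemble a plain-text notification message (hypothetical helper)."""
    msg = EmailMessage()
    msg["From"] = sender_email
    msg["To"] = receiver_email
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def send_notification(msg, sender_email, sender_email_password,
                      host="smtp.gmail.com", port=465):
    """Send the message over SSL. Host and port are assumed Gmail defaults."""
    with smtplib.SMTP_SSL(host, port) as server:
        # For Gmail, pass an App Password here rather than the account password.
        server.login(sender_email, sender_email_password)
        server.send_message(msg)


msg = build_notification(
    "sender@example.com",
    "receiver@example.com",
    "Scrape finished",
    "All queries have been processed.",
)
# Uncomment with real credentials to actually send:
# send_notification(msg, "sender@example.com", "your-app-password")
```

Keeping message construction separate from sending makes it possible to check the message contents without touching the network.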

Search Queries

  • search_queries: List of search queries to use when scraping Google Images. You can add or remove queries as needed.

Images Limit

  • images_limit: Set the maximum number of images to download per category. Google tends to load at most about 250 images, and sometimes fewer; 200 is recommended.
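The sketch below shows one way the loaded settings could be sanity-checked before a run. It assumes the configuration has already been parsed into a flat Python dict using the key names above; the actual layout of config.yaml may be nested differently, and validate_config is a hypothetical helper, not part of the project.

```python
def validate_config(cfg, max_loadable=250):
    """Return a cleaned copy of a config mapping, raising on obvious mistakes."""
    cleaned = dict(cfg)

    queries = cleaned.get("search_queries")
    if not isinstance(queries, list):
        raise ValueError("search_queries must be a list")
    # Drop blank entries and strip surrounding whitespace.
    queries = [q.strip() for q in queries if q and q.strip()]
    if not queries:
        raise ValueError("search_queries must contain at least one query")
    cleaned["search_queries"] = queries

    limit = int(cleaned.get("images_limit", 200))
    if limit < 1:
        raise ValueError("images_limit must be at least 1")
    # Google rarely loads more than about 250 results, so cap the value there.
    cleaned["images_limit"] = min(limit, max_loadable)

    cleaned["send_email"] = bool(cleaned.get("send_email", False))
    return cleaned


example = {
    "search_queries": ["red panda ", "", "snow leopard"],
    "images_limit": 400,
    "send_email": False,
}
cfg = validate_config(example)
```

Here a limit of 400 is capped at 250 and the empty query is discarded, so the scraper would not start a run that cannot be satisfied.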

Project Info

  • csv_downloads: Directory to store CSV files containing the original link to each downloaded image.
  • image_downloads: Directory to store downloaded images.
  • downloader.py: Contains class to download images using multi-threading.
  • email_service.py: Provides functionality for email notifications (if needed).
  • scraper.py: The main scraper class to initiate the scraping process with multi-threading.
  • config.yaml: Configuration file to set up email and scraping parameters.
  • link_saver.py: Handles saving image links.
  • main.py: The main entry point for running the Google Images Scraper.
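To make the role of the link CSV files concrete, here is a minimal sketch of recording image links and skipping ones that were already saved. It is an illustration only, not the contents of link_saver.py: the function names and the single-column CSV layout are assumptions.

```python
import csv
from pathlib import Path


def load_saved_links(csv_path):
    """Return the set of links already recorded in the CSV (empty if missing)."""
    path = Path(csv_path)
    if not path.exists():
        return set()
    with path.open(newline="", encoding="utf-8") as f:
        # Assumed layout: one link per row, in the first column.
        return {row[0] for row in csv.reader(f) if row}


def append_new_links(csv_path, links):
    """Append only unseen links to the CSV and return the ones that were added."""
    seen = load_saved_links(csv_path)
    new_links = []
    for link in links:
        if link not in seen:
            seen.add(link)
            new_links.append(link)
    path = Path(csv_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows([link] for link in new_links)
    return new_links
```

Calling append_new_links before downloading returns only the links that still need to be fetched, so repeated crawls of the same query do not download the same URL twice.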

Getting Started

In main.py, an instance of the Scraper class is created as follows:

    sc = Scraper(num_threads=5, show_ui=True)

  • num_threads: The number of threads, which equals the total number of browser instances. More threads generally result in faster scraping but may increase resource usage. Adjust this value based on your system's capabilities and requirements.

  • show_ui: Controls whether Selenium runs in headless mode. When set to True, the browser UI is shown during scraping. When set to False, Selenium runs headless, so the browser operates in the background without a visible window.
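The effect of num_threads can be pictured with a small thread-pool sketch. This is not the project's Scraper class: scrape_query is a placeholder standing in for "open one browser instance and collect images", and run_all is a hypothetical name.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_query(query, show_ui):
    """Placeholder worker: a real one would drive one browser instance."""
    mode = "visible" if show_ui else "headless"
    return f"{query} ({mode})"


def run_all(queries, num_threads=5, show_ui=True):
    """Process queries with at most num_threads workers running at once."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # map() yields results in input order even though work runs concurrently.
        results = pool.map(lambda q: scrape_query(q, show_ui), queries)
        return dict(zip(queries, results))


results = run_all(
    ["red panda", "snow leopard", "axolotl"],
    num_threads=2,
    show_ui=False,
)
```

With num_threads=2 and three queries, two run side by side and the third starts as soon as a worker is free, which is why raising the thread count helps only up to what the machine can handle.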

The rest of the process is straightforward:

  1. Run the scraper by executing main.py:

    python main.py
  2. The scraper will start fetching high-resolution images from Google Images based on the provided links and configurations, using the specified number of threads and UI visibility.

  3. Monitor the scraping progress and any notifications sent via email, as configured in config.yaml.


Contributing

Contributions to Google Images Scraper are welcome and encouraged! To contribute, follow these steps:

  1. Fork the repository.
  2. Create a new branch for your feature or bug fix.
  3. Make your changes and test thoroughly.
  4. Commit your changes with descriptive commit messages.
  5. Push your changes to your fork.
  6. Open a pull request, explaining the changes you've made.

License

This project is licensed under the MIT License.

Todo & Completed additions

  • Image deduplication: when saving images, we now use imagehash to ensure the same image isn't already saved.
  • Saved image names use the next index based on the count of images in the result folder, which avoids overwriting images with the same name on subsequent crawls.
  • Removed sleeps to speed up the process; the scraper now waits for element visibility and continues immediately.
  • Updated selectors for element interaction.
  • Added a new case to "scroll more": there are now 3 distinct messages we can receive that may block scrolling unless clicked.
  • ML Image augmentation export
  • Export downloaded images as a YoloV[X] dataset.
  • Speed up image downloading process
  • Check image links against the downloaded-links CSV before downloading, to avoid downloading an image and computing its hash only to discard it.
  • Import search terms via csv
  • CLI commands to clear folders, export, resume where left off
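The deduplication and index-based naming items above can be sketched as follows. Note the substitution: the project uses imagehash, while this example uses a SHA-256 digest from the standard library, which only catches byte-identical files rather than visually similar ones. The function names and the .jpg default are assumptions.

```python
import hashlib
from pathlib import Path


def next_index(folder):
    """Next file index, based on how many files the folder already holds."""
    return sum(1 for p in Path(folder).iterdir() if p.is_file())


def save_unique(data, folder, seen_hashes, ext=".jpg"):
    """Write image bytes unless identical bytes were saved before.

    Returns the new path, or None when the image is a duplicate.
    """
    digest = hashlib.sha256(data).hexdigest()
    if digest in seen_hashes:
        return None
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    # Name the file after the current file count so earlier files are kept.
    path = folder / f"{next_index(folder)}{ext}"
    path.write_bytes(data)
    seen_hashes.add(digest)
    return path
```

This version is not thread-safe: two threads could compute the same index at the same moment, so a multi-threaded downloader would need a lock around save_unique.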

Disclaimer

This program lets you download large numbers of images from Google Images. Please do not download or use any image in violation of its copyright terms.