PubMed Scraper

Welcome to the PubMed Scraper repo! This tool is designed for efficient extraction of the PubMed Central (PMC) Open Access Subset, a resource comprising millions of freely available journal articles and preprints.

These documents are distributed under licenses permitting reuse, making them an ideal dataset for academic research, data analysis, and machine learning projects. Using the PMC FTP Service, this repo bulk downloads the PMC OA subset, parses the nested XML files inside the .tar.gz archives, and stores the data in a structured PostgreSQL database, ready for analysis.

Getting Started 🚀

Ensure you have:

Python 3.11 or later 🐍
Access to a PostgreSQL database 🗄️
Required Python packages: beautifulsoup4, psycopg2-binary, python-dotenv, pydantic

Installation

Clone the repository to your local machine.
Install the required Python packages by running pip install -r requirements.txt
Set up your environment variables for database access:
- DB_USER
- DB_PASS
- DB_HOST
- DB_PORT
- DB_NAME
The database should be set up with the following table:

CREATE TABLE pub_med_papers (
    pmid TEXT PRIMARY KEY,
    title TEXT NOT NULL,
    abstract TEXT,
    full_text TEXT,
    authors TEXT
);
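As a rough illustration of how the environment variables and the table above fit together, here is a minimal sketch. It is not taken from the repo's code: build_dsn, Paper, to_row, INSERT_SQL, and the choice to join authors with "; " are all hypothetical names and assumptions made for this example.

```python
import os
from dataclasses import dataclass, field
from typing import List, Optional
from urllib.parse import quote


def build_dsn(env=os.environ):
    """Build a PostgreSQL connection URI from the DB_* variables (hypothetical helper)."""
    # Percent-encode credentials so characters like '@' or '/' do not break the URI.
    user = quote(env["DB_USER"], safe="")
    password = quote(env["DB_PASS"], safe="")
    return (
        f"postgresql://{user}:{password}"
        f"@{env['DB_HOST']}:{env['DB_PORT']}/{env['DB_NAME']}"
    )


# Parameterized insert matching the pub_med_papers columns.
# Skipping rows whose pmid already exists is an assumption, not the repo's documented behavior.
INSERT_SQL = """
INSERT INTO pub_med_papers (pmid, title, abstract, full_text, authors)
VALUES (%s, %s, %s, %s, %s)
ON CONFLICT (pmid) DO NOTHING
"""


@dataclass
class Paper:
    """Hypothetical in-memory record with one field per table column."""
    pmid: str
    title: str
    abstract: Optional[str] = None
    full_text: Optional[str] = None
    authors: List[str] = field(default_factory=list)


def to_row(paper):
    """Flatten a Paper into the parameter tuple expected by INSERT_SQL."""
    # authors is a single TEXT column, so the list is joined (assumed separator: '; ').
    return (
        paper.pmid,
        paper.title,
        paper.abstract,
        paper.full_text,
        "; ".join(paper.authors) or None,
    )
```

With psycopg2 installed, the URI can be passed to psycopg2.connect(build_dsn()) and each record written with cursor.execute(INSERT_SQL, to_row(paper)). If the variables live in a .env file, calling load_dotenv() from python-dotenv beforehand makes them visible in os.environ.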

Running the Scraper

To launch the scraper, make sure your database is up and running, then execute the main script from the command line:

python app/main.py

Result

The scraper will download the PMC OA subset, parse the XML files, and store the data in the PostgreSQL database.
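The parsing step described above can be sketched roughly as follows. This is an illustration only, not the repo's implementation: it uses the standard library's xml.etree.ElementTree rather than beautifulsoup4, the function names are hypothetical, and it assumes the articles follow the usual JATS layout (article-id, article-title, abstract, contrib, body).

```python
import tarfile
import xml.etree.ElementTree as ET


def _text(node):
    """Flatten an element's text (including inline markup) and collapse whitespace."""
    if node is None:
        return None
    return " ".join("".join(node.itertext()).split())


def parse_article(xml_bytes):
    """Extract the fields stored in pub_med_papers from one JATS article."""
    root = ET.fromstring(xml_bytes)
    authors = []
    for contrib in root.iterfind(".//contrib[@contrib-type='author']"):
        given = _text(contrib.find("name/given-names"))
        surname = _text(contrib.find("name/surname"))
        full_name = " ".join(part for part in (given, surname) if part)
        if full_name:
            authors.append(full_name)
    return {
        "pmid": _text(root.find(".//article-id[@pub-id-type='pmid']")),
        "title": _text(root.find(".//article-title")),
        "abstract": _text(root.find(".//abstract")),
        "full_text": _text(root.find(".//body")),
        "authors": authors,
    }


def iter_articles(archive):
    """Yield one parsed record per XML file found in a .tar.gz file object."""
    with tarfile.open(fileobj=archive, mode="r:gz") as tar:
        for member in tar:
            # Archives nest articles in subdirectories; keep regular XML files only.
            if member.isfile() and member.name.endswith((".xml", ".nxml")):
                yield parse_article(tar.extractfile(member).read())
```

Reading members one at a time keeps memory use bounded by the largest single article rather than by the size of the archive.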

Features ✨

Bulk Download: Automates the retrieval of bulk datasets from the PMC FTP service.
XML Parsing: Efficiently extracts key information from complex XML structures into an organized format.
Database Storage: Neatly stores extracted data in a PostgreSQL database, facilitating easy retrieval and analysis.
Deep Extraction of Nested Files: Extracts and parses nested XML files from compressed .tar.gz archives.
Postprocessing with Worker Queues: Employs a worker queue system for efficient postprocessing of downloaded content.
Efficient Database Connection Pooling: Implements a connection pool to manage database interactions efficiently, reducing connection overhead.
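To give a feel for the worker queue idea mentioned in the list above, here is a minimal sketch built on the standard library's queue and threading modules. The repo's actual queue implementation may differ; run_workers, handler, and num_workers are hypothetical names chosen for this example.

```python
import queue
import threading

_STOP = object()  # sentinel telling a worker to exit


def run_workers(items, handler, num_workers=4):
    """Process items with a pool of threads; return (results, errors)."""
    tasks = queue.Queue()
    results, errors = [], []
    lock = threading.Lock()

    def worker():
        while True:
            item = tasks.get()
            try:
                if item is _STOP:
                    return
                value = handler(item)
                with lock:
                    results.append(value)
            except Exception as exc:
                # Record the failure and keep the worker alive for later items.
                with lock:
                    errors.append((item, exc))
            finally:
                tasks.task_done()

    threads = [threading.Thread(target=worker, daemon=True) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for item in items:
        tasks.put(item)
    for _ in threads:
        tasks.put(_STOP)  # one sentinel per worker
    tasks.join()
    for thread in threads:
        thread.join()
    return results, errors
```

In a pipeline like this one, handler would typically be the step that parses a downloaded archive and writes its records to the database. Results come back in completion order, not submission order.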

Shavvimal/PubMedScraper
