This code downloads papers in parallel from ArXiv given search parameters. The papers are saved in the pdf format. The script used the ArXiv API to access the database. The code is written in Python3.
A large number of papers downloading is prohibited. Please see ArXiv API rules. Besides, using a third-party scaper (such as Scrapy library) to download papers from ArXiv is strictly prohibited. Please see the ArXiv site rule
- Create a virtualenv - How to create virtualenv
- Activate the virtualenv - 'source path/to/bin/activate'
- Run 'pip install -r requirements.txt'
After crating virtual environment, clone this repo and run:
python arXiv_pdf_downloader.py --search <searchQuery> --search_by <searchBy> --sort_by <sortBy> --max_result <maxResult> --folder_name<folderName>
Query parameters are defined below
- searchQuery: Search query, could be any string (required)
- searchBy: Seach by, choose one from (optional)
- 'ti': Title
- 'au': Author
- 'abs': Abstract
- 'co': Comment
- 'jr': Journal Reference
- 'cat': Subject Category
- 'rn': Report Number
- 'all': All of the above (default)
- sortBy: Parameter to use for sorting, choose one from (optional)
- 'relevance' (default)
- 'lastUpdatedDate'
- 'submittedDate'
- maxResult: Number of requested paper. It is possible to receive papers less than requested number is not enough papers matched with the query (optional, default=1).
- folderName: Name of the folder where downloaded paper will be stored (optional, default='fetched_pdf').
Example of valid parameters and format:
- searchQuery: 'nlp transformer'
- searchBy: 'all'
- sortBy: 'relevance'
- maxResult: 5
- folderName: 'arXiv papers'
Run testing via command
python test_pdf_downloader.py