A simple Python Scrapy project for unidiscover that scrapes university course information, including but not limited to:
- course description
- salary
- employment
The final output is a CSV table in discover_uni(1).csv.
Below is part of the first 5 rows:
| | courseidentifier | uniname | uniid | coursename | link | course_name | Study mode | Distance learning | Placement year | Year abroad |
---|---|---|---|---|---|---|---|---|---|---|
0 | 10008071/AAUNDERRADUATE5YEAR/Full-time | AA School of Architecture | 10008071 | MArch Architecture | /course-details/10008071/AAUNDERRADUATE5YEAR/Full-time | MArch Architecture | Full time | Not Available | Not Available | Not Available |
1 | 10007783/LV61/Full-time | University of Aberdeen | 10007783 | MA (Hons) Anthropology and History | /course-details/10007783/LV61/Full-time | MA (Hons) Anthropology and History | Full time | Not Available | Not Available | Optional |
2 | 10007783/LV65/Full-time | University of Aberdeen | 10007783 | MA (Hons) Anthropology and Philosophy | /course-details/10007783/LV65/Full-time | MA (Hons) Anthropology and Philosophy | Full time | Not Available | Not Available | Optional |
3 | 10007783/LR61/Full-time | University of Aberdeen | 10007783 | MA (Hons) Anthropology and French | /course-details/10007783/LR61/Full-time | MA (Hons) Anthropology and French | Full time | Not Available | Not Available | Compulsory |
4 | 10007783/LQ65/Full-time | University of Aberdeen | 10007783 | MA (Hons) Anthropology and Gaelic | /course-details/10007783/LQ65/Full-time | MA (Hons) Anthropology and Gaelic | Full time | Not Available | Not Available | Optional |
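
The columns in the sample above could be captured as a typed item. The following is only a sketch: the class name `CourseItem` and the snake_case field names are assumptions, not taken from this project's items.py. It relies on the fact that recent Scrapy versions accept dataclass objects as items, so the block itself needs nothing beyond the standard library.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class CourseItem:
    """Hypothetical item mirroring the CSV columns shown in the sample table."""

    courseidentifier: str
    uniname: str
    uniid: str
    coursename: str
    link: str
    # Optional details; None means the value was not scraped.
    study_mode: Optional[str] = None
    distance_learning: Optional[str] = None
    placement_year: Optional[str] = None
    year_abroad: Optional[str] = None


# Example built from the second row of the sample table.
row = CourseItem(
    courseidentifier="10007783/LV61/Full-time",
    uniname="University of Aberdeen",
    uniid="10007783",
    coursename="MA (Hons) Anthropology and History",
    link="/course-details/10007783/LV61/Full-time",
    study_mode="Full time",
    year_abroad="Optional",
)

# asdict() gives a plain dict, convenient for JSON or CSV export.
record = asdict(row)
```

A spider callback could yield such an object instead of a plain dict, which makes missing or misspelled fields easier to spot.
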
- Python 3.8+
- Works on Linux, Windows, macOS, BSD
The quick way:
pip install scrapy
scrapy crawl unispider -o course_data_40page.json
- output the raw data to course_data_40page.json as a JSON file
- convert the JSON file into CSV tabular format in convert json to csv.ipynb
- define output as items in items.py
- connect to a DB
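
The JSON-to-CSV conversion step can be sketched outside the notebook with the standard library alone. This is a minimal illustration, not the notebook's actual code: the function names `records_to_csv` and `convert_file` are hypothetical, and it assumes the JSON feed is a list of flat objects, which is what `scrapy crawl ... -o file.json` writes when the spider yields flat dicts.

```python
import csv
import io
import json


def records_to_csv(records):
    """Render a list of flat dicts as CSV text (hypothetical helper)."""
    # Collect column names in first-seen order, since records may differ in keys.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)

    buffer = io.StringIO()
    # restval fills columns that a given record does not have.
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def convert_file(json_path, csv_path):
    """Read a JSON feed from json_path and write it to csv_path as CSV."""
    with open(json_path, encoding="utf-8") as src:
        records = json.load(src)
    with open(csv_path, "w", newline="", encoding="utf-8") as dst:
        dst.write(records_to_csv(records))
    return len(records)
```

For example, `convert_file("course_data_40page.json", "discover_uni.csv")` would write one CSV row per scraped record; the output file name here is only an example.
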