Our scraper crawls Yale's websites to obtain the data we provide.
The scraper requires Redis as an additional dependency. On a Mac with Homebrew installed, you can get Redis with `brew install redis`. For other platforms, install Redis using this guide or by googling "install redis [your platform]".
Start the Redis server with `redis-server`, or start a daemon with `brew services start redis`.
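Before starting Celery, it can help to confirm that Redis is actually accepting connections. The following is a minimal sketch using only the standard library; it assumes Redis is listening on its default address (`127.0.0.1:6379`) without authentication, and `redis_is_up` is a hypothetical helper name, not part of this project.

```python
import socket


def redis_is_up(host="127.0.0.1", port=6379, timeout=1.0):
    """Return True if a server at host:port answers PING with +PONG.

    Assumes a default, unauthenticated Redis; a server that requires a
    password replies with an error, which is reported here as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(b"PING\r\n")  # Redis accepts this inline command form
            reply = conn.recv(64)
    except OSError:  # connection refused, timed out, host unreachable, ...
        return False
    return reply.startswith(b"+PONG")


if __name__ == "__main__":
    print("Redis reachable:", redis_is_up())
```

If this reports `False`, check that `redis-server` (or the Homebrew service) is running before launching the scraper.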
To run the scraper process locally (not necessary if you want to view the website without user data), first start the Celery task manager:
./celery.sh
In order to actually execute the scraper, visit localhost:5000/scraper and fill in the fields. To retrieve the tokens you need, you'll want to use the developer tools ("inspect element") for your browser, specifically the Network tab, to view the headers on requests made to the Face Book and Directory. See below for more information on what headers to grab.
Visit yalies.io/scraper to view the scraper interface. Obtain the relevant tokens as detailed below. Check off "Departmental" in the list of caches to use, unless you also want to scrape the departmental websites, which you probably don't as that process takes a while and those websites rarely update.
If you are denied access to the scraper page, reach out to the team leader. You will probably need to be given admin privileges through the database before you can access this page.
After you've obtained the requisite tokens, simply click "Run Scraper" and verify that the button turns green.
Open the Yale Face Book and log in if necessary. In the developer tools, choose any request and grab the `Cookie` property in its entirety.
Open the Yale Directory and log in if necessary. Perform a search, and in the developer tools, select the query to the `api` endpoint. You'll notice the `Cookie` is too long to be displayed without ellipses, so right-click and copy it elsewhere, then extract only the `_people_search_session` value. Then, grab the `X-CSRF-Token` header value.
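Picking one value out of a long copied `Cookie` string by hand is error-prone, so a small helper can do it instead. This is a sketch only: `extract_cookie_value` is a hypothetical function, and the sample cookie string is made up for illustration.

```python
def extract_cookie_value(cookie_header, name):
    """Return the value of the cookie called `name`, or None if it is absent.

    `cookie_header` is the raw text copied from the browser's developer
    tools, with or without a leading "Cookie:" label.
    """
    if cookie_header.lower().startswith("cookie:"):
        cookie_header = cookie_header[len("cookie:"):]
    for pair in cookie_header.split(";"):
        # Split on the first "=" only, so values containing "=" stay intact.
        key, sep, value = pair.strip().partition("=")
        if sep and key == name:
            return value
    return None


# Made-up example string; paste the real copied header here instead.
raw_cookie = "_ga=GA1.2.3; _people_search_session=abc123--deadbeef; theme=dark"
print(extract_cookie_value(raw_cookie, "_people_search_session"))
# prints: abc123--deadbeef
```

The printed value is what goes into the scraper form's session field; the `X-CSRF-Token` header is copied separately as described above.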
- "python objc[24386]: +[__NSCFConstantString initialize] may have been in progress in another thread when fork() was called"
  - Set the environment variable `OBJC_DISABLE_INITIALIZE_FORK_SAFETY` to `YES` (this is a macOS High Sierra workaround)
- "the client noticed that the server is not Elasticsearch and we do not support this unknown product"
  - You have the wrong version of Elasticsearch installed. Uninstall Elasticsearch with pip, then reinstall the packages in requirements.txt.
- "`np.float_` was removed in the NumPy 2.0 release. Use `np.float64` instead"
  - `pip3 install "numpy<2"`
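For the Elasticsearch and NumPy errors above, it may help to see which versions are installed in the active environment before reinstalling anything. The sketch below uses the standard library's `importlib.metadata`; the helper names are hypothetical, and it assumes the packages are published under the distribution names `numpy` and `elasticsearch`. The versions the project actually expects are whatever requirements.txt pins.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string for `package`, or None if missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def major_version(version):
    """Return the leading numeric component of a version such as '1.26.4'."""
    return int(version.split(".")[0])


for package in ("numpy", "elasticsearch"):
    print(package, installed_version(package) or "not installed")

numpy_version = installed_version("numpy")
if numpy_version is not None and major_version(numpy_version) >= 2:
    # Matches the np.float_ error described in the list above.
    print('NumPy 2.x detected; consider: pip3 install "numpy<2"')
```

Run it with the same interpreter and virtual environment that the scraper uses, since versions can differ between environments.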