Scraper

Our scraper crawls Yale's websites to obtain the data we provide.

Running the scraper on your local machine

The scraper requires Redis as an additional dependency. On a Mac with Homebrew installed, you can get Redis with brew install redis. For other platforms, install Redis using this guide or by googling "install redis [your platform]".

Start the Redis server with redis-server, or run it as a background service with brew services start redis.
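Before starting Celery, it can help to confirm that Redis is actually accepting connections. The following is a minimal sketch, not part of the project: it assumes Redis is listening on its default address (127.0.0.1, port 6379) without authentication, and the helper names is_pong and redis_is_up are hypothetical. It sends an inline PING and checks for the +PONG reply.

```python
import socket


def is_pong(reply):
    """Return True if the raw reply bytes are a Redis PONG status reply."""
    return reply.startswith(b"+PONG")


def redis_is_up(host="127.0.0.1", port=6379, timeout=1.0):
    """Send an inline PING to a Redis server and report whether it answered.

    The host and port defaults are Redis's stock defaults; adjust them if
    your local configuration differs (assumption, not taken from the project).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(b"PING\r\n")
            return is_pong(conn.recv(64))
    except OSError:
        # Connection refused, timed out, or otherwise unreachable.
        return False


if __name__ == "__main__":
    print("Redis reachable:", redis_is_up())
```

If this reports False, start redis-server (or the Homebrew service) before running ./celery.sh.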

To run the scraper process locally (not necessary if you want to view the website without user data), first start the Celery task manager:

./celery.sh

To run the scraper, visit localhost:5000/scraper and fill in the fields. To retrieve the tokens you need, open your browser's developer tools ("inspect element") and use the Network tab to view the headers on requests made to the Face Book and Directory. See below for more information on which headers to grab.

Running the scraper on production

Visit yalies.io/scraper to view the scraper interface. Obtain the relevant tokens as detailed below. In the list of caches to use, check "Departmental" unless you also want to scrape the departmental websites. You probably don't, since that process takes a while and those websites rarely update.

If you are denied access to the scraper page, reach out to the team leader. You will probably need to be given admin privileges through the database before you can access this page.

After you've obtained the requisite tokens, simply click "Run Scraper" and verify that the button turns green.

Face Book

Open the Yale Face Book and log in if necessary. In the developer tools, choose any request and copy the entire value of its Cookie header.

Directory

Open the Yale Directory and log in if necessary. Perform a search, and in the developer tools, select the request to the API endpoint. The Cookie header is too long to be displayed without ellipses, so right-click it, copy it elsewhere, and extract only the _people_search_session value. Then copy the X-CSRF-Token header value.
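Picking one value out of a long copied Cookie header by hand is error-prone, so a small helper can do it instead. This is a minimal sketch, not part of the project: the function name extract_cookie_value is hypothetical and the sample header is made up. Only the cookie name _people_search_session comes from the instructions above.

```python
def extract_cookie_value(cookie_header, name):
    """Return the value of the cookie `name` from a raw Cookie header string.

    Cookies are separated by ';' and each one is 'key=value'. Only the first
    '=' splits key from value, so values that themselves contain '=' survive.
    Returns None if the cookie is not present.
    """
    for pair in cookie_header.split(";"):
        key, sep, value = pair.strip().partition("=")
        if sep and key == name:
            return value
    return None


if __name__ == "__main__":
    # Made-up header for illustration; paste your copied Cookie header here.
    copied = "other_cookie=1; _people_search_session=abc123%3D%3D; another=2"
    print(extract_cookie_value(copied, "_people_search_session"))
```

Paste the printed value into the corresponding field of the scraper form.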

Common Issues

  • python objc[24386]: +[__NSCFConstantString initialize] may have been in progress in another thread when fork() was called
    • Set the environment variable OBJC_DISABLE_INITIALIZE_FORK_SAFETY to YES (this is a workaround for macOS High Sierra)
  • "the client noticed that the server is not Elasticsearch and we do not support this unknown product"
    • You have the wrong version of the Elasticsearch package installed. Uninstall it with pip, then reinstall the dependencies from requirements.txt
  • "np.float_ was removed in the NumPy 2.0 release. Use np.float64 instead"
    • pip3 install "numpy<2"
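
To tell ahead of time whether the NumPy issue above applies to your environment, you can check the installed version. This is a minimal sketch, not part of the project: the helper names are hypothetical, and it assumes that any NumPy release with major version 2 or higher triggers the error.

```python
from importlib import metadata


def numpy_major_version():
    """Return the installed NumPy major version, or None if NumPy is absent."""
    try:
        return int(metadata.version("numpy").split(".")[0])
    except metadata.PackageNotFoundError:
        return None


def needs_numpy_downgrade(major):
    """True if the given major version is 2 or newer (assumed incompatible)."""
    return major is not None and major >= 2


if __name__ == "__main__":
    if needs_numpy_downgrade(numpy_major_version()):
        print('NumPy 2.x detected; run: pip3 install "numpy<2"')
    else:
        print("NumPy version is fine (or NumPy is not installed).")
```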