We are using Supabase to host a PostgreSQL database. Rather than using Supabase's Python client library, we use the SQLAlchemy ORM to insert into and query the database; this lets us move to a different host when the time comes.

The database credentials are in a shared Google Drive. Contact Tung, Wilson, or Tony for access (read+write).
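As a minimal sketch of what the SQLAlchemy setup looks like (the connection string, table name, and columns below are illustrative assumptions, not the project's actual schema):

```python
from sqlalchemy import create_engine, Column, Integer, String, Text
from sqlalchemy.orm import declarative_base, Session

# Supabase exposes a standard PostgreSQL connection string; the real
# credentials live in the shared Google Drive mentioned above.
engine = create_engine("postgresql://user:password@db.example.supabase.co:5432/postgres")

Base = declarative_base()

class ParkingCode(Base):
    # Hypothetical model for illustration; the actual tables may differ.
    __tablename__ = "parking_codes"
    id = Column(Integer, primary_key=True)
    state = Column(String)
    municipality = Column(String)
    code_url = Column(Text)

Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(ParkingCode(state="CA", municipality="Davis", code_url="https://..."))
    session.commit()
    rows = session.query(ParkingCode).filter_by(state="CA").all()
```

Because the spiders only talk to SQLAlchemy, swapping Supabase for another PostgreSQL host should only require changing the connection string.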
From the top-level `web_scraping` folder (where `ls` returns `scrapy.cfg` and an inner `web_scraping` folder), run:

```
scrapy crawl [spider name]
```

For example:

```
scrapy crawl searchspider
```
Depending on the spider, a `.json` file is created or overwritten in `web_crawling/jsons`.
Currently:
- `statespider` --> `states.json` (state, url)
- `munispider` --> `municipalities.json` (state, municipality, url)
- `searchspider` --> `parking_code.json` (state, municipality, state_url, parking_code)
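To make the shape concrete, an entry in `parking_code.json` would look roughly like this (the values are made up; the keys follow the fields listed above, and `parking_code` is assumed to hold the URL found by the search):

```json
{
  "state": "California",
  "municipality": "Davis",
  "state_url": "https://library.municode.com/ca",
  "parking_code": "https://library.municode.com/ca/davis/codes/code_of_ordinances"
}
```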
The `searchspider` loops through every entry of `municipalities.json` and follows the URL for each municipality.
Using scrapy-playwright, for each request the spider (see the sketch after this list):
- waits 6 seconds for JS to load
- types a keyword into the search bar
- presses "Enter" key
- waits 6 seconds for the results to load
- sends the results page to `parse_search`, which extracts the URL containing the parking code
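A rough sketch of that flow with scrapy-playwright is below. The search-bar selector, the keyword `"parking"`, and the yielded field names are assumptions for illustration; the project's actual spider may differ.

```python
import json

import scrapy
from scrapy_playwright.page import PageMethod

SEARCH_BAR = "input[type='search']"  # assumed selector; the real one may differ

class SearchSpider(scrapy.Spider):
    """Illustrative skeleton of the flow described above, not the actual code."""
    name = "searchspider"

    def start_requests(self):
        with open("web_crawling/jsons/municipalities.json") as f:
            municipalities = json.load(f)
        for entry in municipalities:
            yield scrapy.Request(
                entry["url"],
                meta={
                    "playwright": True,
                    "playwright_page_methods": [
                        PageMethod("wait_for_timeout", 6000),       # wait 6 s for JS to load
                        PageMethod("fill", SEARCH_BAR, "parking"),  # type the keyword
                        PageMethod("press", SEARCH_BAR, "Enter"),   # submit the search
                        PageMethod("wait_for_timeout", 6000),       # wait 6 s for results
                    ],
                },
                callback=self.parse_search,
                cb_kwargs={"entry": entry},
            )

    def parse_search(self, response, entry):
        # Currently we just take the first link on the results page
        # (see "To resolve" below).
        first_link = response.css("a::attr(href)").get()
        if first_link:
            yield {
                "state": entry["state"],
                "municipality": entry["municipality"],
                "parking_code": response.urljoin(first_link),
            }
```

Running this requires the scrapy-playwright download handler and Twisted reactor to be enabled in `settings.py`, per the scrapy-playwright README.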
To resolve:
- how to find the right link to the parking code (currently we extract the first link)
- municipality URLs that redirect to a site that is not Municode
- keywords that return no results