0673 spider chi ohare noise #972
base: main
Conversation
… and end times from root page. Need to modify routines to retrieve data from subpages and format as meeting items
… to unify data from other urls to obtain minutes and agenda urls
…deally be tried in database. Fixed formatting with flake8, isort and black. Begin making tests
…d subspider should round out the code.
Thanks for the PR! This one has been challenging for a while, so appreciate you taking it on. I left some initial comments here; let me know what you think or if any of the comments aren't clear.
agency = "Chicago O'Hare Noise Compatibility Commission"
timezone = "America/Chicago"

class ChiOhareNoiseSubSpider1:
Thanks for picking this one up! Looks like it was one of the more difficult ones. I think we'll want to keep it limited to one class here though for readability and split out methods for each (something like _parse_events_start, _parse_docs_start).
I could be missing some complexity here though; is there a reason for not doing that?
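A minimal sketch of the single-class approach this comment suggests. Everything here is illustrative: the real spider would subclass CityScrapersSpider from city_scrapers_core and yield Requests, and the URL-based dispatch condition is an assumption about how the two start pages could be told apart.

```python
class ChiOhareNoiseSpider:
    """Sketch only: the real spider subclasses CityScrapersSpider."""

    start_urls = [
        "https://www.oharenoise.org/about-oncc/oncc-meetings",  # assumed events page
        "https://www.oharenoise.org/agendas-minutes",  # assumed docs page
    ]

    def parse(self, response):
        # Dispatch to one method per start URL instead of keeping
        # two separate sub-spider classes.
        if "meetings" in response.url:
            yield from self._parse_events_start(response)
        else:
            yield from self._parse_docs_start(response)

    def _parse_events_start(self, response):
        yield {"kind": "event", "url": response.url}  # placeholder item

    def _parse_docs_start(self, response):
        yield {"kind": "doc", "url": response.url}  # placeholder item


class _FakeResponse:  # stand-in so the sketch runs without Scrapy
    def __init__(self, url):
        self.url = url


results = list(
    ChiOhareNoiseSpider().parse(
        _FakeResponse("https://www.oharenoise.org/about-oncc/oncc-meetings")
    )
)
```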
"""
for item in response.css(".ev_td_li"):
    surl = self._parse_url(item)
    yield Request(url=response.urljoin(surl), callback=self._parse_details)
Might be simpler to use the response.follow shortcut method here like you did below.
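For context on this suggestion: Scrapy's response.follow accepts a relative URL (or a link selector) and resolves it against response.url before building the Request, so the explicit urljoin becomes unnecessary. A stdlib sketch of the resolution it performs, with illustrative URLs:

```python
from urllib.parse import urljoin

# response.follow(surl, callback=self._parse_details) resolves surl
# against response.url internally, equivalent to this manual urljoin:
base = "https://www.oharenoise.org/about-oncc/oncc-meetings/month.calendar/2019/02/03/-"
resolved = urljoin(base, "/about-oncc/oncc-meetings/event-detail")
```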
yield Request(url=response.urljoin(surl), callback=self._parse_details)

next_page = response.xpath("//div[@class='previousmonth']/a/@href").get()
if next_page is not None:
We might want to limit this to only calling the next page once so that we don't hammer the site with a ton of requests every time.
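One way to sketch that guard as a small pure function (the limit parameter and page counter are assumptions; in the spider this would wrap the yield of the next-page Request):

```python
def next_page_url(next_page, pages_followed, limit=1):
    """Return the URL to follow, or None once the page limit is hit.

    Sketch of the suggested guard so the spider only follows
    pagination once instead of crawling every previous month.
    """
    if next_page is None or pages_followed >= limit:
        return None
    return next_page
```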
source=response.url,
)

meeting = self._get_status(meeting)
We'll want to pass the text kwarg with the raw title here so that we can check whether or not it contains "cancelled", which is handled for you in get_status.
Also, we should make sure that get_status is assigning meeting['status'] and not overwriting the meeting variable entirely. That goes for _parse_classification as well.
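A sketch of the contract being described: the real get_status helper in city_scrapers_core returns a status string (it does not return a new meeting object), so its result should be assigned to meeting["status"]. The helper body below is a simplified stand-in, not the library implementation:

```python
def _get_status(meeting, text=""):
    """Stand-in for city_scrapers_core's get_status: returns a status
    string and checks the raw text for cancellation markers."""
    if "cancel" in text.lower():
        return "cancelled"
    return "tentative"


meeting = {"title": "Executive Committee"}
# Assign the status key rather than overwriting the meeting variable:
meeting["status"] = _get_status(meeting, text="CANCELLED Executive Committee")
```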
title=self._parse_title(response).replace("CANCELLED ", "").strip("- "),
description=self._parse_description(response),
start=stime,
end=stime + timedelta(hours=1),
We can leave this as None if we aren't parsing it and potentially pass a string to the time_notes kwarg with any details about scheduling.
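Sketched as plain kwargs (a dict stands in for the Meeting item here, and the time_notes wording is only an example):

```python
from datetime import datetime

# Leave `end` unset instead of guessing a one-hour duration, and
# record the caveat in time_notes.
meeting = dict(
    title="O'Hare Noise Compatibility Commission",
    start=datetime(2019, 2, 3, 9, 0),
    end=None,
    time_notes="End time is not listed on the calendar page",
)
```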
elif "commission" in meeting["title"].lower():
    meeting["classification"] = COMMISSION
else:
    meeting["classification"] = NOT_CLASSIFIED
Since the overall organization is a commission, it's safe to default to COMMISSION instead of NOT_CLASSIFIED.
@@ -0,0 +1,890 @@
2020-08-28 21:56:35 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: city_scrapers)
This can be removed
@@ -0,0 +1,20 @@
[]
This can be removed as well
parsed_sub_items = []
for i in range(5):
Ideally we should try not to have more than 2 test files here, and running some logic in parse in a separate method that doesn't yield a request could make that easier. That way we could call something like _parse_detail without needing to worry about testing pagination.
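A sketch of the separation being suggested: request generation stays in parse, while the item-building logic lives in a method that returns the item directly, so tests can call it on a fixture response without touching pagination. The class and attribute names are illustrative:

```python
class SpiderSketch:
    """Sketch only; the real spider subclasses CityScrapersSpider."""

    def _parse_detail(self, response):
        # Build and return the item directly (a dict stands in for
        # the Meeting item) instead of yielding follow-up Requests,
        # so a test can call this method with a saved fixture.
        return {"source": response.url}


class _FixtureResponse:  # stand-in for a file_response test fixture
    url = "https://example.com/detail"


item = SpiderSketch()._parse_detail(_FixtureResponse())
```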
def test_source():
    src = "https://www.oharenoise.org/about-oncc"
    src += "/oncc-meetings/month.calendar/2019/02/03/-"
    assert parsed_sub_items[0]["source"] == src
This one is pretty safe, and we can shorten it a bit and get at what we're testing with == test_response.url
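The shortened form of that test might look like the sketch below, where test_response stands in for the fixture response loaded at the top of the test file (the names and data here are illustrative, not the PR's actual fixtures):

```python
class _TestResponse:  # stand-in for the loaded test fixture
    url = (
        "https://www.oharenoise.org/about-oncc"
        "/oncc-meetings/month.calendar/2019/02/03/-"
    )


test_response = _TestResponse()
parsed_sub_items = [{"source": test_response.url}]  # as built from the fixture


def test_source():
    # Compare against the fixture's URL instead of a hard-coded string.
    assert parsed_sub_items[0]["source"] == test_response.url


test_source()
```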
Summary
Issue: #673
Finished creating spider for scraping data from the Chicago O'Hare Noise Compatibility Commission.
Checklist
All checks are run in GitHub Actions. You'll be able to see the results of the checks at the bottom of the pull request page after it's been opened, and you can click on any of the specific checks listed to see the output of each step and debug failures.