0673 spider chi ohare noise #972
base: main
Conversation
… and end times from root page. Need to modify routines to retrieve data from subpages and format as meeting items
… to unify data from other urls to obtain minutes and agenda urls
…deally be tried in database. Fixed formatting with flake8, isort and black. Begin making tests
…d subspider should round out the code.
Thanks for the PR! This one has been challenging for a while, so appreciate you taking it on. I left some initial comments here; let me know what you think or if any of the comments aren't clear.
agency = "Chicago O'Hare Noise Compatibility Commission"
timezone = "America/Chicago"

class ChiOhareNoiseSubSpider1:
Thanks for picking this one up! Looks like it was one of the more difficult ones. I think we'll want to keep it limited to one class here though for readability and split out methods for each (something like _parse_events_start, _parse_docs_start).
I could be missing some complexity here though; is there a reason for not doing that?
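A minimal sketch of the single-class approach this comment suggests. Everything here is illustrative: the real spider would subclass CityScrapersSpider from city_scrapers_core and yield Requests, and the URL-based dispatch condition is an assumption about how the two start pages could be told apart.

```python
class ChiOhareNoiseSpider:
    """Sketch only: the real spider subclasses CityScrapersSpider."""

    start_urls = [
        "https://www.oharenoise.org/about-oncc/oncc-meetings",  # assumed events page
        "https://www.oharenoise.org/agendas-minutes",  # assumed docs page
    ]

    def parse(self, response):
        # Dispatch to one method per start URL instead of keeping
        # two separate sub-spider classes.
        if "meetings" in response.url:
            yield from self._parse_events_start(response)
        else:
            yield from self._parse_docs_start(response)

    def _parse_events_start(self, response):
        yield {"kind": "event", "url": response.url}  # placeholder item

    def _parse_docs_start(self, response):
        yield {"kind": "doc", "url": response.url}  # placeholder item


class _FakeResponse:  # stand-in so the sketch runs without Scrapy
    def __init__(self, url):
        self.url = url


results = list(
    ChiOhareNoiseSpider().parse(
        _FakeResponse("https://www.oharenoise.org/about-oncc/oncc-meetings")
    )
)
```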
"""
for item in response.css(".ev_td_li"):
    surl = self._parse_url(item)
    yield Request(url=response.urljoin(surl), callback=self._parse_details)
Might be simpler to use the response.follow shortcut method here like you did below.
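For context on this suggestion: Scrapy's response.follow accepts a relative URL (or a link selector) and resolves it against response.url before building the Request, so the explicit urljoin becomes unnecessary. A stdlib sketch of the resolution it performs, with illustrative URLs:

```python
from urllib.parse import urljoin

# response.follow(surl, callback=self._parse_details) resolves surl
# against response.url internally, equivalent to this manual urljoin:
base = "https://www.oharenoise.org/about-oncc/oncc-meetings/month.calendar/2019/02/03/-"
resolved = urljoin(base, "/about-oncc/oncc-meetings/event-detail")
```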
yield Request(url=response.urljoin(surl), callback=self._parse_details)

next_page = response.xpath("//div[@class='previousmonth']/a/@href").get()
if next_page is not None:
We might want to limit this to only calling the next page once so that we don't hammer the site with a ton of requests every time.
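One way to sketch that guard as a small pure function (the limit parameter and page counter are assumptions; in the spider this would wrap the yield of the next-page Request):

```python
def next_page_url(next_page, pages_followed, limit=1):
    """Return the URL to follow, or None once the page limit is hit.

    Sketch of the suggested guard so the spider only follows
    pagination once instead of crawling every previous month.
    """
    if next_page is None or pages_followed >= limit:
        return None
    return next_page
```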
source=response.url,
)

meeting = self._get_status(meeting)
We'll want to pass the text kwarg with the raw title here so that we can check whether or not it contains "cancelled", which is handled for you in get_status.
Also, we should make sure that get_status is assigning meeting['status'] and not overwriting the meeting variable entirely. That goes for _parse_classification as well.
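A sketch of the contract being described: the real get_status helper in city_scrapers_core returns a status string (it does not return a new meeting object), so its result should be assigned to meeting["status"]. The helper body below is a simplified stand-in, not the library implementation:

```python
def _get_status(meeting, text=""):
    """Stand-in for city_scrapers_core's get_status: returns a status
    string and checks the raw text for cancellation markers."""
    if "cancel" in text.lower():
        return "cancelled"
    return "tentative"


meeting = {"title": "Executive Committee"}
# Assign the status key rather than overwriting the meeting variable:
meeting["status"] = _get_status(meeting, text="CANCELLED Executive Committee")
```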
title=self._parse_title(response).replace("CANCELLED ", "").strip("- "),
description=self._parse_description(response),
start=stime,
end=stime + timedelta(hours=1),
We can leave this as None if we aren't parsing it and potentially pass a string to the time_notes kwarg with any details about scheduling.
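Sketched as plain kwargs (a dict stands in for the Meeting item here, and the time_notes wording is only an example):

```python
from datetime import datetime

# Leave `end` unset instead of guessing a one-hour duration, and
# record the caveat in time_notes.
meeting = dict(
    title="O'Hare Noise Compatibility Commission",
    start=datetime(2019, 2, 3, 9, 0),
    end=None,
    time_notes="End time is not listed on the calendar page",
)
```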
elif "commission" in meeting["title"].lower():
    meeting["classification"] = COMMISSION
else:
    meeting["classification"] = NOT_CLASSIFIED
Since the overall organization is a commission, it's safe to default to COMMISSION instead of NOT_CLASSIFIED.
@@ -0,0 +1,890 @@
2020-08-28 21:56:35 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: city_scrapers)
This can be removed
@@ -0,0 +1,20 @@
[]
This can be removed as well
parsed_sub_items = []
for i in range(5):
Ideally we should try not to have more than 2 test files here, and running some logic in parse in a separate method that doesn't yield a request could make that easier. That way we could call something like _parse_detail without needing to worry about testing pagination.
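A sketch of the separation being suggested: request generation stays in parse, while the item-building logic lives in a method that returns the item directly, so tests can call it on a fixture response without touching pagination. The class and attribute names are illustrative:

```python
class SpiderSketch:
    """Sketch only; the real spider subclasses CityScrapersSpider."""

    def _parse_detail(self, response):
        # Build and return the item directly (a dict stands in for
        # the Meeting item) instead of yielding follow-up Requests,
        # so a test can call this method with a saved fixture.
        return {"source": response.url}


class _FixtureResponse:  # stand-in for a file_response test fixture
    url = "https://example.com/detail"


item = SpiderSketch()._parse_detail(_FixtureResponse())
```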
def test_source():
    src = "https://www.oharenoise.org/about-oncc"
    src += "/oncc-meetings/month.calendar/2019/02/03/-"
    assert parsed_sub_items[0]["source"] == src
This one is pretty safe, and we can shorten it a bit and get at what we're testing with == test_response.url
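The shortened form of that test might look like the sketch below, where test_response stands in for the fixture response loaded at the top of the test file (the names and data here are illustrative, not the PR's actual fixtures):

```python
class _TestResponse:  # stand-in for the loaded test fixture
    url = (
        "https://www.oharenoise.org/about-oncc"
        "/oncc-meetings/month.calendar/2019/02/03/-"
    )


test_response = _TestResponse()
parsed_sub_items = [{"source": test_response.url}]  # as built from the fixture


def test_source():
    # Compare against the fixture's URL instead of a hard-coded string.
    assert parsed_sub_items[0]["source"] == test_response.url


test_source()
```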
Summary
Issue: #673
Finished creating spider for scraping data from the Chicago O'Hare Noise Compatibility Commission.
Checklist
All checks are run in GitHub Actions. You'll be able to see the results of the checks at the bottom of the pull request page after it's been opened, and you can click on any of the specific checks listed to see the output of each step and debug failures.