0572 spider chi ssa 38 #962
…wrote spider parse_start function. WHY: wrote parse_start to extract unstructured date from page
…veloped functionality on a test set
Thanks for the PR! This is looking good so far. Let me know if any of my comments aren't clear.
def _parse_classification(self, item):
    """Parse or generate classification from allowed options."""
    return NOT_CLASSIFIED
This should be COMMISSION for all meetings on this spider.
def _parse_description(self, item):
    """Parse or generate meeting description."""
    description = ""
    return description
It's fine to just return "" instead of setting a variable first
def _parse_title(self, item):
    """Parse or generate meeting title."""
    title = "Chamber of Commerce"
    return title
Mentioned in _parse_description, but it's fine to just return the string without assigning to a variable first. It's a bit odd for SSAs, but this one should be "Commission". They're technically separate entities managed by a nonprofit.
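Taken together, the comments on the title, description, and classification amount to something like the sketch below. COMMISSION here is a local stand-in for the constant the spider would import (assumed to live in city_scrapers_core.constants; the import is not shown in this thread), and ExampleSpider is a hypothetical class name used only so the methods can run on their own.

```python
# Stand-in for the constant the spider would import. The real import path
# (assumed: city_scrapers_core.constants) is not shown in this thread.
COMMISSION = "Commission"


class ExampleSpider:
    """Hypothetical container so the methods can be called directly."""

    def _parse_title(self, item):
        """Parse or generate meeting title."""
        return "Commission"

    def _parse_description(self, item):
        """Parse or generate meeting description."""
        return ""

    def _parse_classification(self, item):
        """Parse or generate classification from allowed options."""
        return COMMISSION
```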
"name": "Northcenter Chamber of Commerce", | ||
    }

def _parse_links(self, item):
We'll need to parse a mapping of dates to relevant links from the page so that things like meeting minutes can be associated with the meetings listed. Here's an example of that:
city-scrapers/city_scrapers/spiders/chi_il_medical_district.py
Lines 109 to 140 in c6771d5
def _parse_link_date_map(self, response):
    """Generate a defaultdict mapping of meeting dates and associated links"""
    link_date_map = defaultdict(list)
    for link in response.css(
        ".vc_col-sm-4.column_container:nth-child(1) .mk-text-block.indent16"
    )[:1].css("a"):
        link_str = link.xpath("./text()").extract_first()
        link_start = self._parse_start(link_str)
        if link_start:
            link_date_map[link_start.date()].append(
                {
                    "title": re.sub(r"\s+", " ", link_str.split(" – ")[-1]).strip(),
                    "href": link.attrib["href"],
                }
            )
    for section in response.css(
        ".vc_col-sm-4.column_container:nth-child(1) .vc_tta-panel"
    ):
        year_str = section.css(".vc_tta-title-text::text").extract_first().strip()
        for section_link in section.css("p > a"):
            link_str = section_link.xpath("./text()").extract_first()
            link_dt = self._parse_start(link_str, year=year_str)
            if link_dt:
                link_date_map[link_dt.date()].append(
                    {
                        "title": re.sub(
                            r"\s+", " ", link_str.split(" – ")[-1]
                        ).strip(),
                        "href": section_link.xpath("@href").extract_first(),
                    }
                )
    return link_date_map
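Once a map like that exists, _parse_links can look up the links for a meeting by its start date. A minimal sketch, assuming the map is keyed by date objects as in the example above; the parse_links signature and the sample data are made up for illustration and are not from the PR.

```python
from collections import defaultdict
from datetime import date, datetime


def parse_links(start, link_date_map):
    """Return the links recorded for the meeting's date, or an empty list."""
    # .get avoids inserting an empty entry into the defaultdict on a miss
    return link_date_map.get(start.date(), [])


# Made-up sample data, only to show the lookup
link_date_map = defaultdict(list)
link_date_map[date(2019, 5, 14)].append(
    {"title": "Minutes", "href": "https://example.com/minutes.pdf"}
)

links = parse_links(datetime(2019, 5, 14, 19, 0), link_date_map)
no_links = parse_links(datetime(2019, 6, 11, 19, 0), link_date_map)
```

Building the map once per response and passing it in keeps the per-meeting lookup cheap and easy to test.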
    return False

def _parse_location(self, item):
    # Meetings seemingly occurred at
It looks like an abbreviated version of this is in the meeting item. Since we don't have examples of another format, we can do something like this to return a default location if "4054" is present and raise an exception otherwise:
if "4054" not in item.extract():
    raise ValueError("Meeting location has changed")
return {
    "address": "4054 N Lincoln Ave, Chicago, IL 60618",
    "name": "Northcenter Chamber of Commerce",
}
def _parse_start(self, item):
    """Parse start datetime as a naive datetime object."""
    date_str = item.extract()
It looks like we might be able to simplify this a bit, and we'll also need to handle situations where minutes may be supplied for the time. Haven't tested this, but something like this snippet could work:
item_str = item.extract()
month_day_str = re.search(r"[A-Z][a-z]{2,9} \d{1,2}", item_str).group()
year_match = re.search(r"\d{4}", item_str)
# Default to the current year if no plausible four-digit year is found
if year_match and year_match.group().startswith("20"):
    year_str = year_match.group()
else:
    year_str = str(datetime.today().year)
# The minutes portion is optional, so "7 pm" and "7:30 p.m." both match
time_match = re.search(r"\d{1,2}(:\d\d)? [apm\.]{2,4}", item_str)
time_str = "12 am"
if time_match:
    time_str = time_match.group().replace(".", "")
time_fmt = "%I %p"
if ":" in time_str:
    time_fmt = "%I:%M %p"
return datetime.strptime(f"{month_day_str} {year_str} {time_str}", f"%B %d %Y {time_fmt}")
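A quick way to check the time handling on its own is to pull it into a small standalone function. This is only a sketch: parse_time is a hypothetical helper, not part of the spider, and it treats the minutes portion as optional so that both "7 pm" and "7:30 p.m." parse.

```python
import re
from datetime import datetime, time


def parse_time(item_str):
    """Return the meeting time found in item_str, or midnight if none is found."""
    # Hour, optional ":MM", a space, then am/pm with or without periods
    time_match = re.search(r"\d{1,2}(:\d\d)? [apm.]{2,4}", item_str)
    if not time_match:
        return time(0, 0)
    time_str = time_match.group().replace(".", "")
    # Pick the strptime format based on whether minutes were supplied
    time_fmt = "%I:%M %p" if ":" in time_str else "%I %p"
    return datetime.strptime(time_str, time_fmt).time()
```

Falling back to midnight mirrors the "12 am" default in the snippet above, so a listing with no time still yields a valid start.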
Summary
Issue: #572