-
-
Notifications
You must be signed in to change notification settings - Fork 311
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix to chicago library spyder #1133
base: main
Are you sure you want to change the base?
Fix to chicago library spyder #1133
Conversation
the last item extracted causes an issue when there's multiple items with the strong css, like with the current chicago library website.
WalkthroughThe changes in the Changes
Poem
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (2)
city_scrapers/spiders/chi_library.py (2)
Line range hint
77-81
: Critical: Incomplete fix for multiple dates in strong tagsWhile the change from
[-1]
to[0]
addresses the immediate issue, it might still miss additional meeting dates when multiple strong tags are present. We should handle all dates found in strong tags.Consider this approach:
- dt_str = item.css("strong::text").extract()[0] - return datetime.datetime.strptime( - "{} {}".format(re.sub(r"[,\.]", "", dt_str), year), "%A %B %d %I %p %Y" - ) + dates = [] + for dt_str in item.css("strong::text").extract(): + try: + date = datetime.datetime.strptime( + "{} {}".format(re.sub(r"[,\.]", "", dt_str), year), + "%A %B %d %I %p %Y" + ) + dates.append(date) + except ValueError: + continue + if not dates: + raise ValueError("No valid dates found in strong tags") + return dates[0] # Return the first valid date found
Line range hint
83-102
: Improve robustness of link validationThe current implementation could benefit from better error handling and performance optimizations.
Consider these improvements:
def _parse_links(self, start_time): """Check if agenda and minutes are valid URLs, add to documents if so""" agenda_url = ( "https://www.chipublib.org/news/board-of-directors-" "meeting-agenda-{}-{date.day}-{date.year}/" ).format( start_time.strftime("%B").lower(), date=start_time, ) minutes_url = agenda_url.replace("agenda", "minutes") - agenda_res = self.session.get(agenda_url) - minutes_res = self.session.get(minutes_url) + documents = [] + timeout = 10 # seconds + + def check_url(url, doc_type): + try: + res = self.session.get(url, timeout=timeout) + if res.status_code == 200: + documents.append({"href": url, "title": doc_type}) + except (requests.RequestException, requests.Timeout): + self.logger.warning(f"Failed to fetch {doc_type} URL: {url}") + + check_url(agenda_url, "Agenda") + check_url(minutes_url, "Minutes") - documents = [] - if agenda_res.status_code == 200: - documents.append({"href": agenda_url, "title": "Agenda"}) - if minutes_res.status_code == 200: - documents.append({"href": minutes_url, "title": "Minutes"}) return documents
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
city_scrapers/spiders/chi_library.py
(1 hunks)
🔇 Additional comments (2)
city_scrapers/spiders/chi_library.py (2)
Line range hint 16-21
: LGTM: Good practice to inject the session dependency
The session injection allows for better testing and performance through connection reuse.
Line range hint 1-102
: Verify integration with the scraping framework
Let's ensure the changes are properly integrated with the framework and covered by tests.
✅ Verification successful
Implementation is properly integrated and well-tested
The verification shows:
- Complete test coverage exists in
tests/test_chi_library.py
with comprehensive test cases for all attributes - The spider follows the framework's patterns by:
- Properly extending
CityScrapersSpider
- Using framework constants (BOARD, TENTATIVE)
- Following the expected Meeting item structure
- Including proper mocking and test utilities from the core framework
- Properly extending
- The implementation matches the framework's requirements for parsing meetings and yielding standardized Meeting items
The changes are well-integrated with the scraping framework and have thorough test coverage.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for spider usage and test coverage
# Check for spider usage in tests
echo "Checking for test coverage..."
rg -l "ChiLibrarySpider" tests/
# Check for similar patterns in other spiders
echo "Checking for similar patterns in other spiders..."
rg -l "strong::text" city_scrapers/spiders/
# Check for existing test cases with multiple dates
echo "Checking for test cases with multiple dates..."
rg -A 5 "multiple.*dates" tests/
Length of output: 454
Script:
#!/bin/bash
# Let's examine the test file and spider implementation patterns
# Check the test implementation
echo "Examining test implementation..."
cat tests/test_chi_library.py
# Check for similar spider patterns in the framework
echo "Checking spider patterns..."
ast-grep --pattern 'class $_Spider(CityScrapersSpider) {
$$$
def parse($_, $_) {
$$$
}
$$$
}'
# Check core spider implementation
echo "Checking core spider implementation..."
fd "core" --type f --exec rg -l "CityScrapersSpider"
Length of output: 2611
Summary
Issue: #N/A
There's an issue with the Chicago Library spyder where if there's multiple dates with the strong tag then it'd go with the last one which currently causes an issue with the website if ran.
Checklist
All checks are run in GitHub Actions. You'll be able to see the results of the checks at the bottom of the pull request page after it's been opened, and you can click on any of the specific checks listed to see the output of each step and debug failures.
Summary by CodeRabbit
New Features
Bug Fixes
Refactor