Refactor and add incremental workflow functionality. #78
Conversation
Some documentation polish is still needed.
Thank you so much for taking this on, Hailey! Excited to talk about it in more detail tomorrow -- I skimmed today but definitely think an overview would be helpful since there's so much here. Just wanted to respond to one note in the description for context:
This was actually intended behavior: we weren't sure which schedule should be used on boundary dates, so we deliberately didn't pick one. We can just pick one if we want, but this was intentional.
Thanks, I hadn't realized that. Good to know.
Reviewer TODO to myself: check how the date boundary looks with the change in logic
The download cache part of this is now in PR #80 |
…ansitfeeds behavior.
I decided to keep the existing boundary behavior here. I updated the PR so that the new code that pulls scraped schedules from S3 behaves consistently.
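For readers unfamiliar with the boundary issue being discussed: when two schedule feed versions meet on a date, that date is covered by both unless the validity intervals are treated as half-open. A minimal sketch of that idea, with entirely hypothetical feed data and function names (none of this is the repo's actual code):

```python
from datetime import date

# Hypothetical feed versions whose validity windows share the boundary
# date 2022-05-07. Without a convention, that date is ambiguous.
feeds = [
    {"version": "20220420", "start": date(2022, 4, 20), "end": date(2022, 5, 7)},
    {"version": "20220507", "start": date(2022, 5, 7), "end": date(2022, 5, 25)},
]

def feed_for(day: date) -> str:
    """Pick the feed whose half-open interval [start, end) contains day."""
    for feed in feeds:
        if feed["start"] <= day < feed["end"]:
            return feed["version"]
    raise LookupError(f"no feed covers {day}")

# Under [start, end), the boundary date belongs to the later feed.
print(feed_for(date(2022, 5, 7)))
```

Using half-open intervals is one common way to make boundary handling consistent; whichever convention is picked, applying it identically to the old code path and the new S3 path is what keeps the behavior consistent.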
Ready for review now! Comments welcome. This is now basically a superset of PR #80, so if desired you could just review this. |
For easier reviewing, try
Thank you again for tackling this! A few questions. I was mostly testing by trying to run things locally. Is there an order I need to follow?
Specifically, when I try to run update_data, I get the following error:
python3 -m update_data
INFO:root: Searching page 1
INFO:root: Searching page 2
INFO:root: Searching page 3
INFO:root: Searching page 4
INFO:root: Found schedule for May 2022
INFO:root: Adding schedule for May 7, 2022
INFO:root:Processing 49 schedules.
Traceback (most recent call last):
File "/Users/laurie/opt/anaconda3/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/Users/laurie/opt/anaconda3/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/Users/laurie/git/hailey-fork-ghost-buses/update_data.py", line 394, in <module>
main()
File "/Users/laurie/git/hailey-fork-ghost-buses/update_data.py", line 367, in main
combined_long_df, summary_df = csrt.main(cache_manager, freq=freq, start_date=start_date, end_date=None,
File "/Users/laurie/git/hailey-fork-ghost-buses/data_analysis/compare_scheduled_and_rt.py", line 296, in main
return summarizer.main(existing)
File "/Users/laurie/git/hailey-fork-ghost-buses/data_analysis/compare_scheduled_and_rt.py", line 282, in main
this_iter = combiner.combine()
File "/Users/laurie/git/hailey-fork-ghost-buses/data_analysis/compare_scheduled_and_rt.py", line 142, in combine
schedule = self.schedule_summarizer.get_route_daily_summary()
File "/Users/laurie/git/hailey-fork-ghost-buses/data_analysis/static_gtfs_analysis.py", line 109, in get_route_daily_summary
trip_summary = self.make_trip_summary()
File "/Users/laurie/git/hailey-fork-ghost-buses/data_analysis/static_gtfs_analysis.py", line 127, in make_trip_summary
data = self.download_and_extract()
File "/Users/laurie/git/hailey-fork-ghost-buses/data_analysis/static_gtfs_analysis.py", line 241, in download_and_extract
cta_gtfs = zipfile.ZipFile(self.gtfs_fetcher.retrieve_file(self.schedule_feed_info))
File "/Users/laurie/git/hailey-fork-ghost-buses/data_analysis/gtfs_fetcher.py", line 103, in retrieve_file
local_filename, _, s3_filename, _ = self.versions[version_id]
KeyError: '20220507'
I can't tell whether I was supposed to run something first to initialize whatever is causing the KeyError.
...
update_data without arguments is broken for me too; I will work on fixing it. In the meantime, you can test the incremental workflow by pointing to the existing frontend status file with something like this:
It was a simple fix for the no-arguments case, which I just committed. It should work now.
Ok, I've updated the README and addressed the open comments. I think this is ready for another round of testing and review. Please let me know what you think about deferring the transitfeeds changes.
Description
Adds the ability to incrementally update data, refactoring the processing workflow to make this feasible.
The existing logic is split into multiple new classes, separating realtime and schedule processing
as much as possible. A few other features are also included.
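The core incremental idea, sketched very roughly (all names here are illustrative, not the PR's actual API): load the existing combined output, process only the dates it doesn't yet cover, and append the new results.

```python
# Hypothetical sketch of an incremental update: given existing summary
# rows, a full list of target dates, and a per-day processing function,
# reprocess only the dates that are missing from the existing output.
def incremental_update(existing, all_dates, process_day):
    done = {row["date"] for row in existing}
    new_rows = [process_day(d) for d in all_dates if d not in done]
    return existing + new_rows

existing = [{"date": "2022-05-06", "trips": 100}]
result = incremental_update(
    existing,
    ["2022-05-06", "2022-05-07"],
    lambda d: {"date": d, "trips": 0},  # stand-in for real per-day processing
)
# Only 2022-05-07 is processed; 2022-05-06 is reused from existing output.
```

The payoff is that a daily run touches one day of data instead of reprocessing the full history, which is what makes separating realtime and schedule processing worthwhile.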
Resolves # [issue]
Type of change
How has this been tested?
Manually