Compare realtime schedule #69

Open
wants to merge 63 commits into
base: main
ae0cb03
add date range option for downloads.
dcjohnson24 Aug 30, 2023
68a55bd
Add branch for GitHub actions
dcjohnson24 Aug 30, 2023
640b30a
Fix typing error with List[str, str]
dcjohnson24 Aug 30, 2023
26169ad
Add bucket name argument to download_fileobj
dcjohnson24 Sep 4, 2023
9c2a93d
Change Zipfile to ZipFile
dcjohnson24 Sep 4, 2023
48a1d53
Add case for nothing to check in transitfeeds
dcjohnson24 Sep 4, 2023
ec95194
Test with date range
dcjohnson24 Sep 4, 2023
98cac8e
Add 2022 data from transitfeeds to s3
dcjohnson24 Sep 5, 2023
3ea9630
Shorten date range for testing
dcjohnson24 Sep 5, 2023
0e6ceb9
Create data.json
dcjohnson24 Sep 18, 2023
4eb490f
add dependency
dcjohnson24 Sep 18, 2023
984de27
Add date_range argument
dcjohnson24 Sep 19, 2023
c671d61
Remove date formatting. Handled by React
dcjohnson24 Sep 19, 2023
fd7a87d
Add cta_download argument to create_GTFS_data_list
dcjohnson24 Sep 19, 2023
14c8756
Turn off fail-fast
dcjohnson24 Sep 19, 2023
e6b4d53
save transitfeeds zipfiles to s3
dcjohnson24 Sep 25, 2023
2ce5458
Fix syntax error
dcjohnson24 Sep 25, 2023
43e3945
confirm files exist
dcjohnson24 Sep 25, 2023
0a9a706
call seek on BytesIO
dcjohnson24 Sep 25, 2023
68f425b
Convert list of lists to list
dcjohnson24 Sep 25, 2023
0f67748
download from s3 instead of transitfeeds
dcjohnson24 Sep 26, 2023
ce6251a
test saving data.json
dcjohnson24 Sep 27, 2023
e8e5c7c
Search for transitfeeds files after cta files
dcjohnson24 Sep 27, 2023
1d8c481
Change filter date range for transitfeeds zipfiles
dcjohnson24 Sep 30, 2023
625a510
Correct the filename on transitfeeds zip
dcjohnson24 Sep 30, 2023
b550387
Add manual workflow to backfill transitfeeds data
dcjohnson24 Oct 1, 2023
c4049b2
Add data artifacts
dcjohnson24 Oct 1, 2023
d9498c3
add logger messages for s3 downloads
dcjohnson24 Oct 1, 2023
0023be6
add more logging
dcjohnson24 Oct 1, 2023
fc20b83
Change to ubuntu 20.04. Return to commit b550387a
dcjohnson24 Oct 2, 2023
799361f
Testing macos-latest
dcjohnson24 Oct 2, 2023
1f156d4
Start a shell for failed runs
dcjohnson24 Oct 2, 2023
2d9e9c9
return zipfile from s3 without extraction
dcjohnson24 Oct 3, 2023
a73d3dd
Add python debugger
dcjohnson24 Oct 3, 2023
5cafc67
Fix syntax error
dcjohnson24 Oct 3, 2023
445795b
Add more prints
dcjohnson24 Oct 3, 2023
e629cb0
Change loop
dcjohnson24 Oct 3, 2023
8c5d79d
Fix dictionary syntax
dcjohnson24 Oct 3, 2023
de38990
remove tmate
dcjohnson24 Oct 3, 2023
acb7d4d
Add cta_download argument to GTFSFeed.extract_data
dcjohnson24 Oct 7, 2023
d563c04
More print statements for start and end date
dcjohnson24 Oct 8, 2023
fe6f522
change save path of output JSON
dcjohnson24 Oct 10, 2023
b3dc77f
Make sure the save paths are the same
dcjohnson24 Oct 13, 2023
5f9a09d
Merge branch 'main' into compare-realtime-schedule
dcjohnson24 Oct 14, 2023
bddb22c
Use main function from compare_scheduled_and_rt.py
dcjohnson24 Oct 15, 2023
9f646ce
Remove cta_download arg
dcjohnson24 Oct 15, 2023
ef11107
fix key error
dcjohnson24 Oct 15, 2023
dc4d3d9
Add save_path argument
dcjohnson24 Oct 17, 2023
c84fc8c
change path name
dcjohnson24 Oct 17, 2023
cc02a8f
Add .json extension. Create scratch folder
dcjohnson24 Oct 17, 2023
3f89030
Add lineplot json data
dcjohnson24 Oct 20, 2023
87eddee
create schedule_feeds if None
dcjohnson24 Oct 21, 2023
81d0669
move Path.mkdir to cta_data_downloads.py
dcjohnson24 Oct 21, 2023
4db28f0
Check the save path in update_data.py
dcjohnson24 Oct 22, 2023
f89f9f8
create GeoJSON files. Fix save paths
dcjohnson24 Oct 22, 2023
2e4e28e
create 'ratio' column
dcjohnson24 Oct 22, 2023
31139ba
fix path name
dcjohnson24 Oct 23, 2023
675bb00
Fix path name
dcjohnson24 Oct 23, 2023
0134f35
Add .json to file path
dcjohnson24 Oct 23, 2023
bc85e5f
remove datetime in JSON
dcjohnson24 Oct 24, 2023
80fac25
convert lineplot_json to bytes
dcjohnson24 Oct 24, 2023
23f9653
Add workflow_dispatch.
dcjohnson24 Jan 9, 2024
d5fa57b
Pass day_type to summary_gdf_geo to create correct rankings
dcjohnson24 Jan 10, 2024
82 changes: 55 additions & 27 deletions .github/workflows/cta_data_downloads.yml
@@ -1,6 +1,12 @@
name: Automate CTA schedule and realtime downloads

on:
workflow_dispatch:
push:
branches:
- 'automate-schedule-downloads'
- 'date-range-downloads'
- 'compare-realtime-schedule'

schedule:
# Run every day at 5:30pm UTC, which is 12:30pm CDT (11:30am CST)
@@ -11,9 +17,11 @@ env:
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}

# Changing ubuntu to macos might resolve some timeout issues
# See https://github.com/actions/runner-images/issues/6680
jobs:
download-cta-schedule-data:
runs-on: ubuntu-latest
runs-on: macos-latest

steps:
- uses: actions/checkout@v3
@@ -29,40 +37,60 @@ jobs:
python -c 'from scrape_data.cta_data_downloads import save_cta_zip; \
save_cta_zip()' \
$AWS_ACCESS_KEY_ID $AWS_SECRET_ACCESS_KEY


# save-schedule-daily-summary:
# runs-on: ubuntu-latest
save-schedule-daily-summary:
runs-on: macos-latest

# steps:
# - uses: actions/checkout@v3
steps:
- uses: actions/checkout@v3

# - uses: actions/setup-python@v4
# with:
# python-version: ${{ env.PYTHON_VERSION }}
- uses: actions/setup-python@v4
with:
python-version: ${{ env.PYTHON_VERSION }}

# - name: 'Save schedule summaries'
# run: |
# pip install -r requirements.txt
# python -c 'from scrape_data.cta_data_downloads import save_sched_daily_summary; \
# save_sched_daily_summary()' $AWS_ACCESS_KEY_ID $AWS_SECRET_ACCESS_KEY
- name: 'Save schedule summaries'
# Test with no date and with date range
run: |
pip install -r requirements.txt
python -c 'from scrape_data.cta_data_downloads import save_sched_daily_summary; \
save_sched_daily_summary()' $AWS_ACCESS_KEY_ID $AWS_SECRET_ACCESS_KEY

save-realtime-daily-summary:
runs-on: macos-latest

steps:
- uses: actions/checkout@v3

- uses: actions/setup-python@v4
with:
python-version: ${{ env.PYTHON_VERSION }}

# save-realtime-daily-summary:
# runs-on: ubuntu-latest
- name: 'Save realtime summaries'

# steps:
# - uses: actions/checkout@v3
run: |
pip install -r requirements.txt

python -c 'from scrape_data.cta_data_downloads import save_realtime_daily_summary; \
save_realtime_daily_summary()' $AWS_ACCESS_KEY_ID $AWS_SECRET_ACCESS_KEY


save-frontend-map-json:
runs-on: macos-latest
needs: [save-realtime-daily-summary, save-schedule-daily-summary]
strategy:
fail-fast: false
steps:
- uses: actions/checkout@v3

# - uses: actions/setup-python@v4
# with:
# python-version: ${{ env.PYTHON_VERSION }}
- uses: actions/setup-python@v4
with:
python-version: ${{ env.PYTHON_VERSION }}

# - name: 'Save realtime summaries'
- name: 'Save data.json for frontend'

# run: |
# pip install -r requirements.txt
run: |
pip install -r requirements.txt

# python -c 'from scrape_data.cta_data_downloads import save_realtime_daily_summary; \
# save_realtime_daily_summary()' $AWS_ACCESS_KEY_ID $AWS_SECRET_ACCESS_KEY

python -c 'from scrape_data.cta_data_downloads import compare_realtime_sched; \
compare_realtime_sched()' $AWS_ACCESS_KEY_ID $AWS_SECRET_ACCESS_KEY
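Each job above runs `python -c '...' $AWS_ACCESS_KEY_ID $AWS_SECRET_ACCESS_KEY`, so the two secrets reach Python as positional arguments (with `python -c`, `sys.argv[0]` is `'-c'` and the extra arguments follow). The diff does not show how `scrape_data.cta_data_downloads` reads them; the sketch below is one possible approach, and `read_aws_credentials` is a hypothetical helper, not a function from this repository. It falls back to the environment variables that the workflow's `env:` block already sets.

```python
import os
import sys
from typing import Mapping, Optional, Sequence, Tuple


def read_aws_credentials(
    argv: Sequence[str],
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[str, str]:
    """Return (access_key, secret_key) from argv, else from the environment.

    Hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    # With `python -c 'code' KEY SECRET`, argv is ['-c', KEY, SECRET].
    if len(argv) >= 3:
        return argv[1], argv[2]
    try:
        return env["AWS_ACCESS_KEY_ID"], env["AWS_SECRET_ACCESS_KEY"]
    except KeyError as missing:
        raise RuntimeError(f"Missing AWS credential: {missing}") from None


if __name__ == "__main__":
    try:
        access_key, secret_key = read_aws_credentials(sys.argv)
        print("Credentials found")
    except RuntimeError as err:
        print(err)
```

Reading from the environment first would avoid putting secrets on the command line at all, which may be worth considering since the workflow already exports both variables.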
38 changes: 38 additions & 0 deletions .github/workflows/transitfeeds-backfill-s3.yml
@@ -0,0 +1,38 @@
name: Download transitfeeds.com zipfiles and save to s3

on:
workflow_dispatch:
inputs:
start_date:
description: 'Start date in YYYY-MM-DD format'
required: false
type: string
end_date:
description: 'End date in YYYY-MM-DD format e.g. 2023-05-20'
required: false
type: string

env:
PYTHON_VERSION: 3.10.6
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}

jobs:
save-transitfeeds-data:
runs-on: ubuntu-latest

steps:
- uses: actions/checkout@v3

- uses: actions/setup-python@v4
with:
python-version: ${{ env.PYTHON_VERSION }}

- name: Download and save transitfeeds.com schedule data

run: |
pip install -r requirements.txt
python -c 'from scrape_data.cta_data_downloads import save_transitfeeds_zip; \
save_transitfeeds_zip(start_date="${{ inputs.start_date }}" or None, end_date="${{ inputs.end_date }}" or None)' \
$AWS_ACCESS_KEY_ID $AWS_SECRET_ACCESS_KEY
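
The `start_date` and `end_date` inputs are optional strings in YYYY-MM-DD format, so the receiving code has to cope with missing or empty values as well as well-formed dates. The diff does not show how `save_transitfeeds_zip` handles them; the following is a minimal sketch of one way to normalize such inputs, using a hypothetical `parse_date_range` helper that is not part of this repository.

```python
from datetime import date
from typing import Optional, Tuple


def parse_date_range(
    start_date: Optional[str],
    end_date: Optional[str],
) -> Tuple[Optional[date], Optional[date]]:
    """Convert optional YYYY-MM-DD strings to dates.

    Hypothetical helper: empty strings and None both mean "no bound".
    """

    def _parse(value: Optional[str]) -> Optional[date]:
        if not value:
            return None
        # Raises ValueError if the string is not a valid ISO date.
        return date.fromisoformat(value)

    start, end = _parse(start_date), _parse(end_date)
    if start is not None and end is not None and start > end:
        raise ValueError(f"start_date {start} is after end_date {end}")
    return start, end


if __name__ == "__main__":
    print(parse_date_range("2023-05-01", "2023-05-20"))
    print(parse_date_range("", None))
```

Validating the range up front means a mistyped manual dispatch fails before any download from transitfeeds.com or upload to S3 starts.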

64 changes: 51 additions & 13 deletions data_analysis/compare_scheduled_and_rt.py
@@ -314,24 +314,33 @@ def build_summary(
)
return summary


def main(freq: str = 'D') -> Tuple[List[dict],pd.DataFrame, pd.DataFrame]:
"""Calculate the summary by route and day across multiple schedule versions
def create_GTFS_data_list(schedule_feeds: List[dict]) -> dict:
""" Create list of GTFS data for each schedule version

Args:
freq (str): Frequency of aggregation. Defaults to Daily.
schedule_feeds (List[dict]): List of dictionaries with the keys
'schedule_version', 'feed_start_date', and 'feed_end_date'.

Returns:
pd.DataFrame: A DataFrame of every day in the specified data with
scheduled and observed count of trips.
pd.DataFrame: A DataFrame summary across
versioned schedule comparisons.
dict: A dictionary with keys 'GTFS_data_list' and 'schedule_data_list'.
'GTFS_data_list' is a list of dictionaries with the keys 'fname'
(the schedule version) and 'data', the data extracted from the GTFS zip file.
'schedule_data_list' is a list of dictionaries with the keys
'schedule_version' and 'data', where 'data' is the route_daily_summary.
"""
schedule_feeds = create_schedule_list(month=5, year=2022)


GTFS_data_list = []
schedule_data_list = []
pbar = tqdm(schedule_feeds)
for feed in pbar:
schedule_version = feed["schedule_version"]
# Files with .zip suffix come from the CTA directly.
# Otherwise, they come from transitfeeds.com
cta_download = schedule_version.endswith('.zip')
pbar.set_description(
f"Generating daily schedule data for "
f"schedule version {schedule_version}"
@@ -340,15 +349,18 @@ def main(freq: str = 'D') -> Tuple[List[dict],pd.DataFrame, pd.DataFrame]:
f"\nDownloading zip file for schedule version "
f"{schedule_version}"
)
CTA_GTFS = static_gtfs_analysis.download_zip(schedule_version)
CTA_GTFS, _ = static_gtfs_analysis.download_zip(schedule_version)
logger.info("\nExtracting data")
data = static_gtfs_analysis.GTFSFeed.extract_data(
CTA_GTFS,
version_id=schedule_version
version_id=schedule_version,
cta_download=cta_download
)
data = static_gtfs_analysis.format_dates_hours(data)
GTFS_data_list.append({'fname': schedule_version, 'data': data})

logger.info("\nSummarizing trip data")

trip_summary = static_gtfs_analysis.make_trip_summary(data,
pendulum.from_format(feed['feed_start_date'], 'YYYY-MM-DD'),
pendulum.from_format(feed['feed_end_date'], 'YYYY-MM-DD'))
@@ -360,8 +372,34 @@ def main(freq: str = 'D') -> Tuple[List[dict],pd.DataFrame, pd.DataFrame]:

schedule_data_list.append(
{"schedule_version": schedule_version,
"data": route_daily_summary}
"data": route_daily_summary}
)


return {
'GTFS_data_list': GTFS_data_list,
'schedule_data_list': schedule_data_list
}

def main(freq: str = 'D', schedule_feeds: List[dict] = None
) -> Tuple[List[dict],pd.DataFrame, pd.DataFrame]:
"""Calculate the summary by route and day across multiple schedule versions

Args:
freq (str): Frequency of aggregation. Defaults to Daily.
schedule_feeds (List[dict]): List of dictionaries with the keys
'schedule_version', 'feed_start_date', and 'feed_end_date'.
Returns:
pd.DataFrame: A DataFrame of every day in the specified data with
scheduled and observed count of trips.
pd.DataFrame: A DataFrame summary across
versioned schedule comparisons.
"""
if schedule_feeds is None:
schedule_feeds = create_schedule_list(month=5, year=2022)

schedule_data_list = create_GTFS_data_list(schedule_feeds)['schedule_data_list']

agg_info = AggInfo(freq=freq)
combined_long, combined_grouped = combine_real_time_rt_comparison(
schedule_feeds,
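
Two small decisions in this file's diff are worth isolating: `main` falls back to a default feed list only when `schedule_feeds` is `None`, and a schedule version is treated as a direct CTA download when it ends in `.zip`. The sketch below restates both as standalone functions so they can be tested without S3 or network access. `resolve_schedule_feeds` and `is_cta_download` are hypothetical names, and the feed values in the usage section are made up for illustration.

```python
from typing import Callable, List, Optional


def resolve_schedule_feeds(
    schedule_feeds: Optional[List[dict]],
    default_factory: Callable[[], List[dict]],
) -> List[dict]:
    """Use the caller's feeds, building the default only when None is passed.

    An explicit empty list is respected and returned unchanged.
    """
    if schedule_feeds is None:
        return default_factory()
    return schedule_feeds


def is_cta_download(schedule_version: str) -> bool:
    """Versions with a .zip suffix come from the CTA directly;
    anything else is assumed to come from transitfeeds.com."""
    return schedule_version.endswith(".zip")


if __name__ == "__main__":
    # Illustrative stand-in for create_schedule_list(month=5, year=2022).
    def default_feeds() -> List[dict]:
        return [{"schedule_version": "example_version"}]

    feeds = resolve_schedule_feeds(None, default_feeds)
    for feed in feeds:
        version = feed["schedule_version"]
        print(version, is_cta_download(version))
```

Checking `is None` rather than truthiness keeps an intentionally empty feed list from silently triggering the default.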