MTA Subway Alert Affected Riders

Project for the MTA Open Data Challenge 2024.

This project provides an interactive visualization platform (mta-subway-alert-affected-riders.vercel.app) that maps the relationship between MTA subway service disruptions and ridership patterns. By correlating service alerts with station entry data, we visualize the number of riders potentially affected by service disruptions through an interactive heatmap and station-level grid cells.

Watch the full timelapse on YouTube.

The analysis spans 31 months (February 2022 through August 2024), allowing users to select any date and explore:

  • A dynamic heatmap showing potentially affected ridership across the subway system
  • A station-level grid cell map for detailed analysis
  • Interactive timeline views for each station showing relevant service alerts throughout the day
  • Hover functionality revealing detailed station and alert information

Important Disclaimer: Our estimate is an upper bound on the number of affected riders. The actual number is likely significantly lower because:

  • Not all stops along a disrupted line are necessarily affected by the reported incident
  • Some riders may have alternative routes available
  • The alert may affect only a specific segment of the line
  • Some riders may have been informed of the disruption before entering the station

Datasets

This repository contains code to process New York MTA (Metropolitan Transportation Authority) data from three main datasets:

  1. MTA Subway Stations - Station locations and route information
  2. MTA Service Alerts - Real-time service alerts and disruptions (2020-04-28 to 2024-08-30 when accessed)
  3. MTA Subway Hourly Ridership - Hourly ridership data by station (2022-02-01 to 2024-10-01 when accessed)

Note: This project analyzes the overlapping period between the alerts and ridership datasets (2022-02-01 to 2024-08-30).

Data Preparation

Pre-processed Data

For convenience, a pre-processed version of the data is available on Google Drive. You can use these files directly if you don't need to process the raw data yourself.

Processing Raw Data

Due to size limitations, the original datasets are not included in this repository. Please download them from the official sources linked above.

  1. Install required dependencies:
pip install -r requirements.txt
  2. Download the CSV files from the links above and place them in a datasets folder with the following names:

    • MTA_Subway_Stations_20241024.csv
    • MTA_Service_Alerts__Beginning_April_2020_20241014.csv
    • MTA_Subway_Hourly_Ridership__Beginning_February_2022_20241014.csv
  3. Run the data preparation script:

python data-preparation.py

What the Script Does

The data-preparation.py script reads these CSV files and writes the following TSV files to the data folder:

  • mta_stations.tsv
  • mta_subway_alerts.tsv
  • mta_subway_hourly_ridership.tsv

The script performs several key transformations:

  • Filters for subway lines of interest
  • Converts timestamps to standardized format
  • Structures station information for geospatial analysis
  • Prepares data in TSV format optimized for PostgreSQL import
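
The transformations listed above can be sketched for the alerts file as follows. This is a minimal illustration, not the actual contents of data-preparation.py: the input column names (Alert ID, Date, Affected), the "|" separator between lines, the timestamp format, and the set of lines of interest are all assumptions made for the example.

```python
import csv
import io
from datetime import datetime

# Assumption: the subway lines of interest; the real script may use another set.
LINES_OF_INTEREST = set("1234567ACEBDFMGJZLNQRWS")


def to_iso(raw: str) -> str:
    """Convert an assumed 'MM/DD/YYYY HH:MM:SS AM/PM' timestamp to ISO format."""
    return datetime.strptime(raw, "%m/%d/%Y %I:%M:%S %p").isoformat(sep=" ")


def to_pg_array(items) -> str:
    """Format values as a PostgreSQL array literal such as {A,C}."""
    return "{" + ",".join(sorted(items)) + "}"


def convert_alerts(csv_text: str) -> str:
    """Turn alert rows (CSV) into tab-separated rows with a header line."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["alert_id", "timestamp", "affected_lines"])
    for row in reader:
        # Keep only the affected lines that are subway lines of interest.
        lines = {s.strip() for s in row["Affected"].split("|")} & LINES_OF_INTEREST
        if not lines:
            continue  # the alert concerns no subway line of interest
        writer.writerow([row["Alert ID"], to_iso(row["Date"]), to_pg_array(lines)])
    return out.getvalue()
```

Writing the array column as a brace-delimited literal lets PostgreSQL parse it directly during a tab-delimited import, since the commas inside the braces do not collide with the field delimiter.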

Populating Database

Create Database Schema

First, create the database tables and indices by running the SQL schema:

psql -d your_database_name -f db-schema.sql

This creates:

  • Custom enum type for subway lines
  • Three main tables with appropriate constraints and indices
  • GIN indices for efficient array operations
  • B-tree indices for timestamp-based queries

Database Structure

The schema defines three main tables:

  1. subway_stops: Stores station information

    • Primary key: complex_id
    • Contains: station coordinates and served subway lines
    • Includes spatial validation constraints for coordinates
    • GIN index on lines array for efficient line-based queries
  2. mta_alerts: Stores service disruption alerts

    • Primary key: alert_id
    • Contains: alert details, timestamp, and affected subway lines
    • Includes unique constraint on alert and event IDs
    • Indexed on timestamp and affected_lines for efficient temporal and line-based queries
  3. hourly_ridership: Stores station entry data

    • Composite primary key: (timestamp, complex_id)
    • Contains: hourly ridership counts for each station
    • Foreign key relationship with subway_stops
    • Indexed for efficient temporal and station-based queries
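
To make the shape of the three tables concrete, here is a rough Python mirror of the records. It is only a sketch: complex_id, alert_id, timestamp, lines, and affected_lines come from the description above, while latitude, longitude, event_id, ridership, the field types, and the exact coordinate bounds are hypothetical stand-ins for whatever db-schema.sql actually defines.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class SubwayStop:
    complex_id: int            # primary key
    latitude: float            # hypothetical column name
    longitude: float           # hypothetical column name
    lines: frozenset[str]      # served subway lines (an array column in the schema)

    def __post_init__(self) -> None:
        # Stand-in for the schema's spatial validation constraints;
        # the real bounds may be tighter than the whole globe.
        if not -90 <= self.latitude <= 90:
            raise ValueError(f"latitude out of range: {self.latitude}")
        if not -180 <= self.longitude <= 180:
            raise ValueError(f"longitude out of range: {self.longitude}")


@dataclass(frozen=True)
class MtaAlert:
    alert_id: int                  # primary key
    event_id: int                  # hypothetical column name
    timestamp: datetime
    affected_lines: frozenset[str]


@dataclass(frozen=True)
class HourlyRidership:
    timestamp: datetime        # start of the hour
    complex_id: int            # refers to SubwayStop.complex_id
    ridership: int             # hypothetical column name

    @property
    def key(self) -> tuple[datetime, int]:
        """Composite key (timestamp, complex_id), as in the table."""
        return (self.timestamp, self.complex_id)
```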

Loading Data

After creating the schema, populate the tables with the TSV files generated from the data preparation step:

-- Load subway stations data
\copy subway_stops FROM 'data/mta_stations.tsv' WITH DELIMITER E'\t' CSV HEADER;

-- Load service alerts
\copy mta_alerts FROM 'data/mta_subway_alerts.tsv' WITH DELIMITER E'\t' CSV HEADER;

-- Load hourly ridership data
\copy hourly_ridership FROM 'data/mta_subway_hourly_ridership.tsv' WITH DELIMITER E'\t' CSV HEADER;

The schema includes appropriate indices and constraints to ensure data integrity and query performance. The GIN indices on array columns (lines and affected_lines) are particularly important for efficiently finding stations affected by specific service disruptions.
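
The lookup that these array indices speed up is essentially a set-overlap test: a station is a candidate for an alert when its lines share at least one element with the alert's affected_lines. PostgreSQL expresses this with the array overlap operator (&&), which GIN indices support. Whether the project's queries use exactly that operator is an assumption; the Python sketch below only illustrates the matching rule, with made-up station IDs.

```python
def affected_stations(stops: dict[int, set[str]], affected_lines: set[str]) -> list[int]:
    """Return the complex_ids whose served lines overlap the alert's lines."""
    return sorted(
        complex_id
        for complex_id, lines in stops.items()
        if lines & affected_lines  # a non-empty intersection means overlap
    )


# Hypothetical stations, keyed by complex_id.
stops = {
    101: {"A", "C", "E"},
    102: {"7"},
    103: {"E", "M"},
}
print(affected_stations(stops, {"E"}))  # stations 101 and 103 serve the E line
```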

Deploy the Website

The visualization platform is built with Next.js, which provides both the frontend interface and the backend API endpoints that query the database.

Note: The core data aggregation logic for calculating potentially affected ridership is implemented in the backend API router at website/src/server/api/routers/mta-alert.ts. This TypeScript file contains the queries and algorithms for:

  • Correlating alerts with station ridership
  • Calculating temporal overlaps
  • Aggregating affected passenger counts

Setup Development Environment

  1. Navigate to the website directory:
cd website
  2. Install dependencies:
npm install
  3. Configure database connection:
    • Create a .env file in the website directory
    • Add your database connection URL to the .env file:
DATABASE_URL="postgresql://username:password@host:port/database"

Development

Run the development server:

npm run dev

The site will be available at http://localhost:3000. The development server includes:

  • Hot reloading for real-time code changes
  • API route testing
  • Development error messages

Production Deployment

Build the production version:

npm run build

After building, you can start the production server:

npm run start

The website (mta-subway-alerts-affected-rider.vercel.app) provides:

  • Interactive heatmap of potentially affected ridership
  • Station-level grid cell visualization
  • Timeline views of service alerts
  • Date selection for historical analysis

Note: Ensure your database is accessible from your deployment environment and the DATABASE_URL is properly configured in your production environment.

Preliminary Data Analysis

The /analysis directory contains exploratory visualizations examining the relationship between MTA service alerts and ridership patterns from February 2022 through August 2024.

While these visualizations offer insights into weekly patterns, monthly trends, and correlation between alerts and affected ridership, they use a simplified methodology that provides upper-bound estimates.

The analysis counts riders entering stations served by disrupted lines within 30 minutes of an alert. It could be improved by considering the specific line segments affected, alternative routes, alert severity, and transfer patterns. For a more detailed view, visit the interactive visualization platform at mta-subway-alerts-affected-rider.vercel.app, which implements some of these methodological improvements.
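
The simplified counting rule can be sketched in Python as follows. This is an illustration under stated assumptions, not the code in /analysis: it treats "within 30 minutes" as the window from the alert time to 30 minutes later, counts an entire hourly ridership bucket whenever it overlaps that window, and counts each station-hour only once even when several alerts match it.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=30)  # assumed to start at the alert timestamp
HOUR = timedelta(hours=1)       # ridership is reported per hour


def upper_bound_affected(alerts, stops, ridership):
    """Sum ridership of station-hours that overlap any alert window.

    alerts:    list of (timestamp, set of affected lines)
    stops:     {complex_id: set of served lines}
    ridership: {(hour_start, complex_id): riders entering in that hour}
    """
    counted = set()
    for alert_ts, affected in alerts:
        win_start, win_end = alert_ts, alert_ts + WINDOW
        for hour_start, complex_id in ridership:
            # Two intervals overlap when each starts before the other ends.
            overlaps = hour_start < win_end and win_start < hour_start + HOUR
            if overlaps and stops[complex_id] & affected:
                counted.add((hour_start, complex_id))  # set avoids double counting
    return sum(ridership[key] for key in counted)
```

Because a whole hourly bucket is counted even when the alert window covers only part of it, and every station on an affected line is included, the result overestimates the true number of affected riders.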