Implementation of the minimal MapReduce sliding aggregation algorithm in PySpark.
Authors of the algorithm: Yufei Tao, Wenqing Lin, Xiaokui Xiao
Links to the paper describing the algorithm:
https://dl.acm.org/doi/10.1145/2463676.2463719
https://www.cse.cuhk.edu.hk/~taoyf/paper/sigmod13-mr.pdf
The input data is the Yellow Taxi Trip Records (CSV) for January 2021 from https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page. For each record, I compute the average ride distance and the average passenger occupancy over the last 1000 rides. The implementation follows the minimal algorithm from the paper and uses the Spark RDD API in Python.
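To illustrate the idea behind the sliding aggregation, here is a minimal plain-Python sketch that simulates partitions as lists instead of using Spark. It is not the code from this repository. The function name `sliding_averages`, the assumption that the records are already globally sorted across partitions, and the choice that a window includes the current record are all assumptions made for illustration. Each partition computes one local total, and a window that starts in an earlier partition is assembled from the tail of that partition, the totals of any partitions fully in between, and a local prefix sum.

```python
from bisect import bisect_right
from itertools import accumulate


def sliding_averages(partitions, window):
    """Average over the last `window` values ending at each element.

    `partitions` is a list of lists that is assumed to be globally sorted
    already (e.g. by pickup time); each inner list stands in for the data
    held by one machine. The window includes the current element
    (an assumption of this sketch).
    """
    sizes = [len(p) for p in partitions]
    # offsets[i] is the global rank of the first element of partition i.
    offsets = [0] + list(accumulate(sizes))
    # One aggregate per partition; in a cluster these small values would be
    # sent to every machine.
    totals = [sum(p) for p in partitions]

    result = []
    for i, part in enumerate(partitions):
        # Local prefix sums: prefix[j] is the sum of part[:j].
        prefix = [0] + list(accumulate(part))
        for j in range(len(part)):
            rank = offsets[i] + j
            start = max(0, rank - window + 1)  # global rank where the window starts
            if start >= offsets[i]:
                # The whole window lies inside this partition.
                total = prefix[j + 1] - prefix[start - offsets[i]]
            else:
                # Partition k holds the first element of the window.
                k = bisect_right(offsets, start) - 1
                # Tail of partition k (in a cluster these elements would be shipped here).
                tail = sum(partitions[k][start - offsets[k]:])
                # Partitions strictly between k and i are covered entirely.
                middle = sum(totals[k + 1:i])
                total = tail + middle + prefix[j + 1]
            result.append(total / (rank - start + 1))
    return result


print(sliding_averages([[1, 2], [3, 4, 5]], window=3))
# prints [1.0, 1.5, 2.0, 3.0, 4.0]
```

For the task described above, `window` would be 1000 and the function would be applied once to the ride distances and once to the passenger counts. In a Spark setting the per-partition totals and the shipped tails would be exchanged between executors rather than read from a shared list.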