Holy Duck! Full Uber Pickups dataset with DuckDB + Pyarrow

EDIT: Follow-up analysis and the full write-up are on my blog (with notebook source).

Summary:

We got a modest improvement in filterdata and more than a 10x speedup in histdata, but actually lost to numpy when averaging two arrays in mpoint!

  • filterdata:
    • pandas: 19.1 ms ± 284 µs
    • duckdb: 6.53 ms ± 126 µs
  • mpoint:
    • numpy: 403 µs ± 5.35 µs
    • duckdb: 1.7 ms ± 82.6 µs
  • histdata:
    • pandas + numpy: 40.8 ms ± 430 µs
    • duckdb: 2.93 ms ± 28.4 µs

I got interested in this DuckDB + Pyarrow blog post about how their zero-copy integration enables fast analysis on larger-than-memory datasets.

I rewrote the data-loading function in the Uber Pickups dataset example to use pyarrow and duckdb, with pretty promising results. Next step: the analysis!

import duckdb
from pyarrow import csv


def load_data():
    # Parse the gzipped CSV with pyarrow's multithreaded reader
    data = csv.read_csv('uber-raw-data-sep14.csv.gz', convert_options=csv.ConvertOptions(
        timestamp_parsers=['%m/%d/%Y %H:%M:%S']
    )).rename_columns(['date/time', 'lat', 'lon'])

    # Wrap the Arrow table in a DuckDB relation (zero-copy)
    data = duckdb.arrow(data)
    return data.arrow().to_pandas()

Live demo on Streamlit Cloud

Rest of the code on GitHub
