EDIT: Follow up on analysis and the whole deal on my blog (with notebook source).

Summary: We got a modest improvement in `filterdata` and more than 10x speedup in `histdata`, but actually lost out to `numpy` for finding the average of 2 arrays in `mpoint`!

- `filterdata`:
  - `pandas`: 19.1 ms ± 284 µs
  - `duckdb`: 6.53 ms ± 126 µs
- `mpoint`:
  - `numpy`: 403 µs ± 5.35 µs
  - `duckdb`: 1.7 ms ± 82.6 µs
- `histdata`:
  - `pandas` + `numpy`: 40.8 ms ± 430 µs
  - `duckdb`: 2.93 ms ± 28.4 µs
I got interested in this DuckDB + PyArrow blog post on how their zero-copy integration can make for fast analysis on larger-than-memory datasets. I rewrote the data load function in the Uber Pickups dataset example to use `pyarrow` and `duckdb`, with pretty promising results. Next step: the analysis!
```python
import duckdb
import streamlit as st
from pyarrow import csv

@st.experimental_singleton
def load_data():
    # Read the gzipped CSV with pyarrow, keeping only the columns we need
    data = csv.read_csv('uber-raw-data-sep14.csv.gz', convert_options=csv.ConvertOptions(
        include_columns=["Date/Time", "Lat", "Lon"],
        timestamp_parsers=['%m/%d/%Y %H:%M:%S']
    )).rename_columns(['date/time', 'lat', 'lon'])
    # We transform the Arrow table into a DuckDB relation (zero-copy)
    data = duckdb.arrow(data)
    # Hand the rest of the app a pandas DataFrame, as before
    return data.arrow().to_pandas()
```
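Since the relation wraps the Arrow table zero-copy, the later analysis steps could also be expressed as relational queries instead of converting to pandas up front. As a hedged sketch (my own naming, not the notebook's), the `mpoint` midpoint calculation might look like the following; per the timings above, plain `numpy` still wins for this particular step.

```python
import duckdb
import pyarrow as pa

def mpoint_duckdb(data: pa.Table):
    """Hypothetical sketch of the `mpoint` step: average lat/lon computed as a
    DuckDB relational aggregate instead of with numpy."""
    rel = duckdb.arrow(data)  # zero-copy relation over the Arrow table
    return rel.aggregate("avg(lat) AS lat, avg(lon) AS lon").df()
```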
Live on Streamlit Cloud
Rest of the code on GitHub