Holy Duck! Full Uber Pickups dataset with DuckDB + Pyarrow

gerardrbentley · April 25, 2022, 7:41pm

EDIT: The follow-up analysis and the full write-up are on my blog (with notebook source).

Summary:

We got a modest improvement in filterdata (about 3x) and more than a 10x speedup in histdata (about 14x), but actually lost out to numpy (about 4x slower) for finding the average of 2 arrays in mpoint!

filterdata:

pandas: 19.1 ms ± 284 µs

duckdb: 6.53 ms ± 126 µs

mpoint:

numpy: 403 µs ± 5.35 µs

duckdb: 1.7 ms ± 82.6 µs

histdata:

pandas + numpy: 40.8 ms ± 430 µs

duckdb: 2.93 ms ± 28.4 µs
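For anyone who wants to check the ratios, here is a small sketch that recomputes the speedups from the mean timings listed above. The dictionary layout and the `speedup` helper are just my own illustration, and the ± spread is ignored.

```python
# Mean timings copied from the summary above, all converted to milliseconds.
# "baseline" is pandas and/or numpy, depending on the function.
timings_ms = {
    "filterdata": {"baseline": 19.1, "duckdb": 6.53},
    "mpoint": {"baseline": 0.403, "duckdb": 1.7},
    "histdata": {"baseline": 40.8, "duckdb": 2.93},
}


def speedup(baseline_ms: float, duckdb_ms: float) -> float:
    """Return how many times faster DuckDB ran than the baseline (< 1 means slower)."""
    return baseline_ms / duckdb_ms


speedups = {
    name: speedup(t["baseline"], t["duckdb"]) for name, t in timings_ms.items()
}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x")
```

A ratio below 1 (as for mpoint) means the DuckDB version was slower than the baseline.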

I got interested in this DuckDB + Pyarrow blog post on how their zero-copy integration can make for fast analysis on larger-than-memory datasets.

I re-wrote the data load function in the Uber Pickups dataset example to use pyarrow and duckdb with pretty promising results. Next step: the analysis!

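import duckdb
import streamlit as st
from pyarrow import csv

@st.experimental_singleton
def load_data():
    # Read only the three columns the app needs and parse timestamps while loading
    data = csv.read_csv('uber-raw-data-sep14.csv.gz', convert_options=csv.ConvertOptions(
        include_columns=["Date/Time","Lat","Lon"],
        timestamp_parsers=['%m/%d/%Y %H:%M:%S']
    )).rename_columns(['date/time', 'lat', 'lon'])

    # We transform the dataset into a DuckDB relation
    data = duckdb.arrow(data)
    # Convert back to Arrow, then to pandas for the rest of the app
    return data.arrow().to_pandas()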

Live in streamlit cloud

Rest of github code

Kareem_Rasheed_babat · April 29, 2022, 5:46am

Great work

system · April 29, 2023, 5:47am

This topic was automatically closed 365 days after the last reply. New replies are no longer allowed.
