Unclear if I am using DuckDB and Streamlit to their full potential

Context:
I am working on a multi-page Streamlit dashboard with 20 to 30 pages, so speed is becoming critical.

  1. I am currently running the app locally, so I cannot share a link to the dashboard.
import duckdb
import pandas as pd
import polars as pl
import time

conn = duckdb.connect()

# dummy.csv is a file I created with 100 million rows (~3 GB) for testing.

query = \
"""
CREATE TABLE dummy as (
select * from 'dummy.csv'
)
"""

# Creates table
conn.execute(query)

# Part 1: Testing with DuckDB's show()
start_time = time.time()

query = \
"""
select *
from dummy
"""

# Note: show() only prints a preview of the result and returns None,
# so the assignment here does not produce a usable dataframe.
df = conn.sql(query).show()
print("--- %s seconds ---" % (time.time() - start_time))

This took my machine 0.013511896133422852 seconds. (But I cannot feed this result to my Streamlit application.)

So currently I am converting the result to either pandas or Polars, both of which are slower than DuckDB itself, and I am not happy with the workflow.

Converting to a pandas DataFrame:

start_time = time.time()

df = conn.sql(query).df()
df
print("--- %s seconds ---" % (time.time() - start_time))

Takes 6 seconds.

Converting to a Polars DataFrame:

start_time = time.time()

df = conn.execute(query).pl()
df
print("--- %s seconds ---" % (time.time() - start_time))

Takes 2 seconds.

Are there faster ways of using DuckDB's results directly to show dataframes in:

st.dataframe(df)

Hello @napster_8l, and welcome to the Streamlit Community! :hugs:

Two things come to mind to improve performance: first, you might consider optimizing your DuckDB queries, which includes using proper indexing, avoiding “select *” when not necessary, and pre-filtering data as much as possible.

You could also use DuckDB's capability to query CSV files directly, without loading them into a table, if the data doesn't need to persist. A quick sketch of both ideas:
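(The column names col_a and col_b below are illustrative, not from your post.)

import duckdb

conn = duckdb.connect()

# Query the CSV directly -- no CREATE TABLE step -- and let DuckDB
# scan only the columns and rows the query actually needs.
query = """
select col_a, col_b          -- project only what you need
from 'dummy.csv'
where col_a > 100            -- pre-filter before materializing
"""

# Materialize only the (much smaller) filtered result.
df = conn.sql(query).pl()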

I hope that helps.

Best,
Charly

By the way, you can also use st.cache_data to cache data-intensive functions. If you send me the full code, I can help devise further improvements :slight_smile:
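For instance, a minimal caching sketch (run_query is an illustrative name, not from your code):

import duckdb
import streamlit as st

conn = duckdb.connect()

@st.cache_data
def run_query(sql: str):
    # Streamlit caches the converted result, so app reruns
    # skip both the query and the conversion step.
    return conn.sql(sql).pl()

df = run_query("select * from dummy")
st.dataframe(df)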

Charly

Thank you for pointing out details like those.

I get that I shouldn't be using select * and should be writing efficient queries, etc. But this was just a test to pin down where exactly I am being inefficient. What I found is that the bottleneck is converting the aggregated results into pandas DataFrames, and in my use case there are simply a lot of lines in the code doing exactly that.

I just came here to learn IF there is any efficient way of displaying dataframes without having to convert to a pandas DataFrame, a Polars DataFrame, or an Arrow table.

How long does it take to create the DataFrame, without showing it?

df = conn.sql(query).df()

See the docs for details on what objects can be passed to st.dataframe().
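For example, st.dataframe() accepts a pyarrow.Table, so one option is to skip the pandas conversion entirely (a sketch; the limit is illustrative):

import duckdb
import streamlit as st

conn = duckdb.connect()

# arrow() materializes the result as a pyarrow.Table, which
# st.dataframe() can render without a pandas round trip.
st.dataframe(conn.sql("select * from dummy limit 1000").arrow())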

The df() method is taking 6 seconds.

Definitely not the optimal solution out there, so I would stick with the pl() method for now.
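If you go that route, recent Streamlit versions can render a Polars DataFrame directly; with older versions you may need to convert first. A sketch, with the fallback noted in the comment:

import duckdb
import streamlit as st

conn = duckdb.connect()

pl_df = conn.sql("select * from dummy limit 1000").pl()
# Recent Streamlit releases accept Polars DataFrames in st.dataframe();
# if yours does not, fall back to pl_df.to_arrow() or pl_df.to_pandas().
st.dataframe(pl_df)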

I learnt that .show() does not really execute the full query; it is just a lazy preview of the query result.

The heavy operation is only performed when df(), pl(), or arrow() is called.

Among them, it turns out that arrow() and pl() are significantly better in terms of performance.
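A small timing sketch of that observation (illustrative only, assuming the same dummy.csv as above):

import time
import duckdb

conn = duckdb.connect()
rel = conn.sql("select * from 'dummy.csv'")

# show() only prints a preview; the full result is never materialized.
t0 = time.time()
rel.show()
print("show():  %.3f s" % (time.time() - t0))

# arrow() actually executes the query and materializes the result.
t0 = time.time()
tbl = rel.arrow()
print("arrow(): %.3f s" % (time.time() - t0))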