Unclear if I am using DuckDB and Streamlit to their full potential

Context:
I am working on a multi-page Streamlit dashboard with 20 to 30 pages, so speed is becoming critical.

  1. I am currently running the app locally, so I cannot share a link to the dashboard.
import duckdb
import pandas as pd
import polars as pl
import time

conn = duckdb.connect()

# dummy.csv is a file I created with 100 million rows (~3 GB) for testing.

query = \
"""
CREATE TABLE dummy as (
select * from 'dummy.csv'
)
"""

# Creates table
conn.execute(query)

# Part 1: Testing with DuckDB's show()
start_time = time.time()

query = \
"""
select *
from dummy
"""

# Note: show() only prints a preview of the result and returns None,
# so the assignment here does not produce a usable dataframe.
df = conn.sql(query).show()
print("--- %s seconds ---" % (time.time() - start_time))

This took my machine 0.013511896133422852 seconds. (But I cannot feed this result to my Streamlit application.)

So currently I am converting the result to either pandas or Polars, both of which are slower than DuckDB itself, and I am not happy with the workflow.

Converting to a pandas DataFrame:

start_time = time.time()

df = conn.sql(query).df()
df
print("--- %s seconds ---" % (time.time() - start_time))

Takes 6 seconds.

Converting to a Polars DataFrame:

start_time = time.time()

df = conn.execute(query).pl()
df
print("--- %s seconds ---" % (time.time() - start_time))

Takes 2 seconds.

Are there faster ways of using DuckDB's results directly to show dataframes in:

st.dataframe(df)

Hello @napster_8l, and welcome to the Streamlit Community! :hugs:

Two things come to mind to improve performance: first, you might consider optimizing your DuckDB queries, which includes using proper indexing, avoiding “select *” when not necessary, and pre-filtering data as much as possible.

You could also use DuckDB's capability to query CSV files directly, without loading them into a table, if the data doesn't need to persist. A quick sketch of both ideas:
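(The column names col_a and col_b below are illustrative, not from your post.)

import duckdb

conn = duckdb.connect()

# Query the CSV directly -- no CREATE TABLE step -- and let DuckDB
# scan only the columns and rows the query actually needs.
query = """
select col_a, col_b          -- project only what you need
from 'dummy.csv'
where col_a > 100            -- pre-filter before materializing
"""

# Materialize only the (much smaller) filtered result.
df = conn.sql(query).pl()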

I hope that helps.

Best,
Charly

By the way, you can also use st.cache_data to cache data-intensive functions. If you send me the full code, I can help devise further improvements :slight_smile:
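For instance, a minimal caching sketch (run_query is an illustrative name, not from your code):

import duckdb
import streamlit as st

conn = duckdb.connect()

@st.cache_data
def run_query(sql: str):
    # Streamlit caches the converted result, so app reruns
    # skip both the query and the conversion step.
    return conn.sql(sql).pl()

df = run_query("select * from dummy")
st.dataframe(df)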

Charly

Thank you for pointing out details like those.

I get that I shouldn't be using select * and should be writing efficient queries, etc. But this was just a test to pin down where exactly I am being inefficient. What I found is that the bottleneck is converting the aggregated results into pandas DataFrames, and in my use case there are simply a lot of lines in the code doing exactly that.

I just came here to learn IF there is any efficient way of displaying dataframes without having to convert to a pandas DataFrame, a Polars DataFrame, or an Arrow table.

How long does it take to create the DataFrame, without showing it?

df = conn.sql(query).df()

See the docs for details on what objects can be passed to st.dataframe().
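For example, st.dataframe() accepts a pyarrow.Table, so one option is to skip the pandas conversion entirely (a sketch; the limit is illustrative):

import duckdb
import streamlit as st

conn = duckdb.connect()

# arrow() materializes the result as a pyarrow.Table, which
# st.dataframe() can render without a pandas round trip.
st.dataframe(conn.sql("select * from dummy limit 1000").arrow())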

The df() method is taking 6 seconds.

Definitely not the optimal solution out there, so I would stick with the pl() method for now.
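If you go that route, recent Streamlit versions can render a Polars DataFrame directly; with older versions you may need to convert first. A sketch, with the fallback noted in the comment:

import duckdb
import streamlit as st

conn = duckdb.connect()

pl_df = conn.sql("select * from dummy limit 1000").pl()
# Recent Streamlit releases accept Polars DataFrames in st.dataframe();
# if yours does not, fall back to pl_df.to_arrow() or pl_df.to_pandas().
st.dataframe(pl_df)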

I learnt that .show() does not really execute the full query; it is just a lazy preview of the query result.

The heavy operation is only performed when df(), pl(), or arrow() is called.

Among them, it turns out that arrow() and pl() are significantly better in terms of performance.
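A small timing sketch of that observation (illustrative only, assuming the same dummy.csv as above):

import time
import duckdb

conn = duckdb.connect()
rel = conn.sql("select * from 'dummy.csv'")

# show() only prints a preview; the full result is never materialized.
t0 = time.time()
rel.show()
print("show():  %.3f s" % (time.time() - t0))

# arrow() actually executes the query and materializes the result.
t0 = time.time()
tbl = rel.arrow()
print("arrow(): %.3f s" % (time.time() - t0))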