Context:
I'm working on a 20- to 30-page multi-page Streamlit dashboard, so speed is becoming critical.
The app is currently running locally, so I cannot share a link to the dashboard.
import duckdb
import pandas as pd
import polars as pl
import time
conn = duckdb.connect()
# dummy.csv is a test file I created with 100 million rows (~3 GB).
query = \
"""
CREATE TABLE dummy as (
select * from 'dummy.csv'
)
"""
# Creates table
conn.execute(query)
# Part 1: Testing with DuckDB's show()
start_time = time.time()
query = \
"""
select *
from dummy
"""
# Note: .show() only prints a truncated preview and returns None,
# so df here is not an actual DataFrame I could pass to Streamlit.
df = conn.sql(query).show()
print("--- %s seconds ---" % (time.time() - start_time))
This took my machine 0.013511896133422852 seconds, but .show() does not give me a DataFrame I can feed to my Streamlit application.
So currently I am converting the result to either pandas or polars, which has been much slower than the DuckDB step itself, and I am not liking that workflow.
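For reference, the conversion step I am talking about looks roughly like this (a rough sketch that reuses the conn and dummy table from the code above; as far as I know .df(), .pl(), and .arrow() are DuckDB's standard relation conversion methods, and the timings will obviously vary by machine):

import time
# reuses the `conn` connection and `dummy` table created above

rel = conn.sql("select * from dummy")

start_time = time.time()
pdf = rel.df()        # materialize the full result as a pandas DataFrame
print("pandas --- %s seconds ---" % (time.time() - start_time))

start_time = time.time()
pldf = rel.pl()       # materialize as a polars DataFrame
print("polars --- %s seconds ---" % (time.time() - start_time))

start_time = time.time()
tbl = rel.arrow()     # materialize as a pyarrow Table
print("arrow  --- %s seconds ---" % (time.time() - start_time))

These conversion calls are where almost all of the time goes, not the query itself.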
Hello @napster_8l, and welcome to the Streamlit Community!
Two things come to mind to improve performance: first, you might consider optimizing your DuckDB queries, which includes using proper indexing, avoiding “select *” when not necessary, and pre-filtering data as much as possible.
You could also use DuckDB’s capability to directly query CSV files without loading them into a table if the data doesn’t need to persist.
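To sketch the idea (the column names below are placeholders, not from your actual schema), you could query the CSV directly and push the filtering and aggregation into DuckDB, so that only the small aggregated result ever needs to be converted on the Python side:

import duckdb

conn = duckdb.connect()

# Query the CSV directly: no CREATE TABLE step, and only the small
# aggregated result is materialized as a pandas DataFrame.
result_df = conn.sql(
    """
    SELECT some_category, COUNT(*) AS n, SUM(some_value) AS total
    FROM 'dummy.csv'
    WHERE some_date >= DATE '2024-01-01'
    GROUP BY some_category
    """
).df()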
Thank you for pointing out those details.
I get that I shouldn't be doing select * and should be writing efficient queries, etc., but that code was only for testing where exactly I am being inefficient. What I found is that the bottleneck is converting the aggregated results into pandas DataFrames, and in my use case there are simply a lot of lines in the code where I do exactly that conversion.
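To illustrate, the pattern repeated across my pages looks roughly like this (the table and column names here are just placeholders):

import duckdb
import streamlit as st

conn = duckdb.connect()

# One of many such lines: aggregate in DuckDB, convert to pandas, display.
agg_query = "SELECT region, SUM(sales) AS total_sales FROM facts GROUP BY region"
df = conn.sql(agg_query).df()   # this conversion is the part I'd like to avoid
st.dataframe(df)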
I just came here to learn if there are any efficient ways of displaying dataframes without having to convert to a polars DataFrame, an Arrow table, or a pandas DataFrame.