Hello everyone! I am relatively new to Streamlit, so I am sorry if my issue is silly.
I have concatenated big data (50 csv files) with around 30 million rows and 15 columns; the total size is approximately 3.2 gigabytes. I am working on a project that searches column values based on Streamlit's text input using the str.contains() method and shows the result to the end user via st.dataframe(). My code is below:
input = st.text_input(label='Search the Receiver')
selected_receiver = df['Receivers'].str.contains(input.upper(), na=False)
st.dataframe(df[selected_receiver])
The problem: when I concatenate the csv files with pandas, the concatenation takes around 8 minutes with @st.cache(hash_funcs={pd.DataFrame: lambda _: None}), and the first search takes around 6 minutes. After the first search, subsequent queries take around 1 minute.
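For context, my concatenation and caching looks roughly like this (the data/*.csv path and the load_data name are just placeholders for my actual setup):

import glob

import pandas as pd
import streamlit as st

# Hash the DataFrame as None so Streamlit doesn't try to hash ~3.2 GB on every rerun
@st.cache(hash_funcs={pd.DataFrame: lambda _: None})
def load_data() -> pd.DataFrame:
    # "data/*.csv" is a placeholder for wherever the 50 csv files actually live
    files = sorted(glob.glob("data/*.csv"))
    return pd.concat([pd.read_csv(f) for f in files], ignore_index=True)

df = load_data()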
When I concatenate the csv files with dask, the concatenation takes around 2-3 seconds, but nothing shows up on the screen. I have waited 40 minutes and only see "Running" in the top-right corner. I tried calling compute() as shown below, but it didn't work.
selected_receiver = df['Receivers'].str.contains(input.upper(), na=False)
st.dataframe(df[selected_receiver].compute())
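For completeness, the full dask version looks roughly like this (again, the data/*.csv path is a placeholder for my actual files):

import dask.dataframe as dd
import streamlit as st

# dd.read_csv only builds a lazy task graph, which is why this step takes ~2-3 seconds
df = dd.read_csv("data/*.csv")  # placeholder path for the 50 csv files

input = st.text_input(label="Search the Receiver")
selected_receiver = df["Receivers"].str.contains(input.upper(), na=False)
# .compute() is where the full read and filter over ~3.2 GB actually happens
st.dataframe(df[selected_receiver].compute())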
Long story short: is there an efficient way to quickly show big data based on user input, or do I need a cloud service like Coiled for GB-sized data frames?
I think you shouldn’t need to do anything too special. I tried reproducing something similar, maybe with a few differences/suggestions.
Are you able to pre-concatenate the 50 csvs, or do you have to do that every time? I suspect you should be able to speed things up a lot by concatenating first, and saving the resulting file. Are the individual files 30M rows and 3.2GB, or the total? If the individual files are actually 3.2GB, then you’re probably running into RAM limitations, and you won’t be able to do much better without running a separate server/cluster to do the calculations for you. If the total is 3.2GB, you should be able to handle that locally reasonably well.
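If you can pre-concatenate, a rough one-off script like this (file paths are placeholders) would let your app read a single file instead of redoing the concatenation on every cold start:

import glob

import pandas as pd

# Run this once, outside the Streamlit app
files = sorted(glob.glob("data/*.csv"))  # placeholder for the 50 csv files
combined = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)
combined.to_csv("combined.csv", index=False)  # the app then only ever reads combined.csv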
You can use the newer st.experimental_memo and st.experimental_singleton for your caching.
Here's my attempt to reproduce something similar with a randomly generated dataset that ends up being about 1.5 GB (containing only integers). For this case, loading the initial csv takes ~12s on my machine, and doing a search takes ~13s.
import time

import numpy as np
import pandas as pd
import streamlit as st


def _make_csv():
    # Run once to generate a ~1.5 GB sample file of random integers
    n_rows = 30_000_000
    n_cols = 15
    df = pd.DataFrame(
        np.random.randint(0, 100, size=(n_rows, n_cols)),
        columns=["col%d" % i for i in range(n_cols)],
    )
    df.to_csv("sample.csv", index=False)


@st.experimental_singleton
def get_data() -> pd.DataFrame:
    # Loaded once per server process and shared across sessions
    start = time.time()
    df = pd.read_csv("sample.csv")
    end = time.time()
    elapsed = end - start
    st.write("Elapsed time loading data:", elapsed)
    return df


@st.experimental_memo
def search_data(search_term: str) -> pd.DataFrame:
    # Results are memoized per search term, so repeated searches return immediately
    df = get_data()
    start = time.time()
    filtered = df[df["col1"].astype("str").str.contains(search_term, na=False)]
    end = time.time()
    elapsed = end - start
    st.write("Elapsed time:", elapsed)
    return filtered


input = st.text_input(label="Search the Receiver")

if input:
    filtered = search_data(input)
    st.dataframe(filtered)
    st.write("Loaded", len(filtered), "rows")
Thank you for your suggestions. I pre-concatenated the csv files and tried @st.experimental_singleton. It worked like a charm. I appreciate it, thank you for your help again!