Is there a way to quickly show big data based on input from user?

Hello everyone! I am relatively new to Streamlit, so I am sorry if my issue is silly.

I have concatenated big data (50 CSV files) with around 30 million rows and 15 columns; the total size is approximately 3.2 GB. I am building a project that searches column values using Streamlit's text input and the str.contains() method, then shows the result to the end user via st.dataframe(). My code is below:

input = st.text_input(label = 'Search the Receiver')
selected_receiver = df['Receivers'].str.contains(input.upper(), na = False)
st.dataframe(df[selected_receiver])
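For context, here is the same filter on a tiny toy DataFrame (the real frame has 30 million rows; the values and query below are made up just to show the pattern):

```python
import pandas as pd

# Toy stand-in for the real 30M-row frame
df = pd.DataFrame({"Receivers": ["ACME CORP", "acme ltd", None, "OTHER"]})

query = "acme"  # stands in for the value returned by st.text_input
# str.contains builds a boolean mask; na=False treats missing values as non-matches
mask = df["Receivers"].str.contains(query.upper(), na=False)
print(df[mask])  # only "ACME CORP" matches, since the match is case-sensitive
```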

The problem is that when I concat the CSV files with pandas, execution takes around 8 minutes with @st.cache(hash_funcs={pd.DataFrame: lambda _: None}), and the first search takes around 6 minutes. After the first search, subsequent queries take around 1 minute.

When I concat the CSV files with dask, execution takes around 2-3 seconds, but nothing shows on the screen. I've waited 40 minutes and only see "Running" in the top-right corner. I've tried the compute() call as below, but it didn't work.

selected_receiver = df['Receivers'].str.contains(input.upper(), na = False)
st.dataframe(df[selected_receiver].compute())

Long story short: is there an efficient way to quickly show big data based on user input, or do I need a cloud service like Coiled for GB-sized data frames?

I think you shouldn't need to do anything too special. I tried reproducing something similar; here are a few differences/suggestions.

  1. Are you able to pre-concatenate the 50 csvs, or do you have to do that every time? I suspect you should be able to speed things up a lot by concatenating first, and saving the resulting file. Are the individual files 30M rows and 3.2GB, or the total? If the individual files are actually 3.2GB, then you’re probably running into RAM limitations, and you won’t be able to do much better without running a separate server/cluster to do the calculations for you. If the total is 3.2GB, you should be able to handle that locally reasonably well.

  2. You can use the newer st.experimental_memo and st.experimental_singleton for your caching.
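For suggestion 1, a one-time preprocessing sketch (the paths and file names are placeholders; this builds two tiny stand-in CSVs just so the snippet runs end to end):

```python
import glob
import os
import tempfile

import pandas as pd

# Work in a temp directory with two small stand-in CSVs
# (in practice these would be the 50 real files)
workdir = tempfile.mkdtemp()
for i in range(2):
    pd.DataFrame({"Receivers": [f"R{i}A", f"R{i}B"]}).to_csv(
        os.path.join(workdir, f"part{i}.csv"), index=False
    )

# One-time preprocessing: concatenate once and save a single file,
# so the app never repeats the slow concat step
parts = [pd.read_csv(p) for p in sorted(glob.glob(os.path.join(workdir, "*.csv")))]
combined = pd.concat(parts, ignore_index=True)
# Pickle needs no extra dependencies; to_parquet() is even faster to
# reload if pyarrow is installed
combined.to_pickle(os.path.join(workdir, "combined.pkl"))

# The app then loads the pre-built file, which is far faster than re-reading CSVs
df = pd.read_pickle(os.path.join(workdir, "combined.pkl"))
print(len(df))
```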

Here’s my attempt to reproduce something similar, with a randomly generated dataset, ending up being 1.5GB (only containing integers). For this case, loading the initial csv takes ~12s on my machine, and doing a search takes ~13s.

import time

import numpy as np
import pandas as pd
import streamlit as st


def _make_csv():
    n_rows = 30_000_000
    n_cols = 15
    df = pd.DataFrame(
        np.random.randint(0, 100, size=(n_rows, n_cols)),
        columns=["col%d" % i for i in range(n_cols)],
    )
    df.to_csv("sample.csv", index=False)


@st.experimental_singleton
def get_data() -> pd.DataFrame:
    start = time.time()
    df = pd.read_csv("sample.csv")
    end = time.time()
    elapsed = end - start
    st.write("Elapsed time loading data:", elapsed)
    return df


@st.experimental_memo
def search_data(search_term: str) -> pd.DataFrame:
    df = get_data()
    start = time.time()
    filtered = df[df["col1"].astype("str").str.contains(search_term, na=False)]
    end = time.time()
    elapsed = end - start
    st.write("Elapsed time:", elapsed)
    return filtered


search_input = st.text_input(label="Search the Receiver")

if search_input:
    filtered = search_data(search_input)
    st.dataframe(filtered)
    st.write("Loaded", len(filtered), "rows")
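One extra thought on the dask case (an assumption on my part, not something I benchmarked): df[selected_receiver].compute() may still materialize every matching row, and serializing millions of rows into st.dataframe is often what hangs, not the filter itself. Capping the preview before display is cheap insurance:

```python
import pandas as pd

# Toy frame standing in for the filtered result
df = pd.DataFrame({"Receivers": [f"RECEIVER {i}" for i in range(5000)]})

mask = df["Receivers"].str.contains("RECEIVER", na=False)
# Hand the UI at most 1000 rows; the browser never needs millions
preview = df[mask].head(1000)
print(len(preview))
```

In the app you would pass `preview` (rather than the full filtered frame) to st.dataframe().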

Hi @blackary

Thank you for your suggestions. I pre-concatenated the csv files and tried @st.experimental_singleton. It worked like a charm. I appreciate it; thank you for your help again!
