Is there a way to quickly show big data based on input from user?

Hello everyone! I am relatively new to Streamlit, so I am sorry if my issue is silly.

I have concatenated big data (50 CSV files) with around 30 million rows and 15 columns; the total size is approximately 3.2 GB. I am building a project that searches column values using Streamlit's text input and the str.contains() method, then shows the result to the end user via st.dataframe(). My code is below:

input = st.text_input(label = 'Search the Receiver')
selected_receiver = df['Receivers'].str.contains(input.upper(), na = False)
st.dataframe(df[selected_receiver])
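For context, here is the same filter on a tiny toy DataFrame (the real frame has 30 million rows; the values and query below are made up just to show the pattern):

```python
import pandas as pd

# Toy stand-in for the real 30M-row frame
df = pd.DataFrame({"Receivers": ["ACME CORP", "acme ltd", None, "OTHER"]})

query = "acme"  # stands in for the value returned by st.text_input
# str.contains builds a boolean mask; na=False treats missing values as non-matches
mask = df["Receivers"].str.contains(query.upper(), na=False)
print(df[mask])  # only "ACME CORP" matches, since the match is case-sensitive
```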

The problem is that when I concat the CSV files with pandas, execution takes around 8 minutes with @st.cache(hash_funcs={pd.DataFrame: lambda _: None}), and the first search takes around 6 minutes. After the first search, subsequent queries take around 1 minute.

When I concat the CSV files with dask, execution takes around 2-3 seconds, but nothing shows on the screen. I've waited 40 minutes and only see "Running" in the top-right corner. I've tried the compute() call as below, but it didn't work.

selected_receiver = df['Receivers'].str.contains(input.upper(), na = False)
st.dataframe(df[selected_receiver].compute())

Long story short: is there an efficient way to quickly show big data based on user input, or do I need a cloud service like Coiled for GB-sized data frames?

I think you shouldn't need to do anything too special. I tried reproducing something similar; here are a few differences/suggestions.

  1. Are you able to pre-concatenate the 50 csvs, or do you have to do that every time? I suspect you should be able to speed things up a lot by concatenating first, and saving the resulting file. Are the individual files 30M rows and 3.2GB, or the total? If the individual files are actually 3.2GB, then you’re probably running into RAM limitations, and you won’t be able to do much better without running a separate server/cluster to do the calculations for you. If the total is 3.2GB, you should be able to handle that locally reasonably well.

  2. You can use the newer st.experimental_memo and st.experimental_singleton for your caching.
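For suggestion 1, a one-time preprocessing sketch (the paths and file names are placeholders; this builds two tiny stand-in CSVs just so the snippet runs end to end):

```python
import glob
import os
import tempfile

import pandas as pd

# Work in a temp directory with two small stand-in CSVs
# (in practice these would be the 50 real files)
workdir = tempfile.mkdtemp()
for i in range(2):
    pd.DataFrame({"Receivers": [f"R{i}A", f"R{i}B"]}).to_csv(
        os.path.join(workdir, f"part{i}.csv"), index=False
    )

# One-time preprocessing: concatenate once and save a single file,
# so the app never repeats the slow concat step
parts = [pd.read_csv(p) for p in sorted(glob.glob(os.path.join(workdir, "*.csv")))]
combined = pd.concat(parts, ignore_index=True)
# Pickle needs no extra dependencies; to_parquet() is even faster to
# reload if pyarrow is installed
combined.to_pickle(os.path.join(workdir, "combined.pkl"))

# The app then loads the pre-built file, which is far faster than re-reading CSVs
df = pd.read_pickle(os.path.join(workdir, "combined.pkl"))
print(len(df))
```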

Here’s my attempt to reproduce something similar, with a randomly generated dataset, ending up being 1.5GB (only containing integers). For this case, loading the initial csv takes ~12s on my machine, and doing a search takes ~13s.

import time

import numpy as np
import pandas as pd
import streamlit as st


def _make_csv():
    n_rows = 30_000_000
    n_cols = 15
    df = pd.DataFrame(
        np.random.randint(0, 100, size=(n_rows, n_cols)),
        columns=["col%d" % i for i in range(n_cols)],
    )
    df.to_csv("sample.csv", index=False)


@st.experimental_singleton
def get_data() -> pd.DataFrame:
    start = time.time()
    df = pd.read_csv("sample.csv")
    end = time.time()
    elapsed = end - start
    st.write("Elapsed time loading data:", elapsed)
    return df


@st.experimental_memo
def search_data(search_term: str) -> pd.DataFrame:
    df = get_data()
    start = time.time()
    filtered = df[df["col1"].astype("str").str.contains(search_term, na=False)]
    end = time.time()
    elapsed = end - start
    st.write("Elapsed time:", elapsed)
    return filtered


search_input = st.text_input(label="Search the Receiver")

if search_input:
    filtered = search_data(search_input)
    st.dataframe(filtered)
    st.write("Loaded", len(filtered), "rows")
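One extra thought on the dask case (an assumption on my part, not something I benchmarked): df[selected_receiver].compute() may still materialize every matching row, and serializing millions of rows into st.dataframe is often what hangs, not the filter itself. Capping the preview before display is cheap insurance:

```python
import pandas as pd

# Toy frame standing in for the filtered result
df = pd.DataFrame({"Receivers": [f"RECEIVER {i}" for i in range(5000)]})

mask = df["Receivers"].str.contains("RECEIVER", na=False)
# Hand the UI at most 1000 rows; the browser never needs millions
preview = df[mask].head(1000)
print(len(preview))
```

In the app you would pass `preview` (rather than the full filtered frame) to st.dataframe().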

Hi @blackary

Thank you for your suggestions. I pre-concatenated the csv files and tried @st.experimental_singleton. It worked like a charm. I appreciate it; thank you for your help again!
