Is there a way to quickly show big data based on user input?

Hello everyone! I am relatively new to Streamlit. I am sorry if my issue is silly.

I have concatenated a large dataset (50 csv files) with around 30 million rows and 15 columns; the total size is approximately 3.2 GB. I am building a project that searches column values using Streamlit's text input and the str.contains() method, then shows the result to the end user via st.dataframe(). My code is below:

input = st.text_input(label='Search the Receiver')
selected_receiver = df['Receivers'].str.contains(input.upper(), na=False)
st.dataframe(df[selected_receiver])

The problem is that when I concatenate the csv files with pandas, loading takes around 8 minutes even with @st.cache(hash_funcs={pd.DataFrame: lambda _: None}), and the first search takes around 6 minutes. After the first search, subsequent queries take around 1 minute.
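
For reference, the load-and-cache step looks roughly like this (a sketch; the function name, data folder, and glob pattern are placeholders rather than my exact code):

import glob

import pandas as pd
import streamlit as st


@st.cache(hash_funcs={pd.DataFrame: lambda _: None})
def load_data() -> pd.DataFrame:
    # Placeholder path: adjust to wherever the 50 csv files live
    files = sorted(glob.glob("data/*.csv"))
    return pd.concat((pd.read_csv(f) for f in files), ignore_index=True)


df = load_data()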

When I concatenate the csv files with dask, loading takes around 2-3 seconds, but nothing shows on the screen. I've waited 40 minutes and only see "Running" in the right corner. I've tried calling compute() as below, but it didn't work.

selected_receiver = df['Receivers'].str.contains(input.upper(), na=False)
st.dataframe(df[selected_receiver].compute())
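
The dask loading step is roughly this (a sketch on my end; the glob pattern and forcing str dtypes are placeholders):

import dask.dataframe as dd

# Placeholder glob; dask only builds a lazy task graph here,
# which is why the concat/load step finishes in a few seconds
df = dd.read_csv("data/*.csv", dtype=str)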

Long story short, is there an efficient way to quickly show big data based on user input, or do I need a cloud service like Coiled for GB-sized data frames?

I don't think you should need to do anything too special. I tried reproducing something similar, though with a few differences/suggestions:

  1. Are you able to pre-concatenate the 50 csvs, or do you have to do that every time? I suspect you can speed things up a lot by concatenating them once and saving the resulting file (see the sketch after this list). Are the individual files 30M rows and 3.2GB, or is that the total? If the individual files are actually 3.2GB each, then you're probably running into RAM limitations, and you won't be able to do much better without running a separate server/cluster to do the calculations for you. If the total is 3.2GB, you should be able to handle that locally reasonably well.

  2. You can use the newer st.experimental_memo and st.experimental_singleton for your caching.
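
For suggestion 1, the one-time pre-concatenation could look roughly like this (a sketch; the folder path and output filename are placeholders):

import glob

import pandas as pd

# Run this once, outside the Streamlit app
files = sorted(glob.glob("data/*.csv"))
combined = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
combined.to_csv("combined.csv", index=False)

After that, the app only ever reads a single file, so the 8-minute concatenation cost is paid once rather than on every load.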

Here's my attempt at reproducing something similar with a randomly generated dataset, which ends up being 1.5GB (containing only integers). In this case, loading the initial csv takes ~12s on my machine, and a search takes ~13s.

import time

import numpy as np
import pandas as pd
import streamlit as st


def _make_csv():
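    # Run this once to generate the ~1.5GB sample csv that get_data() reads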
    n_rows = 30_000_000
    n_cols = 15
    df = pd.DataFrame(
        np.random.randint(0, 100, size=(n_rows, n_cols)),
        columns=["col%d" % i for i in range(n_cols)],
    )
    df.to_csv("sample.csv", index=False)


@st.experimental_singleton
def get_data() -> pd.DataFrame:
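    # The singleton cache means the csv is only parsed once per server process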
    start = time.time()
    df = pd.read_csv("sample.csv")
    end = time.time()
    elapsed = end - start
    st.write("Elapsed time loading data:", elapsed)
    return df


@st.experimental_memo
def search_data(search_term: str) -> pd.DataFrame:
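    # The memo cache stores the filtered result per search term,
    # so repeating a search returns instantly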
    df = get_data()
    start = time.time()
    filtered = df[df["col1"].astype("str").str.contains(search_term, na=False)]
    end = time.time()
    elapsed = end - start
    st.write("Elapsed time:", elapsed)
    return filtered


input = st.text_input(label="Search the Receiver")

if input:
    filtered = search_data(input)
    st.dataframe(filtered)
    st.write("Loaded", len(filtered), "rows")

Hi @blackary

Thank you for your suggestions. I've pre-concatenated the csv files and tried @st.experimental_singleton. It worked like a charm. I appreciate it, thank you for your help again!
