Scalability of Streamlit for pandas

Hi, I wanted to know how many rows Streamlit is able to render using a pandas DataFrame. How about around 100,000 rows or so?

Absolutely, here is a gist that demonstrates this:

import streamlit as st
import pandas as pd

# Note: DATA_BUCKET (the base URL of the data bucket) is not defined in this snippet.
DATA_URL = DATA_BUCKET + "uber-raw-data-sep14.csv.gz"
read_and_cache_csv = st.cache(pd.read_csv)
data = read_and_cache_csv(DATA_URL, nrows=100000)
st.write('Data', data)

In fact, you can try this gist right now by running:

streamlit run

Note: Streamlit currently sends the entire DataFrame over the wire, which might be slow, especially the first time you run the report. This is a problem we’ve discussed, and I created an issue to track it. Please feel free to follow this issue on GitHub!
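To get a rough feel for how much data a given DataFrame represents before displaying it, one option is to look at its in-memory footprint. The sketch below uses pandas' `DataFrame.memory_usage`; the helper name `approx_size_mb`, the column names, and the synthetic data are illustrative assumptions, and the in-memory size is only a proxy for what is actually serialized and sent to the browser.

```python
import numpy as np
import pandas as pd


def approx_size_mb(df: pd.DataFrame) -> float:
    """Return the approximate in-memory size of df in megabytes.

    deep=True also counts the contents of object (e.g. string) columns.
    This is a proxy only: the bytes sent to the browser will differ.
    """
    return df.memory_usage(deep=True).sum() / 1e6


# Synthetic stand-in for a 100,000-row dataset (hypothetical columns).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "lat": rng.uniform(40.5, 41.0, size=100_000),
    "lon": rng.uniform(-74.3, -73.7, size=100_000),
})

print(f"{len(df):,} rows, roughly {approx_size_mb(df):.1f} MB in memory")
```

Running the same helper on your own data gives a quick sanity check before you pass it to st.write.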

Usually, displaying all 100,000 rows of a DataFrame is unnecessary and we suggest using interactive widgets to filter such DataFrames before sending them to the user. Here is an example of how to do that which you can also try right now:

streamlit run
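As a minimal sketch of the filtering idea (not the linked example itself), the snippet below narrows a DataFrame with a range slider and caps the number of rows shown. The names `filter_rows` and `render_filtered_view`, the `max_rows` default, and the choice of a numeric column are assumptions made for illustration.

```python
import pandas as pd


def filter_rows(df: pd.DataFrame, column: str, low: float, high: float,
                max_rows: int = 1000) -> pd.DataFrame:
    """Keep rows where low <= df[column] <= high, capped at max_rows rows."""
    mask = df[column].between(low, high)  # inclusive on both ends by default
    return df[mask].head(max_rows)


def render_filtered_view(df: pd.DataFrame, column: str) -> None:
    """Draw a range slider and show only the matching rows."""
    # Imported lazily so filter_rows can be used and tested without Streamlit.
    import streamlit as st

    # Assumes the column is numeric and has at least two distinct values.
    col_min, col_max = float(df[column].min()), float(df[column].max())
    # Passing a tuple as the default value turns st.slider into a range slider.
    low, high = st.slider(f"Range for {column}", col_min, col_max, (col_min, col_max))
    subset = filter_rows(df, column, low, high)
    st.write(f"Showing {len(subset)} of {len(df)} rows", subset)
```

In a script launched with streamlit run, you would call render_filtered_view(data, column) after loading the data, where column is whichever numeric column you want to filter on. Only the filtered subset is passed to st.write, so the full DataFrame stays on the server.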

How much beyond 100,000 rows should we expect it to be able to scale? Should 1 billion rows work? Or 100 billion rows?

And will it run into issues with caching?

I wasn’t able to get my app to load and cache a 200 billion line CSV file, so I tried reducing the size, but even at around 500,000 rows I’m having trouble with, for example, st.write(df). The app seems able to load and cache the CSV file, but when I try to do something with the DataFrame it hangs and the app screen goes blank.

Hi @Yasha. Welcome to the community! :hugs:

Yes. Large DataFrames can slow down a Streamlit app! In general, displaying more than 100k elements can start to get sluggish.

Please note that this doesn’t mean that Streamlit can’t be used with huge datasets, only that you can’t quickly display those datasets directly to the screen with st.write or st.dataframe. The good news is that usually you don’t actually want to send that many elements to the browser! :sunflower:

Instead you might want to:

  1. Display a quick-and-dirty subset to get an idea for your data.
  2. Write a little filter UI for your data and only display the subset the user wants to see. An example of such a UI is shown here in the Udacity dataset demo.

  3. Use something like display_dataframe_quickly defined in this gist and reproduced here:

import numpy as np
import pandas as pd
import streamlit as st

def display_dataframe_quickly(df, max_rows=5000, **st_dataframe_kwargs):
    """Display a subset of a DataFrame or Numpy Array to speed up app renders.

    Parameters
    ----------
    df : DataFrame | ndarray
        The DataFrame or Numpy array to render.
    max_rows : int
        The number of rows to display.
    st_dataframe_kwargs : Dict[Any, Any]
        Keyword arguments to the st.dataframe method.
    """
    n_rows = len(df)
    if n_rows <= max_rows:
        # As a special case, display small dataframe directly.
        st.dataframe(df, **st_dataframe_kwargs)
    else:
        # Slice the DataFrame to display less information.
        start_row = st.slider('Start row', 0, n_rows - max_rows)
        end_row = start_row + max_rows
        df = df[start_row:end_row]

        # Reindex Numpy arrays to make them more understandable.
        if isinstance(df, np.ndarray):
            df = pd.DataFrame(df)
            df.index = range(start_row, end_row)

        # Display everything.
        st.dataframe(df, **st_dataframe_kwargs)
        st.text('Displaying rows %i to %i of %i.' % (start_row, end_row - 1, n_rows))

To test this method you can run:

streamlit run

You should see this:

Of course the magic of Streamlit is that it usually “just works.” Therefore we are also working on improvements to make Streamlit faster for large DataFrames. In addition to the fix referenced above, we’re also considering using compression for Streamlit packets, or increasing responsiveness by showing a progress bar when sending large packets.

Please feel free to comment on or follow any of these issues for up-to-date information.