Filter large Pandas dataframes in Streamlit? (Pandas + boolean filters? Arrow? SQL? Punt?)

Jeremy · February 12, 2021, 11:10pm

I have a relatively large Arrow dataframe ~1.5M rows and a negotiable number of columns (but probably > 10 in any rate). I’ve been using Streamlit to convert UI elements to boolean Pandas filters that are then combined at the end to return the “final” data frame. This works splendidly for smaller, sample data but it really does not seem to scale up (Note: I don’t need to display a large subset of the dataframe. I’m currently using one of the hacks recommended here to chunk it up for display)

Just wondering what, if any, is the right way to proceed with dealing with large DFs in general. A few ideas I’ve thought of are:

Use Arrow to load the subsetted data: This is good, but while the filter-on-disk aspects of Arrow are growing, they’re nowhere near as powerful as Pandas, and I’d have to re-write all of my filters.
Use SQL: The same issue as 1, except the filtering is much more robust. I’m guessing this may be the answer, and I’m just going to have to throw away all my [sigh, carefully crafted] streamlit UI-to-DF-filter work
Someone has magic advice that helps this scale better! (Hopefully??!)

mkleinbort · February 13, 2021, 12:02am

Try using method chaining

e.g. if the filters are

df['age'].between(10,50)
df['name'] != 'Bob'
df['city'] == 'London'

Instead of

a = df['age'].between(10,50)
b = df['name'] != 'Bob'
c = df['city'] == 'London'

ans = df[a & b & c]

Try

ans = (df
       .loc[lambda x: x['age'].between(10,50)]
       .loc[lambda x: x['name'] != 'Bob']
       .loc[lamdba x: x['city'] == 'London']
)

This processes fewer rows after every filter.

Or use something like Dask ?

I’ve had good results with .parquet files + dask.

Jeremy · February 15, 2021, 2:46am

That’s interesting! I’ll see if that helps.

I’m actually not sure if the filtering is where the bottleneck is, or if it’s the initial load of the [full] data frame from disk. My guess is it’s the latter, in which case I may have to use a combination of things – or abandon Pandas altogether – to keep the loads slimmer. But it really is disheartening to invest so much in to this one solution and have it not only not scale, but not scale for just one connection! I shudder to think how it’d do actually deployed on a server

But I’ll definitely give that a shot and would love to hear any other advice!

randyzwitch · February 16, 2021, 4:18pm

Hi @Jeremy, welcome to the Streamlit community!

I think @mkleinbort is on the right track (hi, welcome to the Streamlit community also! ) with method chaining, but I’m also surprised that the data loading is so slow. Any chance you can share the code so I can get a better understanding of what sort of processing is happening?

Best,
Randy

Jeremy · February 17, 2021, 12:34am

Thanks for the reply, @randyzwitch . I’ll note that I need to update to the latest version of Streamlit, and I haven’t implemented the chaining yet to see if that helps. ButI just ran things again and it basically ground to a halt, so I’m pretty sure I’m doing something wrong!

Here’s some code I use to load the initial set of dataframes

@st.cache
def load_parquet(years):
    # years comes from a st.slider
    files = list()
    for year in range(years[0], years[1]+1):
        # Each parquet file is O(15-20) MB
        filename = f"/path/to/parquet/files/{year}.parquet"
        # data_cols is O(20) or so columns
        df = pd.read_parquet(filename, columns=data_cols)
        files.append(df)
    df = pd.concat(files)
    df = df.reset_index()
    return df

My [terrible] spidey sense is telling me this might be a bad idea. My hope was to only load the data for the years needed, but maybe it just makes sense to throw it all into one big DF and load it regardless?

(I’ll note that I basically do this load right after the years slider, but before all the rest of the selection components. It feels like this should be okay, since it’s cached and the data shouldn’t be reloaded unless someone changes the year selection but maybe not?)

I use the Streamlit UI components to generate masks. These vary widely (sliders, multi selects, checkboxes, etc.), but they’re basically relatively straightforward things like

names = st.sidebar.multiselect('Select names', helper.ALL_NAMES, default=default_name, key='name_select')
masks['names'] = data['names'].isin(names)`

(noting to be careful to keep the multiselect options to < 150 or so). I then apply all my masks with:

# Mask aggregator
def aggregate_masks(data, masks, operator='and'):
    if operator == 'and':
        return data[reduce(np.logical_and, masks.values())]
    elif operator =='or':
        return data[reduce(np.logical_or, masks.values())]

and then I use a hack I found on these forums to display a subset of the “final” DF

ret = aggregate_masks(data, masks)

page_size = 1000
page_number = st.number_input(
    label="Page Number",
    min_value=1,
    max_value=math.ceil(len(ret)/page_size),
    step=1,
)
current_start = (page_number-1)*page_size
current_end = page_number*page_size if len(ret) > page_number*page_size else len(ret)

st.write(ret[current_start:current_end])
st.write(f"DIsplaying rows {current_start} to {current_end}")

That’s basically it as far as [semi-obvious?] problem spots. As I mentioned, the rest is just UI-to-mask conversion (that I spent a fair amount of time on, and would love to keep vs. converting it all to, say, SQL queries or what have you).

So my hope is that I’m just doing something dumb with the initial loading of the data, and/or the shift to chaining will speed things up. But if anyone sees anything else that’s out of whack, I’d love to know about it.

Thanks for the help!

PeterT · February 17, 2021, 7:19am

Hi Jeremy,

To my eyes your filtering looks quite solid. I have a few pointers below which may or may not help you improve the speed

Regarding the years slider I think it can be a good idea, although it can in some cases also be beneficial to just do the data loading once and then perhaps control the years through filters on the loaded data. Be aware though that the slider executes not on mouserelease, but on change and thus using a slider can make the function trigger an unwanted amount of times, perhaps a number input would serve the purpose better.

I’ve also found that adding an st.cache on a function like your mask aggregator can result in significant speedup, but of course at the cost of memory. In addition, you may consider splitting the aggregate_masks function up such that you have a function for creating the masks, i.e. returning something like reduce(np.logical_and, masks.values()) and a function to apply this filtering to the data, and make sure that the final mask is calculated before applying the filtering to data.

Finally, I found it extremely helpful myself to time the execution of individual codesnippets and/or functions to narrow down my issues and found bottlenecks where I would’ve never expected them… Perhaps the two ways of making timers below can be helpful to you.


def timer(func, enabled=ENABLE_TIMER):
    """Timer decorator which decorates a function such that it logs time spend.

    Usage:
    @timer
    def my_function():
        do stuff

    see also https://realpython.com/python-timer/#a-python-timer-decorator
    """
    if not enabled:
        return func

    @wraps(func)
    def _time_it(*args, **kwargs):
        start = int(round(time() * 1000))
        try:
            return func(*args, **kwargs)
        finally:
            end_ = int(round(time() * 1000)) - start
            logger.info(
                f"{func.__name__} - execution time: {end_ if end_ > 0 else 0} ms"
            )

    return _time_it


class TimerError(Exception):
    """A custom exception used to report errors in use of Timer class"""


@dataclass
class Timer:
    """Timer to be used when timing something else than functions
    usage: time
    with Timer(logger=logger, text="Doing stuff")
        ...do stuff

    see also https://realpython.com/python-timer/

    """

    timers: ClassVar[Dict[str, float]] = dict()
    name: Optional[str] = None
    logger: Optional[Callable[[str], None]] = print
    _start_time: Optional[float] = field(default=None, init=False, repr=False)

    def __init__(self, logger, text="") -> None:
        self.logger = logger
        self.text = text

    def create_text(self, text, elapsed):
        return f"{text} - Elapsed time: {elapsed:0.4f} seconds"

    def __post_init__(self) -> None:
        """Add timer to dict of timers after initialization"""
        if self.name is not None:
            self.timers.setdefault(self.name, 0)

    def start(self) -> None:
        """Start a new timer"""
        if self._start_time is not None:
            raise TimerError(f"Timer is running. Use .stop() to stop it")

        self._start_time = perf_counter()

    def stop(self) -> float:
        """Stop the timer, and report the elapsed time"""
        if self._start_time is None:
            raise TimerError(f"Timer is not running. Use .start() to start it")

        # Calculate elapsed time
        elapsed_time = perf_counter() - self._start_time
        self._start_time = None

        # Report elapsed time
        if self.logger:
            self.logger.info(self.create_text(self.text, elapsed_time))
        if self.name:
            self.timers[self.name] += elapsed_time

        return elapsed_time

    def __enter__(self):
        """Start a new timer as a context manager"""
        self.start()
        return self

    def __exit__(self, *exc_info):
        """Stop the context manager timer"""
        self.stop()

Be aware that the examples use a ‘logger’ class. You get a logger fromthe python standard library, something like

import logging
logger = logging.getLogger("timing_logger")

will do the trick I think. Otherwise you can replace the logging with print statements or perhaps even st.write statements

Best regards,
Peter

randyzwitch · February 17, 2021, 2:55pm

If I had to guess, I suspect this operation might be more expensive than you think. I would think this would create a lot of internal copies (this is a pandas quirk), which could be using more time and memory than your original data would suggest.

For 15-20MB parquet files, I would expect there would be near instant load times. @PeterT 's post about timing individual parts seems like a good next step, to really understand the performance characteristics at play here.

Best,
Randy

Topic		Replies	Views
Paginate large dataframe ("server side") AND use dynamic widgets for dataframe filtering? Using Streamlit session-state , pandas	1	1491	November 1, 2022
Filter dataframe by selections made in select box Using Streamlit pandas	15	53552	December 26, 2023
Please help a confused guy with Streamlit/DataFrames/Filter - Updated Using Streamlit	5	611	June 26, 2024
Scalabilty of streamlit for pandas Using Streamlit cache , pandas	4	8980	November 19, 2021
Auto-generate a dataframe filtering UI in Streamlit with filter_dataframe! Show the Community!	12	3691	November 28, 2024

Filter large Pandas dataframes in Streamlit? (Pandas + boolean filters? Arrow? SQL? Punt?)

Related topics

Hello there 👋🏻

Cookie settings

Strictly necessary cookies

Performance cookies

Functional cookies

Targeting cookies