Filter large Pandas dataframes in Streamlit? (Pandas + boolean filters? Arrow? SQL? Punt?)

I have a relatively large Arrow dataframe ~1.5M rows and a negotiable number of columns (but probably > 10 in any rate). I’ve been using Streamlit to convert UI elements to boolean Pandas filters that are then combined at the end to return the “final” data frame. This works splendidly for smaller, sample data but it really does not seem to scale up (Note: I don’t need to display a large subset of the dataframe. I’m currently using one of the hacks recommended here to chunk it up for display)

Just wondering what, if any, is the right way to proceed with dealing with large DFs in general. A few ideas I’ve thought of are:

  1. Use Arrow to load the subsetted data: This is good, but while the filter-on-disk aspects of Arrow are growing, they’re nowhere near as powerful as Pandas, and I’d have to re-write all of my filters.
  2. Use SQL: The same issue as 1, except the filtering is much more robust. I’m guessing this may be the answer, and I’m just going to have to throw away all my [sigh, carefully crafted] streamlit UI-to-DF-filter work
  3. Someone has magic advice that helps this scale better! (Hopefully??!)

Try using method chaining

e.g. if the filters are

df['age'].between(10,50)
df['name'] != 'Bob'
df['city'] == 'London'

Instead of

a = df['age'].between(10,50)
b = df['name'] != 'Bob'
c = df['city'] == 'London'

ans = df[a & b & c]

Try

ans = (df
       .loc[lambda x: x['age'].between(10,50)]
       .loc[lambda x: x['name'] != 'Bob']
       .loc[lamdba x: x['city'] == 'London']
)

This processes fewer rows after every filter.

Or use something like Dask ?

I’ve had good results with .parquet files + dask.

3 Likes

That’s interesting! I’ll see if that helps.

I’m actually not sure if the filtering is where the bottleneck is, or if it’s the initial load of the [full] data frame from disk. My guess is it’s the latter, in which case I may have to use a combination of things – or abandon Pandas altogether – to keep the loads slimmer. But it really is disheartening to invest so much in to this one solution and have it not only not scale, but not scale for just one connection! I shudder to think how it’d do actually deployed on a server :slightly_frowning_face:

But I’ll definitely give that a shot and would love to hear any other advice!

Hi @Jeremy, welcome to the Streamlit community!

I think @mkleinbort is on the right track (hi, welcome to the Streamlit community also! :laughing:) with method chaining, but I’m also surprised that the data loading is so slow. Any chance you can share the code so I can get a better understanding of what sort of processing is happening?

Best,
Randy

Thanks for the reply, @randyzwitch . I’ll note that I need to update to the latest version of Streamlit, and I haven’t implemented the chaining yet to see if that helps. ButI just ran things again and it basically ground to a halt, so I’m pretty sure I’m doing something wrong!

Here’s some code I use to load the initial set of dataframes

@st.cache
def load_parquet(years):
    # years comes from a st.slider
    files = list()
    for year in range(years[0], years[1]+1):
        # Each parquet file is O(15-20) MB
        filename = f"/path/to/parquet/files/{year}.parquet"
        # data_cols is O(20) or so columns
        df = pd.read_parquet(filename, columns=data_cols)
        files.append(df)
    df = pd.concat(files)
    df = df.reset_index()
    return df

My [terrible] spidey sense is telling me this might be a bad idea. My hope was to only load the data for the years needed, but maybe it just makes sense to throw it all into one big DF and load it regardless?

(I’ll note that I basically do this load right after the years slider, but before all the rest of the selection components. It feels like this should be okay, since it’s cached and the data shouldn’t be reloaded unless someone changes the year selection but maybe not?)

I use the Streamlit UI components to generate masks. These vary widely (sliders, multi selects, checkboxes, etc.), but they’re basically relatively straightforward things like

names = st.sidebar.multiselect('Select names', helper.ALL_NAMES, default=default_name, key='name_select')
masks['names'] = data['names'].isin(names)`

(noting to be careful to keep the multiselect options to < 150 or so). I then apply all my masks with:

# Mask aggregator
def aggregate_masks(data, masks, operator='and'):
    if operator == 'and':
        return data[reduce(np.logical_and, masks.values())]
    elif operator =='or':
        return data[reduce(np.logical_or, masks.values())]

and then I use a hack I found on these forums to display a subset of the “final” DF

ret = aggregate_masks(data, masks)

page_size = 1000
page_number = st.number_input(
    label="Page Number",
    min_value=1,
    max_value=math.ceil(len(ret)/page_size),
    step=1,
)
current_start = (page_number-1)*page_size
current_end = page_number*page_size if len(ret) > page_number*page_size else len(ret)

st.write(ret[current_start:current_end])
st.write(f"DIsplaying rows {current_start} to {current_end}")

That’s basically it as far as [semi-obvious?] problem spots. As I mentioned, the rest is just UI-to-mask conversion (that I spent a fair amount of time on, and would love to keep vs. converting it all to, say, SQL queries or what have you).

So my hope is that I’m just doing something dumb with the initial loading of the data, and/or the shift to chaining will speed things up. But if anyone sees anything else that’s out of whack, I’d love to know about it.

Thanks for the help!

Hi Jeremy,

To my eyes your filtering looks quite solid. I have a few pointers below which may or may not help you improve the speed :slight_smile:

Regarding the years slider I think it can be a good idea, although it can in some cases also be beneficial to just do the data loading once and then perhaps control the years through filters on the loaded data. Be aware though that the slider executes not on mouserelease, but on change and thus using a slider can make the function trigger an unwanted amount of times, perhaps a number input would serve the purpose better.

I’ve also found that adding an st.cache on a function like your mask aggregator can result in significant speedup, but of course at the cost of memory. In addition, you may consider splitting the aggregate_masks function up such that you have a function for creating the masks, i.e. returning something like reduce(np.logical_and, masks.values()) and a function to apply this filtering to the data, and make sure that the final mask is calculated before applying the filtering to data.

Finally, I found it extremely helpful myself to time the execution of individual codesnippets and/or functions to narrow down my issues and found bottlenecks where I would’ve never expected them… Perhaps the two ways of making timers below can be helpful to you.


def timer(func, enabled=ENABLE_TIMER):
    """Timer decorator which decorates a function such that it logs time spend.

    Usage:
    @timer
    def my_function():
        do stuff

    see also https://realpython.com/python-timer/#a-python-timer-decorator
    """
    if not enabled:
        return func

    @wraps(func)
    def _time_it(*args, **kwargs):
        start = int(round(time() * 1000))
        try:
            return func(*args, **kwargs)
        finally:
            end_ = int(round(time() * 1000)) - start
            logger.info(
                f"{func.__name__} - execution time: {end_ if end_ > 0 else 0} ms"
            )

    return _time_it


class TimerError(Exception):
    """A custom exception used to report errors in use of Timer class"""


@dataclass
class Timer:
    """Timer to be used when timing something else than functions
    usage: time
    with Timer(logger=logger, text="Doing stuff")
        ...do stuff

    see also https://realpython.com/python-timer/

    """

    timers: ClassVar[Dict[str, float]] = dict()
    name: Optional[str] = None
    logger: Optional[Callable[[str], None]] = print
    _start_time: Optional[float] = field(default=None, init=False, repr=False)

    def __init__(self, logger, text="") -> None:
        self.logger = logger
        self.text = text

    def create_text(self, text, elapsed):
        return f"{text} - Elapsed time: {elapsed:0.4f} seconds"

    def __post_init__(self) -> None:
        """Add timer to dict of timers after initialization"""
        if self.name is not None:
            self.timers.setdefault(self.name, 0)

    def start(self) -> None:
        """Start a new timer"""
        if self._start_time is not None:
            raise TimerError(f"Timer is running. Use .stop() to stop it")

        self._start_time = perf_counter()

    def stop(self) -> float:
        """Stop the timer, and report the elapsed time"""
        if self._start_time is None:
            raise TimerError(f"Timer is not running. Use .start() to start it")

        # Calculate elapsed time
        elapsed_time = perf_counter() - self._start_time
        self._start_time = None

        # Report elapsed time
        if self.logger:
            self.logger.info(self.create_text(self.text, elapsed_time))
        if self.name:
            self.timers[self.name] += elapsed_time

        return elapsed_time

    def __enter__(self):
        """Start a new timer as a context manager"""
        self.start()
        return self

    def __exit__(self, *exc_info):
        """Stop the context manager timer"""
        self.stop()

Be aware that the examples use a ‘logger’ class. You get a logger fromthe python standard library, something like

import logging
logger = logging.getLogger("timing_logger")

will do the trick I think. Otherwise you can replace the logging with print statements or perhaps even st.write statements

Best regards,
Peter

2 Likes

If I had to guess, I suspect this operation might be more expensive than you think. I would think this would create a lot of internal copies (this is a pandas quirk), which could be using more time and memory than your original data would suggest.

For 15-20MB parquet files, I would expect there would be near instant load times. @PeterT 's post about timing individual parts seems like a good next step, to really understand the performance characteristics at play here.

Best,
Randy