Concurrency in an expensive cached dataclass

Hi there,

I am using Streamlit to create a dashboard with plots built from several dataframes. The app computes these dataframes from the values of some text widgets inside a Streamlit form and stores them in a dataclass. The computation is very expensive, so I cache the resulting dataclass with Streamlit's cache. Below is a minimal snippet that is enough to reproduce the error and give an idea of the context.

import time
from dataclasses import dataclass

import pandas as pd
import streamlit as st


@dataclass
class GroupOfDataFrames:
    df1: pd.DataFrame
    df2: pd.DataFrame
    df3: pd.DataFrame


@st.cache_data(show_spinner=False, max_entries=10)
def get_data(parameter_1: str, parameter_2: str) -> GroupOfDataFrames:
    df1 = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
    df2 = pd.DataFrame({"c": [7, 8, 9], "d": [10, 11, 12]})
    df3 = pd.DataFrame({"e": [13, 14, 15], "f": [16, 17, 18]})

    time.sleep(60)
    return GroupOfDataFrames(df1, df2, df3)


with st.form(key="my_form"):
    parameter_1 = st.text_input("Parameter 1", value="a")
    parameter_2 = st.text_input("Parameter 2", value="b")
    submit_button = st.form_submit_button(label="Submit")

    if submit_button:
        data = get_data(parameter_1, parameter_2)
    else:
        st.stop()


st.write("Dataframe 1")
st.dataframe(data.df1)
st.write("Dataframe 2")
st.dataframe(data.df2)
st.write("Dataframe 3")
st.dataframe(data.df3)

The snippet runs smoothly when a single user interacts with the app. The issue appears when one user is waiting for a result and another user queries the same data (submits the form with the same parameters). To reproduce the error, launch the app and open it in two tabs. Submit the form with the default parameters in one tab, then click Submit in the second tab with the exact same parameters while the first run is still in progress.

The traceback I am getting is:

UnserializableReturnValueError: Cannot serialize the return value (of type __main__.GroupOfDataFrames) in get_data(). st.cache_data uses pickle to serialize the function’s return value and safely store it in the cache without mutating the original object. Please convert the return value to a pickle-serializable type. If you want to cache unserializable objects such as database connections or Tensorflow sessions, use st.cache_resource instead (see our docs for differences).
Traceback:

File "/path-to-project/.venv/lib/python3.10/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 534, in _run_script
    exec(code, module.__dict__)
File "/path-to-project/streamlit_test.py", line 31, in <module>
    data = get_data(parameter_1, parameter_2)
File "/path-to-project/.venv/lib/python3.10/site-packages/streamlit/runtime/caching/cache_utils.py", line 212, in wrapper
    return cached_func(*args, **kwargs)
File "/path-to-project/.venv/lib/python3.10/site-packages/streamlit/runtime/caching/cache_utils.py", line 243, in __call__
    return self._get_or_create_cached_value(args, kwargs)
File "/path-to-project/.venv/lib/python3.10/site-packages/streamlit/runtime/caching/cache_utils.py", line 267, in _get_or_create_cached_value
    return self._handle_cache_miss(cache, value_key, func_args, func_kwargs)
File "/path-to-project/.venv/lib/python3.10/site-packages/streamlit/runtime/caching/cache_utils.py", line 343, in _handle_cache_miss
    raise UnserializableReturnValueError

I am using st.cache_data since GroupOfDataFrames is serializable. In any case, st.cache_resource does not work either.
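To back up the claim that the dataclass is serializable, here is a quick sanity-check sketch (plain Python, outside Streamlit) that round-trips it through pickle, which is what st.cache_data uses under the hood:

import pickle
from dataclasses import dataclass

import pandas as pd


@dataclass
class GroupOfDataFrames:
    df1: pd.DataFrame
    df2: pd.DataFrame
    df3: pd.DataFrame


# The pickle round trip succeeds, so the return value itself is serializable.
group = GroupOfDataFrames(
    pd.DataFrame({"a": [1, 2, 3]}),
    pd.DataFrame({"c": [7, 8, 9]}),
    pd.DataFrame({"e": [13, 14, 15]}),
)
restored = pickle.loads(pickle.dumps(group))
assert restored.df1.equals(group.df1)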

The app runs as a service in a Kubernetes cluster, but the error is reproducible locally.

I am using python version 3.10.8 and streamlit 1.29.0.

Could you please help me?

I was able to reproduce your issue (Python 3.11.7, streamlit 1.30.0) and it looks like a bug to me.

[Screenshot: serializable_bug]

I found that, with or without the dataclass decorator, your custom class is still serializable and it does get stored in the cache. It is not a problem with pandas dataframes either; the contents of the custom class can be anything and the exception is still raised.


Smaller version
import time
import streamlit as st


class GroupOfData:
    def __init__(self, df1, df2, df3) -> None:
        self.df1 = df1
        self.df2 = df2
        self.df3 = df3


@st.cache_data(show_spinner=False, max_entries=10)
def get_data(parameter_1: str, parameter_2: str) -> GroupOfData:
    time.sleep(10)
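    # Contents are deliberately not DataFrames; the exception is raised regardless.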
    return GroupOfData(1.0, "2", 3)


with st.form(key="my_form"):
    parameter_1 = st.text_input("Parameter 1", value="a")
    parameter_2 = st.text_input("Parameter 2", value="b")
    submit_button = st.form_submit_button(label="Submit")

    if submit_button:
        data = get_data(parameter_1, parameter_2)

        st.write("Dataframe 1")
        st.write(data.df1)
        st.write("Dataframe 2")
        st.write(data.df2)
        st.write("Dataframe 3")
        st.write(data.df3)

@jordim I will give it a try using st.cache_resource instead, since GroupOfDataFrames is slightly different from your traditional dataframe. I tried it and was able to bypass the error with it.
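For reference, the change in the original snippet is just the decorator; a minimal sketch of what that swap looks like (reusing the same dataclass, everything else untouched):

import time
from dataclasses import dataclass

import pandas as pd
import streamlit as st


@dataclass
class GroupOfDataFrames:
    df1: pd.DataFrame
    df2: pd.DataFrame
    df3: pd.DataFrame


# st.cache_resource stores the returned object itself instead of pickling it,
# so the UnserializableReturnValueError path is never hit.
@st.cache_resource(show_spinner=False, max_entries=10)
def get_data(parameter_1: str, parameter_2: str) -> GroupOfDataFrames:
    df1 = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
    df2 = pd.DataFrame({"c": [7, 8, 9], "d": [10, 11, 12]})
    df3 = pd.DataFrame({"e": [13, 14, 15], "f": [16, 17, 18]})

    time.sleep(60)
    return GroupOfDataFrames(df1, df2, df3)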

@CarlosSerrano, @edsaac thank you both for the replies. I tried st.cache_resource in my app and it does seem to work. What are the implications in terms of memory and performance? Will I need to increase the specs of the server the app runs on?

From the docs it seemed clear to me that I had to use st.cache_data.

I think it depends on whether you expect users to mutate the GroupOfDataFrames objects, because st.cache_resource stores a single copy that is shared across all users and sessions. If a cached object is modified in place, the modified version stays in the cache and is what later users who hit the same cache entry will see.

[Screenshot: serializable_resource]
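To make that caveat concrete, here is a small sketch of the failure mode and a defensive copy as one way around it (the added column is purely illustrative, not something from your app):

import pandas as pd
import streamlit as st


@st.cache_resource
def get_shared_df() -> pd.DataFrame:
    # Built once; the very same object is handed to every session and rerun.
    return pd.DataFrame({"a": [1, 2, 3]})


df = get_shared_df()

# Mutating the cached object directly would persist across reruns and users:
# df["b"] = df["a"] * 2   # every later caller would now see column "b"

# Work on a defensive copy instead, so the cached original stays untouched.
local_df = df.copy()
local_df["b"] = local_df["a"] * 2
st.dataframe(local_df)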