How to cache multiple datasets?

Hi both of you, and welcome to the forums @ryandaher :slight_smile: !

As a prerequisite, Streamlit's caching mechanism works by comparing the cached function's input parameters, its body, and the external functions/variables it references to decide whether to return the cached value.
So if you call load_data1() multiple times, you should always expect the same result, since nothing about this function has changed.
Streamlit can therefore just cache the result and return it back to you on every rerun of the app (which happens any time you interact with it through sliders, widgets, etc.). In short, as long as you don't change the input parameters or the function body, you get the same cached result.
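For instance, here is a minimal sketch of a cached function with an input parameter (the nrows argument and the CSV path are just placeholders); it only re-runs when that parameter changes:

import pandas as pd
import streamlit as st

@st.cache
def load_data(nrows):
    # Re-executed only when `nrows` changes; otherwise served from the cache.
    return pd.read_csv("data/occupancy.csv", nrows=nrows)

df_small = load_data(100)   # first call: runs the function and caches the result
df_small = load_data(100)   # same parameter: returned straight from the cache
df_large = load_data(1000)  # different parameter: runs the function again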

Now if you mutate the value of the returned dataframe outside the cached function, Streamlit warns you with a CachedObjectMutationWarning, so that you understand the data you are displaying is no longer exactly what is in the cache but a mutated version of it, even though you did not change any input parameters or external values (which goes against the assumption Streamlit wants you to be able to make).

import pandas as pd
import streamlit as st

@st.cache
def load_data1():
    data1 = pd.read_csv("data/occupancy.csv")
    return data1

@st.cache
def load_data2():
    data2 = pd.read_csv("data/occupancy.csv")
    return data2

@st.cache
def load_data3():
    data3 = pd.read_csv("data/occupancy.csv")
    return data3

data1 = load_data1()
data2 = load_data2()
data3 = load_data3()

#data2["hello"] = "hi" # <--- uncomment this to get your warning.

# If you use your cached dataframe after this point,
# you are no longer using the cached data as expected :/
# better to copy it into another variable first.

In a more complex app, where you may mutate the result in multiple code paths depending on user behavior, or where some object mutates your returned cached data in different ways (TensorFlow or spaCy can do this in specific use cases), this warning makes debugging much easier.

If you know exactly how you mutate the data in your logic, tell Streamlit you're the boss by replacing @st.cache with @st.cache(allow_output_mutation=True) and continue coding happily.
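A minimal sketch of that, reusing the same hypothetical CSV as above:

import pandas as pd
import streamlit as st

@st.cache(allow_output_mutation=True)
def load_data2():
    # Streamlit no longer checks this return value for mutations,
    # so modifying it later will not raise CachedObjectMutationWarning.
    return pd.read_csv("data/occupancy.csv")

data2 = load_data2()
data2["hello"] = "hi"  # no warning, but every rerun now sees the mutated cached object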

@khalido's method of deep-copying a transformed/filtered version of the dataframe before graphing it is probably the better solution, since that way you can count on the loaded dataframe not being mutated by another part of a bigger app. And if you don't have too many input parameters, you may even put that in another cached function, like transform_data(df, params):.... See the sketch below.
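Something along these lines, where transform_data, the Temperature column and the threshold parameter are purely illustrative:

import pandas as pd
import streamlit as st

@st.cache
def load_data():
    return pd.read_csv("data/occupancy.csv")

@st.cache
def transform_data(df, threshold):
    # Work on a copy so the cached `load_data` result is never mutated.
    filtered = df[df["Temperature"] > threshold].copy()
    filtered["above_threshold"] = True
    return filtered

data = load_data()
plot_ready = transform_data(data, 21.0)
st.line_chart(plot_ready["Temperature"])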

Can you tell us more about this? We can then check whether some advanced Streamlit caching techniques can help you maintain this logic!

You can read more about it in the Streamlit caching docs on mutated values and the techniques to solve it.
