How to cache multiple datasets?

Hey Streamlit community!

First of all, Streamlit is amazing; the features are great and so straightforward.

I am not too familiar with caching, so maybe someone could shed some wisdom on my problem.
I am trying to cache 3 different datasets that my dashboard needs to run. They are all static and only need to be loaded once. I am doing it this way:

import pandas as pd
import streamlit as st

@st.cache
def load_data1():
    data1 = pd.read_csv("data1.csv")
    return data1

@st.cache
def load_data2():
    data2 = pd.read_csv("data2.csv")
    return data2

@st.cache
def load_data3():
    data3 = pd.read_csv("data3.csv")
    return data3

data1 = load_data1()
data2 = load_data2()
data3 = load_data3()

When I run this code, however, I get the error: CachedObjectMutationWarning: Return value of load_data2() was mutated between runs.

I have no idea what I’m doing wrong, as I thought that these dataframes are static and don’t change, so they should be fine to cache? If someone can help me out I would appreciate it greatly!

Thank you

I have just been doing a load_static_data func which reads in and returns all the static data. So similar to what you're doing, except in one func which returns a few values.
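Roughly like this, as a minimal sketch (the file names are just placeholders for your own data):

import pandas as pd
import streamlit as st

@st.cache
def load_static_data():
    # read everything once; all three frames get cached together
    data1 = pd.read_csv("data1.csv")
    data2 = pd.read_csv("data2.csv")
    data3 = pd.read_csv("data3.csv")
    return data1, data2, data3

data1, data2, data3 = load_static_data()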

I did this but I keep getting the mutation error. Really frustrating, no clue what's wrong.

I think I had this problem too - I believe what is happening:

When you load a pandas df and then make changes to it, the original version of the df that was read in gets changed, and the @st.cache-decorated function sees that the original df it returned has been changed and raises an error.

I think the cache is trying to return the original dataframe as loaded, without reading it again from disk, so when you change the df the cached function either has to reread it from disk (negating the cache) or return the changed copy (not ideal).

I believe what I did was use df = df_original.copy() somewhere and change the copy. This kept the cached function happy.
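Something like this, as a rough sketch (the column names are made up):

df_original = load_data1()           # cached dataframe, leave it untouched
df = df_original.copy()              # work on a copy instead
df["new_col"] = df["some_col"] * 2   # mutating the copy keeps the cached object intact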

The other thing I faced is that, since all the code gets rerun after every user interaction, I had to throw in some logic to deal with that.

Hi both of you, and welcome to the forums @ryandaher :slight_smile: !

As a prerequisite: Streamlit's caching mechanics revolve around comparing the cached function's input parameters, its function body, and the external functions/variables it uses to decide whether it can return the cached value.
So if you call load_data1() multiple times, you should always expect the same result, since nothing about this function has changed.
As such, Streamlit can just cache the result and return it to you each time you rerun the app (which happens any time you interact with the app through sliders, widgets, etc.). So in your Streamlit app, as long as you don't change the input parameters or the function body, you can expect to get the same result.

Now, if you mutate the value of the output dataframe outside the cached function, Streamlit warns you with CachedObjectMutationWarning. It does this to make sure you understand that the data you display is no longer exactly the cached data but a mutated version of it, even though you did not change any input parameters or external values (which is contrary to what Streamlit wants you to be able to assume).

import pandas as pd
import streamlit as st

@st.cache
def load_data1():
    data1 = pd.read_csv("data/occupancy.csv")
    return data1

@st.cache
def load_data2():
    data2 = pd.read_csv("data/occupancy.csv")
    return data2

@st.cache
def load_data3():
    data3 = pd.read_csv("data/occupancy.csv")
    return data3

data1 = load_data1()
data2 = load_data2()
data3 = load_data3()

#data2["hello"] = "hi" # <--- uncomment that to get your warning. 

# If you use your cached dataframe after this point,
# keep in mind you are no longer using the cached data as expected :/
# better to copy it into another variable first

In a more complex app, where you may mutate the result in multiple code paths depending on user behavior, or where some object mutates your returned cached data in different ways (TensorFlow or spaCy can do this in specific use cases), this warning makes things easier to debug.

If you understand exactly how you mutate the data in your logic, tell Streamlit you're the boss by replacing @st.cache with @st.cache(allow_output_mutation=True) and continue coding happily.
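For example, a minimal sketch based on the code above:

@st.cache(allow_output_mutation=True)
def load_data2():
    data2 = pd.read_csv("data/occupancy.csv")
    return data2

data2 = load_data2()
data2["hello"] = "hi"  # no CachedObjectMutationWarning anymore, Streamlit trusts you here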

@khalido's method of deep-copying the dataframe and graphing a transformed/filtered version of it is probably the better solution, since that way you can count on the loaded dataframe not being mutated by another part of the bigger app. And if you don't have too many input parameters, you may even put that step into another cached function like transform_data(df, params): ....
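As a rough sketch of that idea (the threshold parameter and column name are made up for illustration):

@st.cache
def transform_data(df, threshold):
    # copy first so the cached input dataframe is never mutated
    out = df.copy()
    return out[out["Temperature"] > threshold]

filtered = transform_data(data1, threshold=20)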

Can you tell us more about this? We can then check whether some advanced Streamlit caching techniques can help you maintain this logic!

You can read more about it in the Streamlit caching docs on mutated values and the techniques to solve it.
