Using st.cache_data with an “expensive” function:
1st run time: ~5 sec
2nd run time: ~5 sec
2nd run time expected: 0 sec
Each time a new “Project” is chosen, the function does not run, which indicates that the caching is working as expected. Yet the “RUNNING” indicator in the top right takes just as long as it did on the initial run.
Steps to reproduce
Code snippet:
import streamlit as st
import time

@st.cache_data
def expensive_op():
    print('Function runs')
    my_list = []
    for x in range(1, 80_000_000, 1):
        my_list.append(x)
    return my_list

start_time = time.time()
my_variable = expensive_op()
box = st.selectbox('Project', ['project_1', 'project_2', 'project_3'], key='project')
print("Dataset Loading Time", time.strftime("%H:%M:%S", time.gmtime(time.time() - start_time)))
cache_data relies on the function's parameters to decide whether to return the cached result or rerun the function: a new parameter value triggers a fresh run.
E.g.:
@st.cache_data
def expensive_op():
The function above takes no parameters, so after the first run the cached result will always be returned.
You may not notice this behavior when running the app locally, only after deployment.
To make cache_data work the way you expect, give your function an argument.
E.g.:
@st.cache_data
def expensive_op(your_arg):
With the demo above, cache_data will rerun the function and produce new data whenever your_arg changes.
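For instance, passing the selectbox value into the function makes it part of the cache key. A minimal sketch of that idea (not the original app; it reuses the names from the snippet above):

import streamlit as st

@st.cache_data
def expensive_op(project):
    # Runs once per distinct value of `project`; repeated selections hit the cache.
    return [x for x in range(1, 80_000_000)]

box = st.selectbox('Project', ['project_1', 'project_2', 'project_3'], key='project')
my_variable = expensive_op(box)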
Unfortunately I cannot share my original code due to privacy issues (which is why I made up the “expensive function” above), but I did solve my problem, and here is what I learned/did:
My original code had three cached functions, two of which depended on the first one.
For example:
When I first launched the app, function_one() would take ~30 seconds, and then function_two() and function_three() would each take ~5 seconds (obviously these are not the functions I am actually using). Then each time a widget changed (radio button pressed, selectbox changed, etc.), function_one() would take nearly 0 sec, while function_two() and function_three() each took roughly as long as on the initial launch (~5 sec).
I believe this was due to passing a cached variable into another cached function. I ended up combining the three functions into one, and that solved the problem.
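A minimal sketch of that pattern (stand-in function bodies, not the original code). One plausible explanation: st.cache_data builds its cache key by hashing the function's arguments, so passing a very large cached object into another cached function means re-hashing it on every rerun, even on a cache hit.

import streamlit as st

@st.cache_data
def function_one():
    # Stand-in for the slow initial load (~30 s in the original app).
    return list(range(80_000_000))

@st.cache_data
def function_two(data):
    # Even on a cache hit, st.cache_data hashes `data` to build the cache
    # key, and hashing an 80M-element list takes seconds on every rerun.
    return sum(data)

data = function_one()        # fast after the first run
result = function_two(data)  # slow on every rerun: the argument is re-hashed

Combining everything into one cached function avoids passing the large object around, so nothing big has to be hashed on reruns.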
I wonder about that too. I notice that if I make the function return a pandas Series instead of a list (that is, return pd.Series(my_list) instead of just return my_list), then the first call takes much more time (something like 20 seconds IIRC), but after that it was almost instantaneous.
Data cached with cache_data must be serialized on cache misses and deserialized on cache hits. It is only natural that different objects take different times to serialize and deserialize, and that this has an impact on performance. But I cannot tell for sure whether that is what we are seeing here.
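Since cache_data pickles return values, a quick way to get a feel for that cost outside Streamlit is to time the pickle round trip directly (an illustrative sketch, not a benchmark from this thread; sizes are arbitrary):

import pickle
import time
import pandas as pd

my_list = list(range(10_000_000))

for name, obj in [('list', my_list), ('Series', pd.Series(my_list))]:
    start = time.time()
    blob = pickle.dumps(obj)   # roughly what a cache miss pays
    mid = time.time()
    pickle.loads(blob)         # roughly what a cache hit pays
    end = time.time()
    print(f'{name}: dumps {mid - start:.2f}s, loads {end - mid:.2f}s, {len(blob) / 1e6:.0f} MB')

A Series pickles as one compact numeric buffer, while a plain list of ints is pickled element by element, which would fit the timings described above.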