Using st.cache_data with an “expensive” function:
1st run time: ~5 sec
2nd run time: ~5 sec
2nd run time expected: 0 sec
Each time a new “Project” is chosen, the function does not re-run, which indicates that the caching is working as expected. Yet the “RUNNING” indicator in the top right takes just as long as on the initial run.
Steps to reproduce
import time

import streamlit as st

@st.cache_data
def expensive_op():
    my_list = []
    for x in range(1, 80_000_000, 1):
        my_list.append(x)
    return my_list

box = st.selectbox('Project', ['project_1', 'project_2', 'project_3'], key='project')
start_time = time.time()
my_variable = expensive_op()
print("Dataset Loading Time", time.strftime("%H:%M:%S", time.gmtime(time.time() - start_time)))
When I first launch the app, function_one() takes ~30 seconds and then function_two() and function_three() each take ~5 seconds (obviously these are not the functions I am actually using). Then, each time a widget was changed (radio button pressed, selectbox changed, etc.), function_one() took nearly 0 sec, while function_two() and function_three() both took roughly as long as on the initial launch (5 sec each).
I believe this was due to passing one cached function's result into another cached function. I ended up just combining the three functions into one, and that solved the problem.
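One way to see why chaining cached functions stays slow: st.cache_data pickles results and hands back a fresh deserialized copy on every hit, and it also hashes inputs by value. A toy stand-in decorator (not Streamlit's actual implementation; function bodies are small placeholders for the real ~30 s and ~5 s work) makes the cost visible:

```python
import pickle

def mini_cache(func):
    """Toy stand-in for st.cache_data (NOT the real implementation):
    pickle the result on a miss, unpickle a fresh copy on every hit."""
    store = {}
    def wrapper(*args):
        key = pickle.dumps(args)  # inputs hashed by value, so big args are re-pickled every call
        if key not in store:
            store[key] = pickle.dumps(func(*args))  # miss: run once, then serialize
        return pickle.loads(store[key])             # hit: deserialize on every call
    return wrapper

@mini_cache
def function_one():
    return list(range(1_000))  # placeholder for the ~30 s load

@mini_cache
def function_two(data):
    # On every rerun, `data` is re-pickled just to compute the cache key,
    # and the cached result is re-unpickled: the "hit" is not free.
    return [x * 2 for x in data]

@mini_cache
def combined():
    # Combined version: the big intermediate never crosses a cache boundary.
    return [x * 2 for x in list(range(1_000))]

assert function_two(function_one()) == combined()
```

With the chained version, each rerun pays serialization costs proportional to the size of the intermediate object; combining the steps means only the final result is ever pickled.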
I wonder that too. I noticed that if I make the function return a pandas Series instead of a list (that is, return pd.Series(my_list) instead of just return my_list), then the first call takes far longer (something like 20 seconds, IIRC), but after that it is almost instantaneous.
Data cached with cache_data must be serialized on cache misses and deserialized on cache hits. Different objects naturally take different amounts of time to serialize and deserialize, and that has an impact on performance. But I cannot tell for sure whether that is what we are seeing here.
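You can estimate that serialization cost outside Streamlit with plain pickle (which cache_data uses under the hood). A quick sketch, scaled down from 80M to 1M elements so it runs in seconds, comparing a list against a pandas Series built from it:

```python
import pickle
import time

import pandas as pd

def roundtrip(obj):
    """Time one serialize (cache miss) and one deserialize (cache hit)."""
    t0 = time.perf_counter()
    blob = pickle.dumps(obj)           # cost paid once, on the miss
    t1 = time.perf_counter()
    pickle.loads(blob)                 # cost paid on every hit
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1

my_list = list(range(1_000_000))       # scaled down from 80M for a quick check
list_dump, list_load = roundtrip(my_list)
series_dump, series_load = roundtrip(pd.Series(my_list))

print(f"list:   dump {list_dump:.3f}s  load {list_load:.3f}s")
print(f"series: dump {series_dump:.3f}s  load {series_load:.3f}s")
```

A Series pickles to a compact NumPy buffer, so its load side tends to be much cheaper than rebuilding a million Python ints, which would be consistent with the "slow first call, instant afterwards" behavior described above.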