Unexpected st.cache_data behavior (not caching)

Summary

Using st.cache_data with an “expensive” function:
  • 1st run time: ~5 sec
  • 2nd run time: ~5 sec
  • 2nd run time expected: 0 sec

Each time a new “Project” is chosen, the function does not run, which indicates that the caching is working as expected. Yet the top-right “RUNNING” indicator takes just as long as the initial run.

Steps to reproduce

Code snippet:

import streamlit as st
import time

@st.cache_data
def expensive_op():
    print('Function runs')  # printed only on a cache miss
    # Build a large list to simulate an expensive computation.
    my_list = []
    for x in range(1, 80_000_000):
        my_list.append(x)
    return my_list

start_time = time.time()
my_variable = expensive_op()
box = st.selectbox('Project', ['project_1', 'project_2', 'project_3'], key='project')
print("Dataset Loading Time", time.strftime("%H:%M:%S", time.gmtime(time.time() - start_time)))

Debug info

  • Streamlit version: 1.24.0
  • Python version: 3.10
  • Using Conda

Links

Similar issue found here:

st.cache_data uses the function’s parameters to decide whether to return a cached result or rerun the function for new data, based on the values of those parameters.
E.g:

@st.cache_data
def expensive_op():

The function above takes no parameters; in this case, the cached result will always be returned.
You might not notice this behavior when running the app locally until after deployment.

To make cache_data work properly, pass your function an argument.
E.g:

@st.cache_data
def expensive_op(your_arg):

With the demo above, cache_data will rerun the function for new data whenever your_arg changes.
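
For illustration, here is a minimal sketch of that behavior (the function and argument names are made up, not from the original post):

import streamlit as st
import time

@st.cache_data
def load_project_data(project_name):
    time.sleep(5)  # stand-in for an expensive computation
    return f'data for {project_name}'

box = st.selectbox('Project', ['project_1', 'project_2', 'project_3'])
# Selecting a project for the first time takes ~5 s (cache miss);
# re-selecting a previously chosen project returns almost instantly (cache hit).
data = load_project_data(box)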

I ran your code with Streamlit 1.25.0 and it printed 00:00:06 on the first run and 00:00:03 on subsequent runs. The cache is definitely working.

Interesting. So why is there still a 3-second load, though? Is that just due to the size of the data being loaded?

Unfortunately I cannot share my original code due to privacy concerns (which is why I made up the “expensive function” in the original post). I did solve my problem, though, and here is what I learned/did:

My original code had three cached functions, two of which depended on the first one.
For example:

@st.cache_data
def function_one():
    some_data = 'x.y.z'
    return some_data

@st.cache_data
def function_two(var):
    my_split = var.split('.')
    return my_split

@st.cache_data
def function_three(var):
    my_split_join = ' '.join(var.split('.'))
    return my_split_join


func_1 = function_one()
func_2 = function_two(func_1)
func_3 = function_three(func_1)

When I first launched the app, function_one() would take ~30 seconds, and then function_two() and function_three() would each take ~5 seconds (obviously these are not the actual functions I am using). Then each time a widget changed (radio button pressed, selectbox changed, etc.), function_one() took nearly 0 seconds, while function_two() and function_three() each took roughly the same time as on the initial launch (~5 sec).

I believe this was due to passing a cached variable into another cached function. I ended up just combining the three functions, and that solved the problem.

@st.cache_data
def function_one():
    some_data = 'x.y.z'
    my_split = some_data.split('.')
    my_split_join = ' '.join(some_data.split('.'))
    return some_data, my_split, my_split_join

func_1, func_2, func_3 = function_one()
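
As an aside (this is standard st.cache_data behavior, not something from the original post): if combining the functions had not been an option, Streamlit also lets you exclude an argument from the cache key by prefixing its name with an underscore, so a huge input is not hashed on every rerun. A sketch, assuming the excluded argument does not change during a session:

@st.cache_data
def function_two(_var):
    # The leading underscore tells Streamlit not to hash _var when computing
    # the cache key, so passing a large object here stays cheap. The trade-off
    # is that changes to _var alone will no longer invalidate the cache.
    return _var.split('.')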

I wonder that too. I noticed that if I make the function return a Series instead of a list (that is, return pd.Series(my_list) instead of just return my_list), then the first call takes much more time (something like 20 seconds, IIRC) but after that it is almost instantaneous.

Data cached with cache_data must be serialized on cache misses and deserialized on cache hits. It is only natural that different objects take different times to serialize and deserialize, and that this has an impact on performance. But I cannot tell for sure whether that is what we are seeing here.
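
A rough way to see the serialization cost outside Streamlit (st.cache_data pickles return values internally, so plain pickle timings are only an approximation; the list is also scaled down from the original 80 million elements to keep it quick):

import pickle
import time

import pandas as pd

my_list = list(range(1, 10_000_000))

for label, obj in [('list', my_list), ('Series', pd.Series(my_list))]:
    start = time.time()
    blob = pickle.dumps(obj)   # roughly what happens on a cache miss
    dump_time = time.time() - start
    start = time.time()
    pickle.loads(blob)         # roughly what happens on a cache hit
    load_time = time.time() - start
    print(f'{label}: dump {dump_time:.2f}s, load {load_time:.2f}s')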
