Unexpected st.cache_data behavior (not caching)

Summary

Using st.cache_data with an “expensive” function:
  • 1st run time: ~5 sec
  • 2nd run time: ~5 sec
  • 2nd run time expected: 0 sec

Each time a new “Project” is chosen, the function does not run, which indicates that the caching is working as expected. Yet the top-right “RUNNING” indicator takes just as long as the initial run.

Steps to reproduce

Code snippet:

import streamlit as st
import time

@st.cache_data
def expensive_op():
    print('Function runs')  # printed only on a cache miss
    # Build a large list to simulate an expensive computation.
    my_list = []
    for x in range(1, 80_000_000):
        my_list.append(x)
    return my_list

start_time = time.time()
my_variable = expensive_op()
box = st.selectbox('Project', ['project_1', 'project_2', 'project_3'], key='project')
print("Dataset Loading Time", time.strftime("%H:%M:%S", time.gmtime(time.time() - start_time)))

Debug info

  • Streamlit version: 1.24.0
  • Python version: 3.10
  • Using Conda

Links

Similar issue found here:

st.cache_data uses the function’s parameters to decide whether to return a cached result or rerun the function for new data, based on the values of those parameters.
E.g:

@st.cache_data
def expensive_op():

The function above takes no parameters; in this case, the cached result will always be returned.
You might not notice this behavior when running the app locally until after deployment.

To make cache_data work properly, pass your function an argument.
E.g:

@st.cache_data
def expensive_op(your_arg):

With the demo above, cache_data will rerun the function for new data whenever your_arg changes.
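
For illustration, here is a minimal sketch of that behavior (the function and argument names are made up, not from the original post):

import streamlit as st
import time

@st.cache_data
def load_project_data(project_name):
    time.sleep(5)  # stand-in for an expensive computation
    return f'data for {project_name}'

box = st.selectbox('Project', ['project_1', 'project_2', 'project_3'])
# Selecting a project for the first time takes ~5 s (cache miss);
# re-selecting a previously chosen project returns almost instantly (cache hit).
data = load_project_data(box)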

I ran your code with Streamlit 1.25.0 and it printed 00:00:06 on the first run and 00:00:03 on subsequent runs. The cache is definitely working.

Interesting. So why is there still a 3-second load, though? Is that just due to the size of the data being loaded?

Unfortunately I cannot share my original code due to privacy concerns (which is why I made up the “expensive function” in the original post). I did solve my problem, though, and here is what I learned/did:

My original code had three cached functions, two of which depended on the first one.
For example:

@st.cache_data
def function_one():
    some_data = 'x.y.z'
    return some_data

@st.cache_data
def function_two(var):
    my_split = var.split('.')
    return my_split

@st.cache_data
def function_three(var):
    my_split_join = ' '.join(var.split('.'))
    return my_split_join


func_1 = function_one()
func_2 = function_two(func_1)
func_3 = function_three(func_1)

When I first launched the app, function_one() would take ~30 seconds, and then function_two() and function_three() would each take ~5 seconds (obviously these are not the actual functions I am using). Then each time a widget changed (radio button pressed, selectbox changed, etc.), function_one() took nearly 0 seconds, while function_two() and function_three() each took roughly the same time as on the initial launch (~5 sec).

I believe this was due to passing a cached variable into another cached function. I ended up just combining the three functions, and that solved the problem.

@st.cache_data
def function_one():
    some_data = 'x.y.z'
    my_split = some_data.split('.')
    my_split_join = ' '.join(some_data.split('.'))
    return some_data, my_split, my_split_join

func_1, func_2, func_3 = function_one()
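
As an aside (this is standard st.cache_data behavior, not something from the original post): if combining the functions had not been an option, Streamlit also lets you exclude an argument from the cache key by prefixing its name with an underscore, so a huge input is not hashed on every rerun. A sketch, assuming the excluded argument does not change during a session:

@st.cache_data
def function_two(_var):
    # The leading underscore tells Streamlit not to hash _var when computing
    # the cache key, so passing a large object here stays cheap. The trade-off
    # is that changes to _var alone will no longer invalidate the cache.
    return _var.split('.')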

I wonder that too. I noticed that if I make the function return a Series instead of a list (that is, return pd.Series(my_list) instead of just return my_list), then the first call takes much more time (something like 20 seconds, IIRC) but after that it is almost instantaneous.

Data cached with cache_data must be serialized on cache misses and deserialized on cache hits. It is only natural that different objects take different times to serialize and deserialize, and that this has an impact on performance. But I cannot tell for sure whether that is what we are seeing here.
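
A rough way to see the serialization cost outside Streamlit (st.cache_data pickles return values internally, so plain pickle timings are only an approximation; the list is also scaled down from the original 80 million elements to keep it quick):

import pickle
import time

import pandas as pd

my_list = list(range(1, 10_000_000))

for label, obj in [('list', my_list), ('Series', pd.Series(my_list))]:
    start = time.time()
    blob = pickle.dumps(obj)   # roughly what happens on a cache miss
    dump_time = time.time() - start
    start = time.time()
    pickle.loads(blob)         # roughly what happens on a cache hit
    load_time = time.time() - start
    print(f'{label}: dump {dump_time:.2f}s, load {load_time:.2f}s')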
