@st.cache_data VS @st.cache_resource - small issues

Summary

Hi! Now that Streamlit 1.18.1 is released, it’s time to move on to the new caching functions. This should be simple and smooth, but I ran into a couple of problems:

  1. The official tutorial “Connect Streamlit to a public Google Sheet” now uses @st.cache_data, but using it causes an error:
UnserializableReturnValueError: Cannot serialize the return value (of type list) in run_query(). st.experimental_memo uses pickle to serialize the function’s return value and safely store it in the cache without mutating the original object. Please convert the return value to a pickle-serializable type. If you want to cache unserializable objects such as database connections or Tensorflow sessions, use st.experimental_singleton instead (see our docs for differences).

@st.cache_resource works ok. Live example: https://flashcards.streamlit.app/

  2. Both @st.cache_data and @st.cache_resource work well for pandas read_csv:
@st.cache_data
def fetch_data(level_name):
    df = pd.read_csv(level_name, sep=",", header=None)
    return df

What is the benefit of using st.cache_data? At times I have the feeling that the app works better with @st.cache_resource (this, however, may be only my impression). Live example: https://dungeon.streamlit.app/

Thanks for reporting this @TomJohn. Folks on Streamlit engineering are looking into it now.

Oh, to your #2 question - For df = read_csv() unless you have a VERY large data set, it’s definitely more canonical to use st.cache_data(). One of the main reasons is that for cache_resource, any mutation to the function output (like a column transform, or add/edit/remove data) is persisted across app runs and across sessions. For a lot of use cases this is not desired. With cache_data, the function result for a given input is cached and a new, clean copy is provided on every run.

Does this make sense? Some more info at Caching - Streamlit Docs and we have a blog coming out about it tomorrow too.
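
To make the difference concrete, here is a minimal sketch in plain Python. It is not Streamlit's implementation: the two small caches are simplified stand-ins (one stores a pickled copy, like st.cache_data; one stores the object itself, like st.cache_resource), and load_level is a hypothetical loader used only for illustration.

```python
import pickle

_data_cache = {}      # stand-in for st.cache_data: stores pickled bytes
_resource_cache = {}  # stand-in for st.cache_resource: stores the object itself


def load_level(name):
    # Hypothetical loader; a real app would read a CSV file here.
    return [["#", "#", "#"], ["#", ".", "#"]]


def cached_as_data(name):
    if name not in _data_cache:
        _data_cache[name] = pickle.dumps(load_level(name))
    # Unpickling builds a fresh copy on every call.
    return pickle.loads(_data_cache[name])


def cached_as_resource(name):
    if name not in _resource_cache:
        _resource_cache[name] = load_level(name)
    # The very same object is handed back on every call.
    return _resource_cache[name]


# Mutating a "data" result does not touch the cached value.
first = cached_as_data("level1.csv")
first[1][1] = "X"
second = cached_as_data("level1.csv")

# Mutating a "resource" result changes what every later caller sees.
shared = cached_as_resource("level1.csv")
shared[1][1] = "X"
shared_again = cached_as_resource("level1.csv")

print(second[1][1], shared_again[1][1])
```

In this sketch the second "data" call still sees the original tile, while the second "resource" call sees the mutation, which is the behaviour described above.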

Hi @jcarroll Thank you! I think it’s a bit clearer now. In “The Dungeon,” I always want to load level design without any changes, so using “st.cache_data()” is a good choice.

Honestly, I am not sure about this. Assuming you have some kind of “default level data” that ships with your app, meaning every user of the app uses the same data, I’d argue you’d also be fine using st.cache_resource.
The advantage is that st.cache_resource will not create copies of the same object across sessions. However, you as the developer would have to make sure that the data is not mutated, as @jcarroll already pointed out.

Correct me if I’m wrong guys. :slight_smile:

Edit: Even though the use case is different, what I meant is similar to this section in the docs: Caching - Streamlit Docs

Interesting points. Thank you @Wally! I certainly must test what would happen if I used st.cache_resource and

  1. User 1 would trigger fetch_data("level1.csv")
  2. User 2 would trigger fetch_data("level2.csv")

…assuming that I will add new levels soon :grinning:

@TomJohn for the first issue with the Google Sheet example - I found that this was due to the gsheetsdb Rows object that is returned not being serializable. I tested a few other DB API implementations (PostgreSQL, SQLite) and they did not have this issue. It also seems like gsheetsdb is a bit stale, so it may not be the best choice for our example.
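
The thread does not say exactly why the gsheetsdb rows fail to pickle, so the following is only an illustration of one common cause: a row class that is created on the fly and cannot be found by name at module level. run_query here is a hypothetical stand-in, not the tutorial's function, and converting rows to plain dicts is just one possible workaround.

```python
import pickle
from collections import namedtuple


def run_query():
    # Hypothetical stand-in for a driver that builds its row class on the fly.
    # The class is never bound to a module-level name, so pickle cannot look it up.
    Row = namedtuple("Row", ["Question", "Answer"])
    return [Row("What is 2 + 2?", "4")]


rows = run_query()

try:
    pickle.dumps(rows)
    rows_picklable = True
except (pickle.PicklingError, AttributeError):
    rows_picklable = False

# Plain dicts contain only built-in types, which pickle handles fine.
plain_rows = [row._asdict() for row in rows]
restored = pickle.loads(pickle.dumps(plain_rows))

print(rows_picklable, restored[0]["Question"])
```

A cached function that returned something like plain_rows instead of the raw rows would give the cache a value it can serialize.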

I pulled down your flashcards app and was able to get it working with a much simpler pandas pd.read_csv() approach:

import pandas as pd
import streamlit as st

@st.cache_data(ttl=600)
def load_data(sheets_url):
    csv_url = sheets_url.replace('/edit#gid=', '/export?format=csv&gid=')
    return pd.read_csv(csv_url)


# ok let's load the data
questions_df = load_data(st.secrets["public_gsheets_url"])

With this approach you retrieve the values a little differently but it’s pretty close. Use questions_df.iloc[st.session_state.q_no].Question instead of rows[st.session_state.q_no].Question, for example.
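
As a quick illustration of what the URL rewrite in load_data does, here is the same string replacement on its own. The helper name to_csv_export_url and the SHEET_ID placeholder are made up for this sketch and do not point at a real spreadsheet.

```python
def to_csv_export_url(sheets_url):
    # Same string replacement as in load_data: swap the editor path
    # for the CSV export path and keep the gid value.
    return sheets_url.replace("/edit#gid=", "/export?format=csv&gid=")


# SHEET_ID is a placeholder, not a real spreadsheet ID.
share_url = "https://docs.google.com/spreadsheets/d/SHEET_ID/edit#gid=0"
csv_url = to_csv_export_url(share_url)

# A URL without "/edit#gid=" is returned unchanged.
unchanged = to_csv_export_url("https://example.com/data.csv")

print(csv_url)
```

Note that the replacement only fires when the stored URL contains "/edit#gid=", so it is worth checking the exact form of the link saved in the app's secrets.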

I filed a bug to fix the tutorial: Example code in public google sheets tutorial is broken · Issue #589 · streamlit/docs · GitHub

Hi! @jcarroll thanks! Definitely exceeding expectations :slight_smile:

Regarding gsheetsdb: it seems to be deprecated and has been superseded by shillelagh: GitHub - betodealmeida/shillelagh: Making it easy to query APIs via SQL

I'm not sure whether shillelagh solves the problem, though, but it might be worthwhile to update the example in the docs. :slight_smile:

Thanks! I saw this and spent a few minutes trying to install shillelagh in the example app and was unable to get it working - seems like it installed many more dependencies and made the install / usage more complex. Since there was a quick solution with no new dependencies required using pandas, I proposed we just update the example to use that.

Thanks, all! There’s a PR out to fix the issue in the public Google Sheets tutorial :+1: We still have to update the private Google Sheets tutorial to use a gsheetsdb alternative. If you have suggestions in addition to shillelagh, please let me know :smile:

I just updated my code to use the new caching decorator, st.cache_data.