Using cache_resource() for large dataframes

Summary

I’m building an app that lets users look up the schedule for a sports tournament. The data is read-only (i.e., no transformations are applied to it) and it’s gathered from Google Sheets.

Right now I have one function that gets the Google Sheets connection (using cache_resource()) and another function that gets the data (using cache_data()).

import gspread
import logging
import streamlit as st

logger = logging.getLogger(__name__)

@st.cache_resource(show_spinner=False)
def get_google_sheet_connection():
    logger.info("Getting google connection!")
    gc = gspread.service_account_from_dict(credentials)  # credentials: service-account info dict defined elsewhere
    sh = gc.open('spreadsheet')
    return sh

@st.cache_data(ttl=30, show_spinner=False)
def get_data():
    <code here...>
    return df

The problem is that I have a lot of concurrent users. When 50-100 people use the app at the same time, even though the connection is cached, get_data still gets called at least once from every new session, and that is using too many Google Cloud API resources.

Is it good practice to use cache_resource() for a dataframe, so that I have a singleton of that dataset shared across all sessions, users, and reruns?

@st.cache_resource(ttl=30, show_spinner=False)
def get_data():
    <code here...>
    return df

Thank you so much!

Hi @Ricardo_Recarey,

Thanks for posting!

You can definitely use st.cache_resource for large data. Because it returns the cached object itself rather than a copy, it is faster and lighter on memory than st.cache_data, but you must make sure the cached object is safe to share across sessions and threads.
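
As a minimal sketch of that pattern (reusing your get_google_sheet_connection() from above and assuming a hypothetical worksheet named "schedule"):

import pandas as pd
import streamlit as st

@st.cache_resource(ttl=30, show_spinner=False)
def get_data() -> pd.DataFrame:
    # Computed at most once per TTL window for the whole server process;
    # every session and rerun gets the same DataFrame object back.
    sh = get_google_sheet_connection()
    records = sh.worksheet("schedule").get_all_records()  # hypothetical worksheet name
    return pd.DataFrame(records)

df = get_data()    # all sessions share this exact object
st.dataframe(df)   # read-only display is safe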

Also, because every session gets the same object, any session that mutates it concurrently changes what all other sessions see and can corrupt the data, so treat the cached dataframe as read-only. You can read more on this in our caching docs.
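
If a page does need to filter or otherwise modify the data, have it work on its own copy so the shared object is never touched. A sketch, assuming the get_data() above and a hypothetical "team" column:

import streamlit as st

df = get_data()

# Boolean indexing plus .copy() gives this session its own frame;
# in-place operations on df itself (df.sort_values(..., inplace=True),
# df["new_col"] = ..., etc.) would be visible to every other session.
selected_team = st.selectbox("Team", sorted(df["team"].unique()))  # hypothetical "team" column
schedule = df[df["team"] == selected_team].copy()
st.dataframe(schedule)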