How to refresh cache when a file loaded from a url is updated?

Hello,
The csv file is updated daily and I would like the cache to refresh when it detects a change.
I can access the last update date in the headers but I don’t know how to use it with the hash_funcs argument. I’ve tried some variations of the following:

import pandas as pd
import requests
import streamlit as st

url = 'https://www.data.gouv.fr/fr/datasets/r/83cbbdb9-23cb-455e-8231-69fc25d58111'

class FileReference:
    def __init__(self, url):
        self.url = url

def hash_file_reference(url):
    r = requests.get(url)
    return r.headers['Date']

@st.cache(hash_funcs={FileReference: hash_file_reference})
def load_data():
    global url
    df = pd.read_csv(url)
    ....

Is it possible to use the hasher on a global variable?
I've also tried passing the url as an argument to load_data(), but to be honest I don't really know what I'm doing.

Any help is appreciated.

Hi @lasticot, welcome to the community!

I’ve had a quick look, and my first idea is to drop hash_funcs entirely and instead pass the update date as an argument to the cached function:

import pandas as pd
import requests
import streamlit as st

url = 'https://www.data.gouv.fr/fr/datasets/r/83cbbdb9-23cb-455e-8231-69fc25d58111'

r = requests.get(url)
latest_update = r.headers['Last-Modified']

@st.cache(suppress_st_warning=True)  # <-- suppress_st_warning can go once you remove the st.write debug line below
def load_data(url, date):
    st.write(f"NOTHING IN CACHE FOR {url}/{date}")  # <-- debug only; should not appear when the result comes from cache. Remove before deploying.
    return pd.read_csv(url)

st.dataframe(load_data(url, latest_update))

This way, once a (url, latest_update) → DataFrame result is computed, the DataFrame stays in cache for that pair of inputs until a new latest_update value is passed in.
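If it helps to see that keying behaviour in isolation, here is a minimal sketch using functools.lru_cache from the standard library, which keys on the argument tuple in the same spirit (the URL and dates below are made-up placeholders, not real headers):

```python
from functools import lru_cache

calls = []  # records one entry per cache miss

@lru_cache(maxsize=None)
def load_data(url, date):
    calls.append((url, date))  # only runs when (url, date) is new
    return f"data for {url} @ {date}"

load_data('example.csv', 'Mon, 04 Jan 2021')  # miss -> computed
load_data('example.csv', 'Mon, 04 Jan 2021')  # hit  -> served from cache
load_data('example.csv', 'Tue, 05 Jan 2021')  # new date -> recomputed
print(len(calls))  # 2
```

Same idea with st.cache: a fresh Last-Modified value changes the argument tuple, so the function body reruns exactly once per update.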

NB: Careful in your example: the Date header doesn’t give the latest update date but rather the time the response was generated, so you’ll need to use Last-Modified instead.
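If you ever need to compare those header values rather than just use them as cache keys, the standard library can parse them into datetimes. A small sketch with made-up header strings (real ones come from requests.get(url).headers):

```python
from email.utils import parsedate_to_datetime

# Hypothetical header values for illustration only
headers = {
    'Date': 'Tue, 05 Jan 2021 10:00:00 GMT',           # when the response was generated
    'Last-Modified': 'Mon, 04 Jan 2021 06:30:00 GMT',  # when the file last changed
}

# Parse the HTTP date string into a timezone-aware datetime
last_update = parsedate_to_datetime(headers['Last-Modified'])
print(last_update.isoformat())  # 2021-01-04T06:30:00+00:00
```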

If it’s a very long-running app, I would also suggest st.cache(max_entries=10) to cap the cache at 10 entries, evicting the oldest ones first.


Now, if you want to keep using hash_funcs, then:

  • your hash_file_reference should take a FileReference instance as input, and
  • a FileReference should appear somewhere in your load_data function, either as an input argument or in the body, so that when Streamlit encounters one it knows how to hash it.

So here’s an alternative:

import pandas as pd
import requests
import streamlit as st

url = 'https://www.data.gouv.fr/fr/datasets/r/83cbbdb9-23cb-455e-8231-69fc25d58111'

class FileReference:
    def __init__(self, url):
        self.url = url

def hash_file_reference(f: FileReference):
    r = requests.get(f.url)
    return r.headers['Last-Modified']

@st.cache(hash_funcs={FileReference: hash_file_reference}, suppress_st_warning=True)
def load_data(file: FileReference):
    st.write(f"NOTHING IN CACHE FOR {file}")
    return pd.read_csv(file.url)

st.dataframe(load_data(FileReference(url)))

That way, the cached function should rerun only when the Last-Modified header of the FileReference's URL changes.


I prefer the first solution though, because the inputs of the cached function then explicitly define the (url, date) pair you want Streamlit to cache on :slight_smile: I’d rather keep hash_funcs for processing complex objects like Matplotlib figures.

Hope this helps,
Fanilo :balloon:

Looks fab @andfanilo! I’m bookmarking this for a try later! :heart_eyes: