Using cache functionality + hashing


I created a function to read a CSV. That CSV is updated at no fixed frequency, and I would like to reload it only if its last modified date has changed.

Steps to reproduce

So far I got something like this:

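from io import StringIO

import pandas as pd
import streamlit as st
from s3fs import S3FileSystem

def get_last_modified(bucket):
    # Ask S3 for the object's last modified timestamp
    s3 = S3FileSystem(anon=False)
    last_modified = s3.modified(bucket)
    return last_modified

@st.cache(hash_funcs={StringIO: get_last_modified})
def load_data(bucket):
    df = pd.read_parquet(bucket)
    return df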

Expected behavior:

I am not sure how to do it, but I would like the load_data function to rerun whenever the last modified date changes.

Actual behavior:

I get the last modified date correctly, but I cannot make the load_data function rerun.

Debug info

  • Streamlit version: 1.11.0
  • Python version: 3.9
  • Using Conda

Any help will be appreciated. Thanks in advance!

How are you calling load_data? I would expect bucket to be a str, but then defining a hash function for StringIO would do nothing.

I think just passing last_modified as a parameter to load_data should work.
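
To illustrate the first point, here is a simplified model of how a type-keyed hash_funcs mapping behaves. It is only a sketch and not Streamlit's actual implementation: make_key and fake_last_modified are hypothetical names, and the bucket path is made up. The custom function is looked up by the argument's type, so it never runs when the argument is a str.

```python
from io import StringIO

def make_key(args, hash_funcs):
    """Build a cache key, using a custom hasher only when an argument's type matches."""
    parts = []
    for arg in args:
        custom = hash_funcs.get(type(arg))
        # Fall back to the built-in hash when no custom function is registered
        parts.append(custom(arg) if custom else hash(arg))
    return tuple(parts)

calls = []  # records every invocation of the custom hash function

def fake_last_modified(obj):
    # Stand-in for get_last_modified; returns a fixed, made-up timestamp
    calls.append(obj)
    return "2022-08-01"

hash_funcs = {StringIO: fake_last_modified}

# A str argument: no entry for str in hash_funcs, so the custom function is skipped
make_key(("s3://example-bucket/",), hash_funcs)
n_after_str = len(calls)

# A StringIO argument: the type matches, so the custom function is called
make_key((StringIO("x"),), hash_funcs)
n_after_buf = len(calls)

print(n_after_str, n_after_buf)
```

In this model the custom function only runs for the StringIO argument, which is why registering it has no effect when load_data receives a plain string.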

Hi Goyo! Thanks for answering, it worked! I added "last_modified" as a parameter to the load function. I even deleted the hash function, as you suggested. This is extremely weird (and, fortunately, beautifully easy). So all I need to make load_data rerun is the parameter (last_modified) that tells it whether it has to be rerun, like this:

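@st.cache  # still cached, but without hash_funcs
def load_data(bucket, last_modified):
    # last_modified is not used in the body; it only takes part in the cache key
    df = pd.read_csv(bucket + "data.csv")
    return df

df = load_data(bucket, last_modified)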

I don't quite understand how it works, though!

The data is stored in the cache under a key built from the values of bucket and last_modified. So both values are used to decide whether there is a cache hit or a cache miss, even if the function body only needs one of them. When last_modified changes, the key changes, there is a cache miss, and the function runs again.
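
The same idea can be tried outside Streamlit. The sketch below is an analogy under stated assumptions, not the Streamlit code from this thread: functools.lru_cache stands in for st.cache, a temporary local file stands in for the S3 object, and load_count is a hypothetical counter added only to show when the function body really runs.

```python
import functools
import os
import tempfile

load_count = 0  # how many times the function body actually ran

@functools.lru_cache(maxsize=None)
def load_data(path, last_modified):
    # last_modified is unused here; it only takes part in the cache key
    global load_count
    load_count += 1
    with open(path) as f:
        return f.read()

def get_last_modified(path):
    # Local stand-in for the S3 lookup: the file's modification time
    return os.path.getmtime(path)

path = os.path.join(tempfile.mkdtemp(), "data.csv")
with open(path, "w") as f:
    f.write("a,b\n1,2\n")

first = load_data(path, get_last_modified(path))   # cache miss: body runs
again = load_data(path, get_last_modified(path))   # same key: cache hit
loads_before_change = load_count

# Update the file and push its modification time forward explicitly,
# so the change is visible even on filesystems with coarse timestamps
old_mtime = get_last_modified(path)
with open(path, "w") as f:
    f.write("a,b\n3,4\n")
os.utime(path, (old_mtime + 10, old_mtime + 10))

updated = load_data(path, get_last_modified(path))  # new key: body runs again
print(loads_before_change, load_count)
```

The body runs once for the two identical calls and once more after the timestamp changes, which mirrors what passing last_modified to the cached function achieves.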

Excellent, that will do it. Thanks a lot!