Help us stress test Streamlit’s latest caching update

Hey Community :wave:,

When building Streamlit apps, it’s always a good idea to wrap all expensive computations and slow data fetches in @st.cache. But as well as st.cache works, we also recognize that in many cases it fails when encountering certain objects like TensorFlow sessions, spaCy objects, Lock objects, and so on.
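
For anyone who needs a refresher, the basic pattern looks something like this (a minimal sketch with a made-up slow function):

import time

import streamlit as st

@st.cache  # cache the result so the slow part only runs once per unique input
def load_data(nrows):
    time.sleep(5)  # stand-in for an expensive computation or slow data fetch
    return list(range(nrows))

data = load_data(1000)  # slow on the first run, instant on reruns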

So over the past few months we’ve slowly been releasing several improvements to how st.cache works. These improvements fall into three categories:

  1. Improvements to the caching logic. For example, we now support caching custom classes out of the box, have better support for tuples, and more.

  2. Improvements to error messages and accompanying documentation.

  3. A new keyword argument called hash_funcs, which allows you to customize the behavior of st.cache for your specific use case. In particular, if you’ve ever encountered an object that st.cache couldn’t handle, hash_funcs now lets you fix that yourself!
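
To give a flavor of it, a hash_funcs declaration looks roughly like this (a minimal sketch: we pick a lock, one of the types the default hasher can’t handle, and simply fall back to the object’s id):

import _thread
import threading

import streamlit as st

# Tell st.cache to hash _thread.RLock objects by their id instead of
# trying (and failing) to hash their internals.
@st.cache(hash_funcs={_thread.RLock: id})
def get_shared_state():
    return {"lock": threading.RLock(), "counter": 0}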

You can find out more about all these changes in our docs:

We’re super excited to release all these changes, but also realize they’re all still very new, and full of rough edges! So we would love some help tracking down issues so we can solve them ASAP.

If you encounter any problems with the latest st.cache updates, please post to this thread. Specifically, whenever you see the warning “Cannot hash object of type _______”, let us know the name of that object and, if possible, provide a short code snippet.

Thank you for your help in making Streamlit better, and we also welcome any other feedback or ideas you have on caching!

5 Likes

I’ll start!

I’m having an issue with caching a loaded TensorFlow Hub model. I get an UnhashableType error on the type ‘google.protobuf.pyext._message.RepeatedScalarContainer’. The error suggests using hash_funcs, but I can’t access that type, so it doesn’t work. I tried wrapping the model in a custom object and forcing ‘id’ as the hashing function, but this doesn’t work either.

I don’t really know if I’m forgetting something obvious or not. Code sample below:

import streamlit as st
import tensorflow_hub as hub

@st.cache
def get_model():
    return hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3")

And the error I get is:

UnhashableType: Cannot hash object of type google.protobuf.pyext._message.RepeatedScalarContainer

Thanks in advance and keep up the good work!

1 Like

Hi @Snertie – Can you tell me what happens when you do something like this:

from google.protobuf.pyext._message import RepeatedScalarContainer 

[...your code...]  

@st.cache(hash_funcs={RepeatedScalarContainer: id})
def get_model():
    return hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3")

?

1 Like

Thanks @nthmost!

I had conflicting packages that wouldn’t let me import RepeatedScalarContainer, but that’s fixed now. However, I now get another error:

UnhashableType: Cannot hash object of type _thread.RLock

FYI the type of the loaded model (which I apparently can’t reach) is returned as:
tensorflow.python.saved_model.load.Loader._recreate_base_user_object.<locals>._UserObject

I resolved the issue using allow_output_mutation=True.
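
For anyone else who hits this, the only change is in the decorator (same snippet as above):

@st.cache(allow_output_mutation=True)
def get_model():
    return hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3")
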
Thanks for the help!

2 Likes

Hi, here is another one:

Cannot hash object of type CompiledFFI

It happens when trying to create a connection to a Snowflake database. I’m providing the code below for completeness, but I’m not sure it really helps, since the CompiledFFI class is not Snowflake-specific. The thing is, I don’t even know where to find this class to implement a custom hash_func… And it’s quite annoying, because I do need to cache the results of database queries.

Thanks for your great software and your assistance :slight_smile:

import snowflake.connector

@st.cache
def get_database_connection():
    return snowflake.connector.connect(
       user='XXXX',
       password='XXXX',
       account='XXXX'
    )
2 Likes

Hi @romeodespres,

In your case, it might work to use allow_output_mutation=True in your st.cache declaration. I.e.:

@st.cache(allow_output_mutation=True)
def get_database_connection():
    return snowflake.connector.connect(
       user='XXXX',
       password='XXXX',
       account='XXXX'
    )

The reason is that this prevents Streamlit from trying to hash the returned connection object (st.cache normally hashes a function’s output to detect mutations between runs).

Let us know if that works!

2 Likes

It does work, thank you! Now that you say it, it seems obvious. Shouldn’t the error message suggest your solution? I believe one reason I didn’t think of it is that the message strongly pointed toward hash_funcs.

While caching some code, Streamlit encountered an object of type CompiledFFI. You’ll
need to help Streamlit understand how to hash that type with the hash_funcs argument.
For example:

@st.cache(hash_funcs={CompiledFFI: my_hash_func})
def my_func(...):
    ...

Please see the hash_funcs documentation for more details.

A short “You can also set allow_output_mutation=True to disable hashing” at the end would have helped me.

2 Likes

UnhashableType : Cannot hash object of type re.Pattern

The cached function is:

def get_config(filename=None, appname='your name'):

which returns a ConfigParser object, which I do want cached!

The hash_funcs no-op works:
@st.cache(hash_funcs={re.Pattern: lambda _: None})
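
Put together, it ends up looking roughly like this (the body below is just a stand-in, since I haven’t posted the real one):

import re
import configparser

import streamlit as st

# The returned ConfigParser holds compiled regex patterns internally,
# so tell st.cache to skip re.Pattern objects when hashing the output.
@st.cache(hash_funcs={re.Pattern: lambda _: None})
def get_config(filename=None, appname='your name'):
    config = configparser.ConfigParser()
    if filename:
        config.read(filename)  # stand-in; the real appname lookup is omitted
    return config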

1 Like

Hi @knorthover,

It sounds like you got your cache function working using hash_funcs. Just wanted to comment for the sake of the thread that yours is also a situation that could be fixed by use of allow_output_mutation=True.
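
That version would just be:

@st.cache(allow_output_mutation=True)
def get_config(filename=None, appname='your name'):
    ...  # same body as before; the returned ConfigParser simply isn't hashed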

Thanks for chiming in!

1 Like

It seems that I cannot use super() in a class declared inside a cached function.
I am trying to use an object whose imports take a long time, so I want to place the imports and the class declaration inside the cached function. However, as soon as I add super() to the subclass, I get the following error:

UserHashError : ’ class

I made the following code to highlight the issue:

import streamlit as st


@st.cache()
def state():
    class Parent:
        def test(self):
            return "parent"

    class child(Parent):
        def test(self):
            par = super().test()
            return "hello"

    test = child()
    return test.test()

st.text(state())

Resulting in the error:

UserHashError : ’ class

Error in C:\Users\xxxxx\Devel\RICS\rics-gui-web\st_test_class.py near line 11 :


If you think this is actually a Streamlit bug, please file a bug report here.

Traceback:

  File "C:\Users\xxxxx\st_test_class.py", line 19, in <module>
    st.text(state())

If we remove the super() line, everything runs as expected.
Is this a bug or am I missing something?

Hey @hcoohb - this looks like a bug! Are you able to move your class declaration out of the cached function, or does it rely on values from within that scope?

In the meantime, I’ve filed a bug, because this shouldn’t be happening (or at the very least, we should have a better error message)!

@tim, thanks for creating the bug report!
For now I can move the class declaration outside the cache, but it would be much neater to move it back inside my cached function, so I will monitor the bug tracker :wink:
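
For reference, the workaround version is just the repro above with the classes hoisted to module level:

import streamlit as st

class Parent:
    def test(self):
        return "parent"

class Child(Parent):
    def test(self):
        par = super().test()  # super() works fine here, outside the cached function
        return "hello"

@st.cache()
def state():
    # only the cheap instantiation and call stay inside the cached function
    return Child().test()

st.text(state())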

Quick update that we’re tracking the mentioned “Cannot hash object of type _______” issues in the following GitHub issues:

Thanks all for helping to track these down :heart:

1 Like

Hi ,

I am using Dask to handle large data in the backend and show a handful of it in the UI.

As we know, if the state of any widget changes, Streamlit reruns the app from the start.

Dask uses async tasks to send and receive large amounts of data in its library calls.

I need to hash the Dask dataframe, but I get the error “Cannot hash object of type _asyncio.Task”
and am asked to create a hash function for the “_asyncio.Task” type.


import streamlit as st
import dask.dataframe as dd

@st.cache()
def get_head(dataframe):
    head = dataframe.head()
    return head

data = dd.read_csv("abcd.csv")
head = get_head(data) ## Causes Error saying "Cannot hash object of type _asyncio.Task"


It gives the error below:


UnhashableType: Cannot hash object of type _asyncio.Task

While caching some code, Streamlit encountered an object of type _asyncio.Task. 
You’ll need to help Streamlit understand how to hash that type with the hash_funcs argument. For example:


@st.cache(hash_funcs={_asyncio.Task: my_hash_func})
def my_func(...):
    ...

The error only comes when I try to put the get_head() function in library code [a Python package].
If I use the function from the same file, it runs without any error.

In general, I need a hash function for the _asyncio.Task type.
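
Something along these lines is what I'm after, although I'm not sure hashing a task by its id is actually meaningful for caching (on CPython with the C accelerator, asyncio.Task should be the same class as _asyncio.Task):

import asyncio

import streamlit as st

# Sketch only: fall back to object identity for the asyncio tasks that the
# default hasher can't handle. Whether id() is a sensible cache key here is
# exactly my open question.
@st.cache(hash_funcs={asyncio.Task: id})
def get_head(dataframe):
    return dataframe.head()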

Any help would be appreciated.

Thanks

2 Likes

Hey @pavansanghavi and welcome to the community :wave:,

Thanks for reporting this, we’re now tracking it as Github issue 1253. Will update the thread when we have more info on it, but feel free to comment on or track the GitHub issue if you’d like as well!

1 Like

Hey all :wave:,

0.57.0 was released yesterday evening, and it gives more detailed st.cache error messages to help with debugging. Also, as of 0.57.0, Streamlit now natively supports the re.Pattern (@knorthover) and BytesIO/StringIO types :partying_face:.

Going forward, if anyone comes across a “Cannot hash object of type _____” error message and needs help, please provide the full error message available on 0.57.0. Feel free to let us know if you have any questions and we’ll message the thread when we have more updates!

Hi @pavansanghavi, could you explain what you mean by this? I’m trying to reproduce but having issues.

I’m trying to cache the results for the following function:

@st.cache()
def load_lunch_tasks(rider_ids,df_tasks):
    all_lunch_tasks = np.array([np.mean(ins.get_lunch_tasks(rider_id, df_tasks)) for rider_id in rider_ids])
    return all_lunch_tasks

but I get the following error:

KeyError : ‘workday’

Streamlit encountered an error while caching the body of load_lunch_tasks() . This is likely due to a bug in codebase/insights.py near line 127 :

  if arrived.day == workday and dt.time(10,30) <= arrived.time() <= dt.time(12,30)] )  # and completed.time()
               for workday in days_worked]
lunch_tasks = list(filter(lambda ts: ts != 0, lunch_tasks))

Here is the full function that seems to be causing the problem. Do you have any idea what the issue might be?

def get_lunch_tasks(rider_id, df=None):
    rider_jobs = np.unique(df.query("FleetId==@rider_id")['bookingId'].values)
    jobs_start_end = pd.DataFrame([get_job_start_end(job_id, df) for job_id in rider_jobs if get_job_start_end(job_id, df) is not None])
    days_worked = np.unique(jobs_start_end.start.dt.day)
    lunch_tasks = [len([arrived for arrived, completed in zip(jobs_start_end.start,jobs_start_end.finish)
      if arrived.day == workday and dt.time(10,30) <= arrived.time() <= dt.time(12,30)] )  # and completed.time()
                   for workday in days_worked]
    lunch_tasks = list(filter(lambda ts: ts != 0, lunch_tasks))
    return lunch_tasks