Caching with hash_funcs fails for similar methods

Hello,

First post out here. Found out about streamlit last week, and it has been a game changer in my workflow. Thank you for making it!

I've run into some trouble with caching. The following is an example of a case that I don't fully understand. When I uncomment the line, the caching works as I expect it to. However, when the line is commented out, both functions seem to share the same hash, which results in them being called every time I press run.

import streamlit as st

MY_HASH = {str: lambda _: None}

@st.cache(max_entries=1, hash_funcs=MY_HASH, suppress_st_warning=True)
def func1(bool_arg, str_arg):
    st.write(f'Ran func 1 - {str_arg}')
    return []

@st.cache(max_entries=1, hash_funcs=MY_HASH, suppress_st_warning=True)
def func2(bool_arg, str_arg, str_arg2):
    st.write(f'Ran func 2 - {str_arg}')
    # st.write(f'Ran func 2 - {str_arg2}')  # Uncomment this line to make the code work as intended
    return []

func1(False, '1')
func2(False, '2', '3')

st.button('run')

I was able to make it work by uncommenting the line, removing the custom hash function, or changing max_entries to 2. Is this expected behavior?

I'm working with Streamlit 0.70.0 on Python 3.7.9.

Thank you,
Saurabh Parikh

Hi @Saurabh, welcome to the forum :wave:

Your functions are sharing a cache because they have the same cache key.

We create a unique key for each function's cache by hashing the function, and in your example both functions produce the same hash.

When hashing a function, we hash its default arguments and the bytecode of the function body.

The default args for both functions are None, and the function bodies are equivalent since you're overriding the hash_func for str to return None.

In essence, this is how your functions appear to the hasher.

def func1(None): <--- defaults
    st.write(None) <--- result of your hash func
    return []

def func2(None):
    st.write(None)
    return []

Since they share a cache and your max_entries is set to 1, the call to func2 is evicting the results from the call to func1.
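To make the eviction concrete, here's a minimal sketch using functools.lru_cache with maxsize=1 as a stand-in for st.cache with max_entries=1 (this is not Streamlit's actual cache, but the eviction behavior is analogous):

```python
from functools import lru_cache

calls = []

@lru_cache(maxsize=1)
def compute(key):
    calls.append(key)           # record every real (non-cached) computation
    return f'result-{key}'

compute('a')   # computed and cached
compute('b')   # computed; evicts 'a' because maxsize=1
compute('a')   # computed again: 'a' was evicted by 'b'
print(calls)   # ['a', 'b', 'a']
```

When the two decorated functions share one cache of size 1, every alternating call behaves like this: each call evicts the other's entry, so nothing is ever served from cache.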


If you'd like to confirm this by seeing the hash value of the function, you could do

import hashlib
import streamlit as st
from streamlit.hashing import _CodeHasher

MY_HASH = {str: lambda _: None}  # the same hash_funcs as in your snippet

def check_hash(f):
    hasher = hashlib.new("md5")
    ch = _CodeHasher(hash_funcs=MY_HASH)
    ch.update(hasher, f)
    st.write(hasher.digest())

check_hash(func1)
check_hash(func2)

With that said, can you tell me what you're trying to do?

  • Why max_entries of 1?
  • Why hash str to None?
  • Are you playing around with caching or is there a real world use case?

After discussing with the team, there does seem to be a bug here though!

The two functions shouldn't be sharing a cache.

In fact, we hash func.__module__ and func.__name__ along with the function defaults and body precisely to differentiate between two functions that would otherwise be the same.

The bug is that we're applying the hash_funcs to the function's module and name when we shouldn't be; as this example shows, that makes the function names hash to the same value, which is unwanted behavior.
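As a hedged illustration (this is my own sketch, not Streamlit's actual code), here is what goes wrong if the user-supplied str hash func gets applied to the module and name:

```python
# Sketch only: applying the user-supplied str hash func to __module__
# and __name__ erases the one thing that tells the two functions apart.
MY_HASH = {str: lambda _: None}

def func1(): pass
def func2(): pass

def key_material(f):
    str_hash = MY_HASH[str]
    return (str_hash(f.__module__), str_hash(f.__name__))

print(key_material(func1) == key_material(func2))  # True: identical key material
```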

Issue filed here

Hi @Jonathan_Rhone,

Thanks for getting back to me. Good to know it's a bug. I do have a use case which requires max_entries=1 with custom hash_funcs setting str and int to None. Should I report the bug on GitHub or have you already created a ticket to track it? (EDIT: I see you added it while I was commenting)

Before that though: the hashing of the function is not done on the text of the function, but rather applies the hash_func to each element within the function? So a string inside the function is converted to None by the hash_func? (EDIT2: I ask because my two functions are very different, yet they seemed to share the cache)

I'm sorry I couldn't try out the check_hash function you provided. Where did you install hashlib from? The pip install gives me an error. Which library does _CodeHasher belong to?

The code is much too large to showcase all of it here, but this is the pseudocode for it.

# Sample is custom class which stores information for
# two related elements DNA and Protein
from sample import Sample

MY_HASH = {
    Sample: function_to_identify_unique_samples,
    str: lambda _: None,
    int: lambda _: None,
    list: lambda _: None
}

# The two functions are
@st.cache(max_entries=1, hash_funcs=MY_HASH, allow_output_mutation=True)
def preprocess_dna(sample, bool_arg, *other_str_int_list_args):
    # A time consuming function, hence caching is used.
    # Only to be called for a new sample or when the bool changes from False to True (This comes from a button in the interface).
    # It should only store the state of one sample at a time (Memory intensive otherwise), Hence max_entries=1
    # It should not be called when other arguments change. Hence the custom hash
    # The returned Dna object is mutated afterwards.
    # Since it returns an object which is used ahead in the pipeline, I have to call this function in every run.
    # Hence, it cannot be called only when the button is pressed i.e. if st.Button(): preprocess_dna(); does not work.
   
    return Dna


@st.cache(max_entries=1, hash_funcs=MY_HASH, allow_output_mutation=True)
def preprocess_protein(sample, bool_arg, *other_str_int_list_args):
    # Same as preprocess_dna but for protein

    return Protein

Now it turned out that even though there were differences between the functions, it did not store the cache properly as illustrated in the example I shared earlier.

Not sure how informative you found this. Let me know if you want to see the entire app though, I'll try to make it available.

For now, I solved the problem with session state: I store the Dna and Protein objects in the session state, manually check within the function for changes, and if there are none I return the stored object. It's not the cleanest solution, though.
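In pseudocode, the workaround looks roughly like this (a plain dict stands in for the session state object, since the session-state API varies across Streamlit versions; the names are illustrative, not the real app's code):

```python
# Rough sketch of the manual-cache pattern: a dict stands in for
# session state, and preprocess_dna / sample_id are illustrative names.
state = {}
compute_count = 0  # counts how often the expensive work actually runs

def preprocess_dna(sample_id, recompute=False):
    global compute_count
    if recompute or state.get('dna_sample') != sample_id:
        compute_count += 1              # the expensive preprocessing goes here
        state['dna_sample'] = sample_id
        state['dna'] = f'Dna({sample_id})'
    return state['dna']

preprocess_dna('s1')   # computed: nothing cached yet
preprocess_dna('s1')   # cached: same sample, no recompute flag
preprocess_dna('s2')   # computed: sample changed
print(compute_count)   # 2
```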

Thanks,
Saurabh Parikh

I'm sorry I couldn't try out the check_hash function you provided

Ah my apologies, I updated the code snippet above to include the imports.

the hashing of the function is not done on the text of the function but rather uses the hash_func on each element within the function?

Yes, we hash the bytecode of the function as well as the constants that are referenced by the bytecode.

So the strings, even though they are different, will be evaluated by the internal or user provided hash funcs, which in this case is hashing them to None.
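You can inspect those constants yourself via the function's code object. A small sketch (my own illustration, not Streamlit internals) of how the str hash func collapses them:

```python
# The constants a function's bytecode references live in f.__code__.co_consts,
# and the user-supplied hash func maps every str constant to None.
MY_HASH = {str: lambda _: None}

def func1():
    return 'Ran func 1'

def func2():
    return 'Ran func 2'

def hashed_consts(f):
    # apply the user hash func for each constant's type, if one is registered
    return tuple(MY_HASH.get(type(c), lambda x: x)(c) for c in f.__code__.co_consts)

print(func1.__code__.co_consts == func2.__code__.co_consts)  # False: strings differ
print(hashed_consts(func1) == hashed_consts(func2))          # True: both collapse to None
```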

For now, I solved the problem by usingā€¦

Aha gotcha, it's an interesting case where you only want some of the function arguments to impact the function cache. The session state solution doesn't sound so bad, given that even if the bug wasn't there, you'd be using hash_funcs in an unusual way.

Might be worth a separate post in the forum to see if anyone has any alternate solutions that might be cleaner for you.

You could also continue using st.cache and hack around the bug by introducing an arg with a default value that isn't an int, str or list.

@st.cache(max_entries=1, hash_funcs=MY_HASH)
def func1(bool_arg, str_arg, float_arg=1.1):
    st.write('1')

@st.cache(max_entries=1, hash_funcs=MY_HASH)
def func2(bool_arg, str_arg, float_arg=2.2):
    st.write('2')

Yes, I tried adding a random_state float argument to call the function only when needed. But then I ran into an issue with how buttons work: I couldn't figure out how to call the function only when the variable changes from False to True, and then use the cached value when it changes back from True to False on the next run (without using session state). So maybe session state was the only viable option, but I'm glad we found the bug.

Thank you @Jonathan_Rhone