Hash function error for uploaded text file

Hi,

I am new to Streamlit. While trying to upload a text file using @st.cache, as shown in the code below, I am getting an error telling me to use a hash function on it. Could anyone please help me resolve this issue?

```
import pandas as pd
import streamlit as st

df_uploaded = st.sidebar.file_uploader('Choose txt file:', type=['txt', 'csv'])

@st.cache(show_spinner=True)
def load_data(file_uploaded):
    return pd.read_table(file_uploaded, header=None, encoding='utf-8')

if df_uploaded:
    temp = load_data(df_uploaded)
```

**UnhashableType**: Cannot hash object of type `_io.StringIO`

While caching some code, Streamlit encountered an object of type `_io.StringIO`. You'll need to help Streamlit understand how to hash that type with the `hash_funcs` argument. For example:

```
@st.cache(hash_funcs={_io.StringIO: my_hash_func})
def my_func(...):
    ...
```

Hi @santosh_boina,

You are hitting a case where Streamlit doesn't know how to compute a hash for an object of type StringIO, which is what your uploaded file passes through before pd.read_table reads it. Streamlit needs that hash to tell whether it has already computed and cached a result for the same uploaded file, so we need to indicate how to hash the type through the hash_funcs argument.

Since it's a string buffer, I think a good first approach is to read the whole content of the uploaded file and have Streamlit hash that: if you upload the same file, Streamlit "checks" the content, and if it matches a previously uploaded file, it skips the computation and fetches the cached pandas DataFrame.
You can get the whole file content through StringIO.getvalue, so the following should work:

```
from io import StringIO

@st.cache(hash_funcs={StringIO: StringIO.getvalue})
def load_data(file_uploaded):
    ...
```

Unfortunately, I think that makes the function read the full file twice: once for cache detection and once for the actual computation, which can be slow for a big file. A better way would be to read only the beginning of the buffer in the hash_funcs instead of the whole file :slight_smile:.
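For instance, here is a minimal sketch of that idea. The hash_prefix name and the 4 KB prefix length are my own illustrative choices, not part of Streamlit's API:

```
from io import StringIO

import pandas as pd
import streamlit as st

# Hypothetical partial-read hash: peek at the first 4 KB of the buffer,
# then rewind so the cached function can still read it from the start.
def hash_prefix(buffer):
    prefix = buffer.read(4096)
    buffer.seek(0)
    return prefix

@st.cache(hash_funcs={StringIO: hash_prefix})
def load_data(file_uploaded):
    return pd.read_table(file_uploaded, header=None, encoding='utf-8')
```

Note the trade-off: two different files that share the same first 4 KB would collide on the same cache entry.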

If you are new to Streamlit and want to learn more about it (especially how it checks for objects it has already run on), you may benefit from reading the Caching and Advanced caching docs for other caching techniques.

Best of luck!


Thanks @andfanilo for explaining its usage. It's working now. But when I use checkbox/selectbox/radio widgets to filter the data shown in the plots, it reruns the entire app.py script, starting from loading the data again. Have I overlooked or missed something in my code that would avoid this kind of problem?

My code is given here:

```
from io import StringIO

df_uploaded = st.sidebar.file_uploader('Choose txt file:', type=['txt', 'csv'])

@st.cache(hash_funcs={StringIO: StringIO.getvalue})
def load_data(file_uploaded):
    return pd.read_table(file_uploaded, header=None, encoding='utf-8')

if df_uploaded:
    temp = load_data(df_uploaded)
    select_cols = st.selectbox('Choose column for analysis', list(temp.columns))
    if select_cols:
        ...  # code to plot the variable distribution/scatter plot
```

Hey @santosh_boina :wave:,

This seems to be a bug on our side; I filed an issue for it here. Thanks for bringing it to our attention, and feel free to follow or add any additional context in the GitHub issue!

Hi @santosh_boina,

I'm not sure there is a bug here. Is the issue that your report is running slowly, giving the impression that the data is being reloaded?

Perhaps this is due to your hash_funcs, which uses getvalue to determine the hash, thus retrieving the entire contents of the file each time we call load_data.

Could you try sampling the data to determine the hash?

This should give you a speed improvement; however, the hash will be slightly less reliable at determining true equality of the given StringIO object.
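As a rough sketch of that sampling idea (sample_hash and the 1 KB head size are hypothetical choices, not a documented recipe):

```
from io import StringIO

import pandas as pd
import streamlit as st

# Hypothetical sampled hash: combine the buffer's total size with its
# first kilobyte, so hashing stays cheap even on large uploads.
def sample_hash(buffer):
    head = buffer.read(1024)
    buffer.seek(0, 2)   # seek to the end to learn the total size
    size = buffer.tell()
    buffer.seek(0)      # rewind for the real read
    return (size, head)

@st.cache(hash_funcs={StringIO: sample_hash})
def load_data(file_uploaded):
    return pd.read_table(file_uploaded, header=None, encoding='utf-8')
```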

Apologies for the late reply. Yes, you are right, it is not a bug with those widgets. It seems the problem is with the data I am uploading and the hash function.

Returning the full text contents feels a bit heavy for a hash; will Streamlit then be storing it internally as a key?

I've been using something like this to base the hash on an MD5 digest, which has been fine on 25-30 MB files:

```
import hashlib
import io

import pandas as pd
import streamlit as st

def hash_io(input_io):
    # Read the whole buffer, then rewind so the cached function
    # can read it again from the start.
    data = input_io.read()
    input_io.seek(0)
    if isinstance(data, str):
        data = data.encode("utf-8")
    return hashlib.md5(data).hexdigest()


@st.cache(hash_funcs={io.BytesIO: hash_io, io.StringIO: hash_io})
def load_data(file_data):
    try:
        return pd.read_csv(file_data)
    except Exception:
        file_data.seek(0)  # rewind after the failed CSV parse
        return pd.read_excel(file_data)
```

We should definitely be supporting BytesIO/StringIO directly in @st.cache (probably using @Ian_Calvert’s scheme here)! I’ll make sure that happens.


This is now being tracked as GitHub issue 1180. :v: Thanks for your help, all! :heart:


Hey all :wave:,

Supporting BytesIO/StringIO directly in @st.cache was officially merged in 0.57.0 :partying_face:. Also included in 0.57.0 are more detailed error messages for st.cache to help with debugging similar issues.
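For anyone finding this thread later, here is a minimal sketch of what the original example reduces to, assuming Streamlit 0.57.0 or newer (the read_table parameters are carried over from the snippets above):

```
import pandas as pd
import streamlit as st

# On Streamlit >= 0.57.0, st.cache hashes BytesIO/StringIO buffers itself,
# so no hash_funcs argument is needed for uploaded files.
@st.cache(show_spinner=True)
def load_data(file_uploaded):
    return pd.read_table(file_uploaded, header=None, encoding='utf-8')

df_uploaded = st.sidebar.file_uploader('Choose txt file:', type=['txt', 'csv'])
if df_uploaded:
    temp = load_data(df_uploaded)
```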
