Loading and caching of models and mutable objects

Hey! First, thanks for the great work on this – I love the Python-first approach and the thoughtful API.

I've been trying to integrate spaCy, especially the built-in visualizers (see /usage/visualizers – I'm only allowed to post 2 links here). Here's my progress so far: I've managed to build an interactive app that loads a pre-trained model, processes a given text and generates different types of visualizations :tada:

The only thing I'm still not really sure about is how to efficiently cache the loaded model (nlp) and the processed doc. At the moment, I'm setting ignore_hash=True, but I'm worried that this might have unintended side effects. I still occasionally see a "mutated arguments" warning.

In spaCy, the nlp object holds the loaded model weights, word vectors, vocabulary and so on – but it's also mutable. Ideally, you only want to create it once and then pass it around. (If I'm writing a REST API, I'd typically load all models once and store them in a global dict.) Same with the doc object: for each text the user enters, I'd ideally want to create the object only once.

What's the best way to solve this? Maybe there's also something obvious I'm missing here – I really only just got started :slightly_smiling_face:

Hi Ines! Great questions. I'll answer inline below:

If I'm writing a REST API, I'd typically load all models once and store them in a global dict.

I don't have any experience with spaCy, but if storing the model in a global dict is the right approach for you, then you're in luck! That's exactly what @st.cache does :smiley:
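
For example, a minimal sketch of that pattern might look like this (the function and model names here are just placeholders):

import streamlit as st
import spacy

@st.cache
def load_model(name):
    # spacy.load() only runs on the first call for a given name; later
    # calls with the same name return the cached object, much like
    # looking it up in a global dict.
    return spacy.load(name)

nlp = load_model("en_core_web_sm")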

the nlp object holds the loaded model weights, word vectors, vocabulary and so on – but it's also mutable.

Just to clarify a technical point: it's fine for the return value of an st.cache'd function to be mutable. You just have to make sure you don't actually mutate it outside that function.

It's a small distinction (and a bit of a nerdy technicality :nerd_face:), but since most objects in Python are mutable, I thought I might as well clarify!
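
A toy example of the distinction (nothing spaCy-specific here):

import streamlit as st

@st.cache
def get_settings():
    # Returning a mutable object like a dict is fine.
    return {"lang": "en"}

settings = get_settings()
st.write(settings["lang"])   # reading the cached object: fine
# settings["lang"] = "de"    # mutating it outside the function: not fine,
#                            # since every rerun shares this cached dict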

But given what you said above about storing the model in a dict, my guess is the spacy.Language object (named nlp in your code) isn't mutated when __call__ed (i.e. line 23 in your code, which is the only one that uses the model outside the cached function).

The only thing I'm still not really sure about is how to efficiently cache the loaded model (nlp) and the processed doc. At the moment, I'm setting ignore_hash=True, but I'm worried that this might have unintended side effects. I still occasionally see a "mutated arguments" warning.

Assuming my guess above is correct, it should be safe to set ignore_hash to True. All that ignore_hash does is turn off the codepath that checks whether the output of a cached function was mutated. And that codepath is only there so we can show a warning to the user.

So ignore_hash doesn't actually impact the caching itself.
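
Roughly speaking, something like this toy example is still cached across reruns; ignore_hash just means Streamlit skips hashing the returned object to look for outside mutations:

import streamlit as st

@st.cache(ignore_hash=True)
def expensive_computation():
    # Still computed once and reused on later runs.
    return [1, 2, 3]

data = expensive_computation()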

I still occasionally see a "mutated arguments" warning.

This shouldn't be happening. Can you report a bug next time you see this? Please include the warning you see on the screen plus a code snippet, if possible.


By the way, thanks for the kind words about Streamlit! Makes us feel all warm and fuzzy after we put so much work into it :hugs:

Thanks a lot, this makes sense!

Not in a way that matters, no. Technically speaking, processing a text does affect things like the tokenizer cache, which is part of the nlp object. But in this type of application, I know that once I have loaded a given model, I never want to reload it.
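
Roughly, you can observe that kind of internal change like this (the vocab size is just an easy-to-inspect proxy; the exact numbers depend on the model and text):

import spacy

nlp = spacy.load("en_core_web_sm")
before = len(nlp.vocab)
doc = nlp("A sentence with words the model hasn't cached yet.")
after = len(nlp.vocab)
# `after` can be larger than `before`: processing text can add entries to
# internal caches (vocab, tokenizer cache), so the nlp object's state
# changes even though the model itself is unchanged.
print(before, after)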

I just changed my function to only take the model name instead of the loaded nlp object (returned by the cached function) and haven't seen the warning since. So it's possible that this was the underlying problem.
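
Concretely, the change looks roughly like this (a sketch: process_text now receives only strings and fetches nlp from the cached loader itself):

import streamlit as st
import spacy


@st.cache(ignore_hash=True)
def load_model(name):
    return spacy.load(name)


@st.cache(ignore_hash=True)
def process_text(model_name, text):
    # Only plain strings are passed in, so the nlp object is never hashed
    # or checked for mutations as an input argument.
    nlp = load_model(model_name)
    return nlp(text)


doc = process_text("en_core_web_sm", "Hello world")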

Thanks again for your help! :slightly_smiling_face:

Not in a way that matters, no. Technically speaking, processing a text does affect things like the tokenizer cache, which is part of the nlp object. But in this type of application, I know that once I have loaded a given model, I never want to reload it.

Good to know!

spaCy looks great, by the way. Now I want to go play with it this weekend :smiley:

I just changed my function to only take the model name instead of the loaded nlp object (returned by the cached function) and haven't seen the warning since. So it's possible that this was the underlying problem.

Actually, that makes a lot of sense. I forgot we also hash the input arguments, since that's how we know whether each function call is a cache hit or a cache miss.
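
In other words, the arguments effectively act as the cache key, something like this toy example:

import streamlit as st

@st.cache
def square(x):
    print("computing", x)  # the body only runs on a cache miss
    return x * x

square(3)  # cache miss: computes and stores the result
square(3)  # cache hit: returns the stored result
square(4)  # different argument hash, so another cache miss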

That said, I can't reproduce the issue. This doesn't show a warning for me:

import streamlit as st
import spacy

@st.cache
def dummy_function(arg1):  # try to break Streamlit's argument-hashing
    return arg1  # try to break Streamlit's return-value-hashing

nlp = spacy.load('en_core_web_sm')
dummy_function(nlp)

I'm glad the warning is not showing for you anymore, but if you see it again, can you post it here? I really want to get to the bottom of this…

Thanks :smiley: If you do, let me know how you go! Maybe you also have some cool ideas for visualising and interacting with NLP models.

I just tested it again and here's a minimal example that consistently produces the "Cached function mutated its input arguments" warning on load (at least for me):

import streamlit as st
import spacy


@st.cache(ignore_hash=True)
def load_model(name):
    return spacy.load(name)


@st.cache(ignore_hash=True)
def process_text(nlp, text):
    return nlp(text)


nlp = load_model("en_core_web_sm")
doc = process_text(nlp, "Hello world")

Awesome. Created a bug for this: https://github.com/streamlit/streamlit/issues/157

Thanks!

I ran your code snippet. spacy is so cool!! :sunglasses:

You're right that the caching optimization in nlp(text) is what Streamlit is detecting and complaining about. :grimacing: Thank you for bringing this to our attention!

Your provisional solution of making the nlp argument to process_text implicit via a closure works :tada: because Streamlit currently doesn't perform mutation checks on such implicit inputs, although this will likely change in the future.
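
For reference, the "implicit input" shape is roughly this (a sketch; nlp is a free variable inside the cached function rather than an argument, so Streamlit never hashes it):

import streamlit as st
import spacy

nlp = spacy.load("en_core_web_sm")  # in a real app this would itself come from a cached loader

@st.cache(ignore_hash=True)
def process_text(text):
    # nlp is captured from the enclosing scope, not passed in, so it
    # currently isn't part of the input-argument mutation check.
    return nlp(text)

doc = process_text("Hello world")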

I think the eventual solution could be to give the user more control over Streamlit's hashing functionality.

Thanks again for bringing this to our attention and for using Streamlit. Please follow these issues to keep up to date on the solutions:

  1. Issue 153
  2. Issue 157