Loading and caching of models and mutable objects

Hey! First, thanks for the great work on this – I love the Python-first approach and the thoughtful API.

I’ve been trying to integrate spaCy, especially the built-in visualizers (see /usage/visualizers – I’m only allowed to post 2 links here). Here’s my progress so far: I’ve managed to build an interactive app that loads a pre-trained model, processes a given text and generates different types of visualizations :tada:

The only thing I’m still not really sure about is how to efficiently cache the loaded model (nlp) and the processed doc. At the moment I’m setting ignore_hash=True, but I’m worried that this might have unintended side-effects. I still occasionally see a “mutated arguments” warning.

In spaCy, the nlp object holds the loaded model weights, word vectors, vocabulary and so on – but it’s also mutable. Ideally you only want to be creating it once and then pass it around. (If I’m writing a REST API, I’d typically load all models once and store them in a global dict.) Same with the doc object: for each text the user enters, I’d ideally want to create the object only once.
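For illustration, the pattern I mean is roughly this (simplified, the function and variable names are just placeholders):

import spacy

MODELS = {}  # global dict that keeps each loaded pipeline around

def get_model(name):
    # load the model only the first time it's requested, then reuse it
    if name not in MODELS:
        MODELS[name] = spacy.load(name)
    return MODELS[name]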

What’s the best way to solve this? Maybe there’s also something obvious I’m missing here – I really only just got started :slightly_smiling_face:

Hi Ines! Great questions. I’ll answer inline below:

If I’m writing a REST API, I’d typically load all models once and store them in a global dict.

I don’t have any experience with spaCy, but if storing the model in a global dict is the right approach for you, then you’re in luck! That’s exactly what @st.cache does :smiley:
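Roughly, something like this should work (I haven’t tested it with spaCy myself, and the model name here is just an example):

import streamlit as st
import spacy

@st.cache  # the function body only runs on the first call; later reruns reuse the cached result
def load_model(name):
    return spacy.load(name)

nlp = load_model("en_core_web_sm")
st.write("Pipeline loaded:", nlp.pipe_names)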

the nlp object holds the loaded model weights, word vectors, vocabulary and so on – but it’s also mutable.

Just to clarify on a technical point: it’s fine for the return value of an st.cache’d function to be mutable. You just have to make sure you don’t actually mutate it outside that function.

It’s a small distinction (and a bit of a nerdy technicality :nerd_face:), but since most objects in Python are mutable, I thought I might as well clarify!
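Here’s a tiny (non-spaCy) example of the distinction:

import streamlit as st

@st.cache
def get_items():
    items = [1, 2, 3]
    items.append(4)    # mutating inside the cached function is fine
    return items

items = get_items()
st.write(items[0])     # reading the cached value is fine
# items.append(5)      # mutating the cached value out here is what triggers the warning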

But given what you said above about storing the model in a dict, my guess is that the spacy.Language object (named nlp in your code) isn’t mutated when __call__ed (i.e. on line 23 of your code, which is the only place the model is used outside the cached function).

The only thing I’m still not really sure about is how to efficiently cache the loaded model (nlp) and the processed doc. At the moment I’m setting ignore_hash=True, but I’m worried that this might have unintended side-effects. I still occasionally see a “mutated arguments” warning.

Assuming my guess above is correct, it should be safe to set ignore_hash to True. All that ignore_hash does is turn off the codepath that checks whether the output of a cached function was mutated. And that codepath is only there so we can show a warning to the user.

So ignore_hash doesn’t actually impact the caching itself.

I still occasionally see a “mutated arguments” warning.

This shouldn’t be happening. Can you report a bug next time you see this? Please include the warning you see on the screen plus a code snippet, if possible.


By the way, thanks for the kind words about Streamlit! Makes us feel all warm and fuzzy after we put so much work into it :hugs:

Thanks a lot, this makes sense!

Not in a way that matters, no. Technically speaking, processing a text has an impact on things like the tokenizer cache, which is part of the nlp object. But in this type of application, I know that once I have loaded a given model, I never want to reload it.
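For example (just to illustrate the kind of state change I mean – assuming a model is installed):

import spacy

nlp = spacy.load("en_core_web_sm")
print(len(nlp.vocab))               # number of lexical entries before processing
doc = nlp("Some previously unseen wordzzz")
print(len(nlp.vocab))               # usually larger now – new strings were added while processing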

I just changed my function to only take the model name instead of the loaded nlp object (returned by the cached function) and haven’t seen the warning since. So it’s possible that this was the underlying problem.
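In case it’s useful, here’s roughly what the change looks like (simplified):

import streamlit as st
import spacy

@st.cache(ignore_hash=True)
def load_model(name):
    return spacy.load(name)

@st.cache(ignore_hash=True)
def process_text(model_name, text):
    # only hashable strings are passed in; the loaded nlp object is never
    # handed to the cached function as an argument
    nlp = load_model(model_name)
    return nlp(text)

doc = process_text("en_core_web_sm", "Hello world")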

Thanks again for your help! :slightly_smiling_face:

Not in a way that matters, no. Technically speaking, processing a text has an impact on things like the tokenizer cache, which is part of the nlp object. But in this type of application, I know that once I have loaded a given model, I never want to reload it.

Good to know!

spaCy looks great, by the way. Now I want to go play with it this weekend :smiley:

I just changed my function to only take the model name instead of the loaded nlp object (returned by the cached function) and haven’t seen the warning since. So it’s possible that this was the underlying problem.

Actually that makes a lot of sense. I forgot we also hash the input arguments, since that’s how we know whether each function call is a cache hit or a cache miss.
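Conceptually (this is not Streamlit’s actual implementation, just the rough idea), it’s like keying a memo dict on a hash of the arguments:

_cache = {}

def memoized_call(func, *args):
    key = hash(args)            # a hash of the inputs decides hit vs. miss
    if key not in _cache:       # miss: run the function and remember the result
        _cache[key] = func(*args)
    return _cache[key]          # hit: reuse the stored result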

That said, I can’t reproduce the issue. This doesn’t show a warning for me:

import streamlit as st
import spacy

@st.cache
def dummy_function(arg1):  # try to break Streamlit's argument-hashing
    return arg1  # try to break Streamlit's return-value-hashing

nlp = spacy.load('en_core_web_sm')
dummy_function(nlp)

I’m glad the warning is not showing for you anymore, but if you see it again can you post it here? I really want to get to the bottom of this…

Thanks :smiley: If you do, let me know how you go! Maybe you also have some cool ideas for visualising and interacting with NLP models.

I just tested it again and here’s a minimal example that consistently produces the “Cached function mutated its input arguments” warning on load (at least for me):

import streamlit as st
import spacy


@st.cache(ignore_hash=True)
def load_model(name):
    return spacy.load(name)


@st.cache(ignore_hash=True)
def process_text(nlp, text):
    return nlp(text)


nlp = load_model("en_core_web_sm")
doc = process_text(nlp, "Hello world")

Awesome. Created a bug for this: https://github.com/streamlit/streamlit/issues/157

Thanks!

I ran your code snippet. spaCy is so cool!! :sunglasses:

You’re right that the caching optimization in nlp(text) is what Streamlit is detecting and complaining about. :grimacing: Thank you for bringing this to our attention!

Your provisional solution to make the nlp argument to process_text implicit via a closure works :tada: because Streamlit currently doesn’t perform mutation checks on such implicit inputs, although this likely will change in the future.
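For anyone following along, that workaround looks roughly like this (simplified):

import streamlit as st
import spacy

@st.cache(ignore_hash=True)
def load_model(name):
    return spacy.load(name)

nlp = load_model("en_core_web_sm")

@st.cache(ignore_hash=True)
def process_text(text):
    # nlp comes from the enclosing scope rather than being passed as an argument,
    # so it currently isn't part of the input hashing / mutation checks
    return nlp(text)

doc = process_text("Hello world")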

I think the eventual solution could be to give the user more control over Streamlit’s hashing functionality.

Thanks for bringing this to our attention and for using Streamlit. Please follow these issues to keep up-to-date on the solutions:

  1. Issue 153
  2. Issue 157