Hi everyone,
I’m deploying a Streamlit app to Streamlit Cloud that uses spaCy models (xx_ent_wiki_sm, nl_core_news_sm, fr_core_news_sm) and Sentence-Transformers (LaBSE) via PyTorch. The spaCy models are already listed in my requirements.txt as direct .whl links.
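For context, the relevant part of my requirements.txt looks roughly like this (the 3.8.0 version numbers are illustrative; the URLs follow the pattern of the explosion/spacy-models GitHub releases):

```text
spacy
sentence-transformers
# Model version should match the installed spaCy version; 3.8.0 is illustrative
https://github.com/explosion/spacy-models/releases/download/xx_ent_wiki_sm-3.8.0/xx_ent_wiki_sm-3.8.0-py3-none-any.whl
https://github.com/explosion/spacy-models/releases/download/nl_core_news_sm-3.8.0/nl_core_news_sm-3.8.0-py3-none-any.whl
https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-3.8.0/fr_core_news_sm-3.8.0-py3-none-any.whl
```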
Locally everything works fine, but on Streamlit Cloud the app fails during startup with the following error:
```text
2025-04-28 17:59:27,024 — INFO — Use pytorch device_name: cpu
2025-04-28 17:59:27,024 — INFO — Load pretrained SentenceTransformer: sentence-transformers/LaBSE
Traceback (most recent call last):
  File "/mount/src/semartagger/pipeline.py", line 35, in <module>
    nlp_ner = spacy.load("xx_ent_wiki_sm")
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/adminuser/venv/lib/python3.12/site-packages/spacy/__init__.py", line 51, in load
    return util.load_model(
           ^^^^^^^^^^^^^^^^
  File "/home/adminuser/venv/lib/python3.12/site-packages/spacy/util.py", line 472, in load_model
    raise IOError(Errors.E050.format(name=name))
OSError: [E050] Can't find model 'xx_ent_wiki_sm'. It doesn't seem to be a Python package or a valid path to a data directory.
```
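For completeness, the loading code in pipeline.py is nothing exotic (simplified here, variable names shortened):

```python
import spacy
from sentence_transformers import SentenceTransformer

# Multilingual NER model plus the Dutch and French pipelines
nlp_ner = spacy.load("xx_ent_wiki_sm")
nlp_nl = spacy.load("nl_core_news_sm")
nlp_fr = spacy.load("fr_core_news_sm")

# LaBSE embeddings, fetched from the Hugging Face Hub on first load
embedder = SentenceTransformer("sentence-transformers/LaBSE")
```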
Since E050 means the model isn't even installed as a Python package, it seems like Streamlit Cloud is either not installing the model wheels from requirements.txt or not persisting runtime downloads between deployments.
I’m also concerned that loading the LaBSE model via Hugging Face / PyTorch might hit similar issues in the future if runtime downloads aren’t reliable.
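One thing I’m already considering is caching the loaded models with st.cache_resource, so each model is loaded at most once per running container rather than on every rerun. A minimal sketch (the function name is mine):

```python
import streamlit as st
import spacy
from sentence_transformers import SentenceTransformer

@st.cache_resource
def load_models():
    # st.cache_resource shares the returned objects across reruns and sessions
    nlp_ner = spacy.load("xx_ent_wiki_sm")
    embedder = SentenceTransformer("sentence-transformers/LaBSE")
    return nlp_ner, embedder

nlp_ner, embedder = load_models()
```

As I understand it, though, this cache only lives in memory inside a running container, so it wouldn’t survive a redeploy or fix a model that was never installed in the first place.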
I’m still quite new to Streamlit and deployment, so apologies if this is a silly or basic question.
Is there a clean way to preload both spaCy and Hugging Face models on Streamlit Cloud? Should I commit the model files to the GitHub repo to guarantee availability, or is a runtime fallback like the sketch below acceptable?
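Here’s the fallback I have in mind, using spacy.cli.download (the programmatic equivalent of python -m spacy download) when the package isn’t installed; the helper name is just for illustration:

```python
import spacy
from spacy.cli import download
from spacy.language import Language

def load_spacy_model(name: str) -> Language:
    """Load a spaCy model, downloading it at runtime if it isn't installed."""
    try:
        return spacy.load(name)
    except OSError:
        # E050: model not installed in this environment; fetch it and retry
        download(name)
        return spacy.load(name)

nlp_ner = load_spacy_model("xx_ent_wiki_sm")
```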
Any best practices for handling large models during deployment would be really appreciated.
Thanks so much!
For reference, this is the GitHub repo.