I’ve always heard that CPython’s threading is of limited use because of the GIL, and that in practice this means spawned threads cannot truly run in parallel. Is that the case in Streamlit, or is the GIL somehow bypassed?
Let’s say I am using large language models in my app and I cache them with st.cache_resource. Suppose the app uses a model that takes up 1 GB of memory. What is the impact of spawning different threads for different users? Is the model copied for every user, so that with ten users we are now using 10 GB of RAM? Or do they all reference the same model (and thus computation is slow and not distributed at all)?
Bonus question: same as above but with GPUs. If a model is running on a GPU, is it copied N times for N users?
I’d love to know, and also whether we can control the last part one way or another: either how often resources get parallelized, or how we can scale Streamlit plus large ML models on a local server.
CPython threads cannot take advantage of multiple cores, but processes can. I don’t think Streamlit uses processes (I might be wrong), but you can use them in your app.
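To illustrate the point about processes, here is a minimal sketch (the `cpu_bound` function is a made-up stand-in for real model work): a pure-Python loop holds the GIL, so threads would not speed it up, but `concurrent.futures.ProcessPoolExecutor` runs each task in a separate process with its own interpreter and its own GIL.

```python
from concurrent.futures import ProcessPoolExecutor

def cpu_bound(n):
    # Pure-Python arithmetic: this holds the GIL the whole time,
    # so threads cannot run it in parallel -- but processes can.
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == "__main__":
    # Four worker processes, each crunching independently on its own core.
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(cpu_bound, [100_000] * 4))
    print(len(results))  # 4
```

The same pattern works inside a Streamlit callback, with the usual caveat that worker functions must be importable (defined at module level) so they can be pickled.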
Cached objects are shared across all users, sessions, and reruns. They must be thread-safe because they can be accessed from multiple threads concurrently.
But I don’t see how that relates to slowness. The code running the model might spawn processes to distribute the computation or even release the GIL and spawn threads that run simultaneously in several cores (some libraries like numpy can do that). So it depends.
See above, Streamlit by itself won’t make copies of the cached object. Your code (or library code that your code is calling) may or may not make such copies, but that is orthogonal to using Streamlit.
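Conceptually, this is what sharing a cached resource looks like. The sketch below uses a hypothetical `FakeModel` and a hand-rolled `get_model` instead of st.cache_resource itself, but the shape is the same: the object is built once, every session’s thread gets the same instance, and mutable state on it should be guarded with a lock.

```python
import threading

class FakeModel:
    """Stand-in for a 1 GB model; only ever constructed once."""
    instances = 0

    def __init__(self):
        FakeModel.instances += 1
        self.call_count = 0
        self._lock = threading.Lock()

    def predict(self, x):
        # Many session threads share this one object, so any
        # mutable state must be protected.
        with self._lock:
            self.call_count += 1
        return x * 2

_model = None
_model_lock = threading.Lock()

def get_model():
    # Roughly what a cache does: build on first access, then reuse.
    global _model
    with _model_lock:
        if _model is None:
            _model = FakeModel()
    return _model

# Ten "sessions" hitting the model concurrently...
threads = [threading.Thread(target=lambda: get_model().predict(1))
           for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(FakeModel.instances)  # 1 -- ten sessions, one model in memory
```

So ten users cost you one model’s worth of RAM, not ten; whether the calls to it run in parallel is a separate question, answered below.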
With respect to your answer to the second question: if Streamlit indeed spawns new threads for new sessions, and users share the cached objects, then it seems impossible for parallelism to take place — first because of the GIL, and second because they are all sharing the same ML model to run input through. So the different user sessions will all use just the single, cached, shared instance of the ML model, and hence it is “slow”. Am I mistaken?
Parallelism is still possible in the two ways I mentioned: spawning new processes and calling library code that releases the GIL. Any Python application can do this, and Streamlit applications are no exception.
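The second way — library code releasing the GIL — can be shown with the standard library alone. CPython’s hashlib releases the GIL while hashing buffers larger than a couple of kilobytes, so plain threads really do run on separate cores here, the same way numpy or a C/CUDA-backed inference call can. (The function name and workload are just for illustration.)

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def digest(chunk: bytes) -> str:
    # For large inputs, hashlib releases the GIL during the C hashing
    # loop, so these threads can execute simultaneously on several cores.
    return hashlib.sha256(chunk).hexdigest()

chunks = [bytes([i]) * 1_000_000 for i in range(4)]
with ThreadPoolExecutor(max_workers=4) as pool:
    digests = list(pool.map(digest, chunks))

print(len(digests))  # 4
```

Whether your model benefits depends on whether the library it runs on (torch, numpy, etc.) releases the GIL during its heavy lifting — most native inference code does.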
So the different user sessions will all make use of just the single, cached and shared, instance of the ML model
Applications have access to that shared instance and can do anything with it, including making copies, if that is what worries you.
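For example, a session that needs its own isolated copy can just deep-copy the shared object — at the cost of duplicating its memory footprint. A tiny sketch with a dict standing in for the cached model:

```python
import copy

shared = {"weights": [0.1, 0.2, 0.3]}  # stand-in for the cached model

def session_copy(model):
    # Each session may take a private deep copy if it needs to mutate
    # the model; the price is one extra model's worth of memory.
    return copy.deepcopy(model)

mine = session_copy(shared)
mine["weights"][0] = 9.9

print(shared["weights"][0])  # 0.1 -- the shared instance is untouched
```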