Understanding scaling and hardware usage


I was reading this thread: Does streamlit is running on a single-threaded development server by default or not? and found that for every user a separate thread is spun up on the CPU. I have a few questions regarding this:

  1. I’ve always heard that CPython’s threading is of limited use because of the GIL, which in practice means that spawned threads cannot truly run in parallel. Is that the case in Streamlit, or is the GIL bypassed somehow?

  2. Let’s say that I am using large language models in my app and I use st.cache_resource for them. Let’s say the app uses a model that takes up 1GB of memory. What is the impact of spawning different threads for different users? Is the model copied for every user, so that ten users now use 10GB of RAM? Or are they all referencing the same model (and thus computation is slow and not distributed at all)?

  3. Bonus question: same as above but with GPUs. If a model is running on a GPU, is it copied N times for N users?

I’d love to know, and also whether we can control the last part one way or another: either how often resources get parallelized, or how we can scale Streamlit + large ML models on a local server.


  1. Because of the GIL, CPython threads executing Python bytecode cannot take advantage of multiple cores, but processes can. I don’t think Streamlit uses processes (I might be wrong) but you can use them in your app.

  2. That’s an easy one:

Cached objects are shared across all users, sessions, and reruns. They must be thread-safe because they can be accessed from multiple threads concurrently.

But I don’t see how that relates to slowness. The code running the model might spawn processes to distribute the computation, or even release the GIL and spawn threads that run simultaneously on several cores (some libraries like NumPy can do that). So it depends.

  3. See above: Streamlit by itself won’t make copies of the cached object. Your code (or library code that your code is calling) may or may not make such copies, but that is orthogonal to using Streamlit.
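
To make the sharing concrete, here is a minimal sketch using only the standard library. It is not Streamlit code: `functools.lru_cache` stands in for `st.cache_resource`, and `FakeModel`, `load_model` and `collect_model_ids` are hypothetical names made up for the illustration. The idea is simply that ten "sessions" running on ten threads all receive the very same object.

```python
import threading
from functools import lru_cache


class FakeModel:
    """Hypothetical stand-in for a large model held in memory."""

    def __init__(self):
        self.weights = [0.0] * 1000


@lru_cache(maxsize=None)
def load_model():
    # Stand-in (assumption) for a function decorated with st.cache_resource:
    # the body runs once, and later calls return the same object.
    return FakeModel()


def collect_model_ids(n_sessions=10):
    """Simulate n_sessions user sessions, each running on its own thread."""
    load_model()  # warm the cache so every thread gets a cache hit
    ids = []
    lock = threading.Lock()

    def session():
        model = load_model()
        with lock:  # protect the shared list while appending
            ids.append(id(model))

    threads = [threading.Thread(target=session) for _ in range(n_sessions)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return ids


model_ids = collect_model_ids()
print(len(set(model_ids)))  # 1: every thread saw the same object
```

So with a cache like this, memory use does not grow with the number of sessions unless your own code makes copies.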

Thanks for the reply!

With respect to your answer to question two: if Streamlit indeed spawns new threads for new sessions, and users share the cached objects, then it seems impossible for parallelism to take place. First because of the GIL, and second because they are all sharing the same ML model to run input through. So the different user sessions will all make use of just the single, cached and shared, instance of the ML model, and hence inference is “slow”. Am I mistaken?

Parallelism is still possible in the two ways I mentioned: spawning new processes and calling library code that releases the GIL. Any Python application can do it and Streamlit applications are not an exception.
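
As a rough illustration of the second way, the sketch below runs the same call from one worker thread and then from four. `time.sleep` is used as a stand-in for library code that releases the GIL (it only waits, it does not compute, so treat this as an assumption-laden toy); `fake_inference` and `timed_run` are hypothetical names. Whether a real model call overlaps like this depends entirely on the library doing the work.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_inference(seconds):
    # time.sleep releases the GIL while it waits; here it stands in for
    # library code that releases the GIL during heavy work.
    time.sleep(seconds)
    return seconds


def timed_run(n_calls, seconds, max_workers):
    """Run n_calls of fake_inference on a thread pool and time the batch."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(fake_inference, [seconds] * n_calls))
    return results, time.perf_counter() - start


_, serial_time = timed_run(n_calls=4, seconds=0.2, max_workers=1)
_, threaded_time = timed_run(n_calls=4, seconds=0.2, max_workers=4)
print(f"one worker: {serial_time:.2f}s, four workers: {threaded_time:.2f}s")
```

With one worker the four calls run back to back; with four workers they overlap, because the threads are not holding the GIL while they wait.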

So the different user sessions will all make use of just the single, cached and shared, instance of the ML model

Applications have access to that shared instance and can do anything with it, including making copies, if that is what worries you.
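
For example, a session that needs its own independent instance could copy the shared one. This is only a sketch with a hypothetical `FakeModel` class; whether `copy.deepcopy` works (and what it costs in RAM or GPU memory) for a real model depends on the library.

```python
import copy


class FakeModel:
    """Hypothetical stand-in for a cached model object."""

    def __init__(self):
        self.state = {"temperature": 1.0}


# What every session would receive from the cache (assumption).
shared_model = FakeModel()

# An independent copy: changes to it do not affect the shared instance,
# but it occupies its own memory.
private_model = copy.deepcopy(shared_model)
private_model.state["temperature"] = 0.5

print(shared_model.state["temperature"], private_model.state["temperature"])  # 1.0 0.5
```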

