Each time I tweet about a Streamlit application I just built, the application goes down shortly after. The tmux session console just says: killed. Until I restart the server, I cannot even connect with PuTTY to relaunch the app. It’s an AWS EC2 micro instance, which is fine most of the time, but not under short-term heavy load. Does anyone have experience solving this problem without necessarily buying a bigger server? I don’t mind if the application crashes from time to time; I would just like it to recover. Also, I have no idea what is happening: all the monitoring graphs look OK, and I suspect there are simply too many sessions, but maybe somebody knows a better way to analyze the problem. That might help me find a solution. Thanks for any insight.
Hi @godot63 -
From the sound of it, this is related to a quick burst of traffic. Are you using anything like systemd to do auto-restarting?
I’ll have to ask internally whether we have any load testing metrics or guidelines around concurrent sessions.
Hi @randyzwitch,
Thanks for your reply. It is the burst for sure. I believe one side effect of Streamlit’s caching mechanism is that apps use a lot of memory, depending on how much is being cached. There is not much you can do: more memory, less caching, or, as you suggest, live with it but make sure the application restarts. I want to go with the last option and will try systemd. I’m new to Linux, so that is another challenge for me here. Since AWS calls everything elastic, I wonder whether there is such a thing as elastic memory; I am willing to pay a bit for a few hours after a tweet, when the app is really busy. For that I would like to analyse what the requirements would be. If you come up with any answers regarding testing and metrics, I would be most grateful.
It might be the case you are running up against one of our cache design decisions, outlined here in the documentation:
It sounds like what you might be seeing is that your AWS instance is undersized for every possible combination of your app inputs, but it’s only noticeable when many users hit it. Meaning, perhaps it takes 1000 runs to overload the cache, which you don’t notice individually, but when 1000 users in one hour do it, then it causes the issue. If that’s the case, then there’s nothing to do but make a bigger AWS instance.
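One way to check whether memory really is the limit is to log how much RAM is still available while the app is under load. A bare "killed" in the console is often, though not always, the Linux out-of-memory killer at work. The following is a minimal sketch, assuming a Linux host where /proc/meminfo exists and reports a MemAvailable field; the function names are made up for illustration.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field name: integer value}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are plain counts.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Keep only well-formed "Name:   12345 kB" style lines
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def available_fraction(info: dict) -> float:
    """Fraction of total RAM the kernel estimates is still available."""
    return info["MemAvailable"] / info["MemTotal"]
```

On the server this could be called as `parse_meminfo(open("/proc/meminfo").read())` every few seconds and written to a log file. If the available fraction falls towards zero just before the app dies, a bigger instance, less caching, or added swap are the options to weigh.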
You might also try using the ttl keyword argument on st.cache(), but that’s more about managing the freshness of the result than memory management:
ttl (float or None) – The maximum number of seconds to keep an entry in the cache, or None if cache entries should not expire. The default is None.
Thanks again @randyzwitch. I now start my apps from the script below, which restarts the process if it is killed. The line [ -e stopme ] && break exits the loop if a file named stopme is found in the folder. However, I am still interested to know if there is a way to stress-test an app. I will also check with AWS whether there is a way to increase the memory on demand for a limited amount of time.
#!/bin/bash
# starts the traffic app and restarts it if crashed
while true; do
    [ -e stopme ] && break
    streamlit run app.py --server.enableCORS False --server.port 8502
done
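The same restart loop can be sketched in Python, with a growing pause between restarts so that an app that dies immediately does not spin in a tight loop. This is only an illustration of the idea, not a replacement for systemd; the function name `supervise` and its parameters are hypothetical, and the stopme file convention is taken from the script above.

```python
import subprocess
import sys
import time
from pathlib import Path


def supervise(cmd, stop_file="stopme", max_launches=None,
              base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Run cmd again and again until stop_file exists.

    max_launches optionally caps the number of launches (handy for testing).
    Returns how many times cmd was launched.
    """
    launches = 0
    delay = base_delay
    while not Path(stop_file).exists():
        if max_launches is not None and launches >= max_launches:
            break
        started = time.monotonic()
        launches += 1
        code = subprocess.call(cmd)  # blocks until the process exits
        # On POSIX a negative code means "killed by a signal", e.g. -9 for SIGKILL
        print(f"process exited with code {code}", file=sys.stderr)
        # If the app stayed up for a while, start again from the short delay
        if time.monotonic() - started > max_delay:
            delay = base_delay
        sleep(delay)
        delay = min(delay * 2, max_delay)  # back off after each restart
    return launches
```

With the app from the script above, the call would look like `supervise(["streamlit", "run", "app.py", "--server.port", "8502"])`. Creating a file named stopme in the working directory ends the loop after the current run exits.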
I’ll take this back to our engineering team and see if they have any suggestions. Off the top of my head, it might be possible to use something like Selenium to enumerate the possible input values to your app:
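Short of a full browser-driven test, a rough first step is to fire a burst of concurrent HTTP requests at the app and count the responses. The sketch below uses only the standard library; `fetch_status` and `burst` are made-up names. One important caveat: Streamlit runs the app script per browser session over a WebSocket, so plain GET requests like these exercise the web server and any proxy in front of it but are unlikely to trigger the script or fill the cache. Driving real sessions would still need a browser automation tool such as Selenium.

```python
import http.client
import urllib.error
import urllib.request
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def fetch_status(url, timeout=10.0):
    """Return the HTTP status code of one GET, or the exception name on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with 4xx/5xx
    except (OSError, http.client.HTTPException) as err:
        # Refused connections, timeouts, and malformed responses
        return type(err).__name__


def burst(url, total=100, concurrency=20, timeout=10.0):
    """Send `total` GETs using `concurrency` threads; return a Counter of outcomes."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        outcomes = pool.map(lambda _: fetch_status(url, timeout), range(total))
        return Counter(outcomes)
```

For example, `burst("http://localhost:8502/", total=200, concurrency=50)` returns a tally such as how many requests got a 200 and how many failed. Watching memory on the server while this runs gives a first impression of how the machine behaves under a burst.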
My app runs on a dedicated server behind nginx. Nobody except me is using it at the moment. When I run especially hard tests, Streamlit seems to crash and I get a 506 error from nginx. I need to restart Streamlit (streamlit run myapp) and all is fine again.
What is the best way to monitor that streamlit is up and running and to restart it if not?
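One simple approach, sketched here under assumptions, is a small watchdog that requests the app's URL and starts the app again if nothing answers. The names `is_up` and `check_and_restart` are hypothetical, and the URL and restart command are whatever applies to your setup (Streamlit's default port is 8501). A process manager such as systemd with its restart setting, as mentioned earlier in this thread, is likely the more robust option for catching crashes; a check like this one additionally catches the case where the process is alive but no longer responding.

```python
import http.client
import subprocess
import urllib.request


def is_up(url, timeout=5.0):
    """Return True if a GET to url succeeds with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts, and HTTP error statuses
        return False


def check_and_restart(url, restart_cmd, timeout=5.0):
    """Start restart_cmd if url does not answer.

    Returns True if a restart was triggered, False if the app was up.
    """
    if is_up(url, timeout):
        return False
    # start_new_session detaches the app so it outlives this watchdog script
    subprocess.Popen(restart_cmd, start_new_session=True)
    return True
```

Run from cron every minute, a call such as `check_and_restart("http://localhost:8501/", ["streamlit", "run", "myapp.py"])` would relaunch the app after a crash. Note that it does not stop a hung process first, so a stuck instance still holding the port would need to be killed separately.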