Streamlit App deployed as Azure WebApp for containers becomes unresponsive over time

Hi Streamlit

I’ve deployed my Docker container to Azure Web App for Containers here: https://awesome-streamlit.azurewebsites.net/.

It worked fine in the beginning, but after some time it has become slow and almost unresponsive. Redeploying did not help, and neither did upgrading to a higher price tier (B1).

I’ve investigated the activity via the Azure portal

But it does not tell me much. (I’m not that experienced as a DevOps engineer :-))

It’s actually responsive again, and much more responsive now that I’ve upgraded to a higher price tier.

I don’t know what the problem was. I had a lot of open tabs in Google Chrome and was opening and closing tabs while trying out the app, and I had been working with the https://awesome-streamlit.azurewebsites.net/ site a lot. So my hypothesis is that some connections are somehow not being closed.

Does that make sense at all?

Now the problem is back. It seems that the app becomes unresponsive over time.

I will investigate, but if you have any ideas they would be much appreciated.

For now I will try running the Docker container locally for a longer period of time to see if I can reproduce the problem there.

You can also run it via

docker run -it -p 80:80 --entrypoint "streamlit" marcskovmadsen/awesome-streamlit:latest run app.py

Hi @Marc

A few things that are worth investigating are:

  1. Is there a memory leak? The simplest way to verify this hypothesis would be to look at a chart of memory usage vs time.
  2. Maybe the app is getting slow because too many users are using it at the same time. Some of the sub-apps that your awesome-streamlit app executes can be very CPU intensive. So it’s very likely one user’s session can starve another’s of CPU cycles (once enough users are connected).
    One way to verify this would be to look at 3 charts: average response time, # open websocket connections (or at least # of HTTP requests per time), and CPU usage. If the three charts spike at the same locations, this hypothesis looks strong.

If the problem is (1), the solution would be to debug and find the leak.

If it’s (2), a possible solution would be to replicate your app N times and put the replicas behind a load balancer. I haven’t used Azure WebApp containers before, but the marketing-speak on its official page makes it seem like they all have load balancers, and they should scale automatically. So I assume you just need to choose the min and max number of replicas you want, and Azure will do the rest.

By the way, this is one of the many reasons we think Streamlit For Teams will be useful: Machine learning apps are very CPU/GPU-intensive — especially when compared to normal web apps. And setting up the infrastructure to serve such demanding Python apps to the world is definitely not easy :nerd_face:

Thanks @thiago

I was also wondering whether there are some settings in the config file that I have not set correctly. My config file is here

If you could take a look it would be much appreciated.

I’m actually unsure which setting disables live reload, which I guess should be disabled in production. I don’t think that is well described in the file.

Thanks.

Memory usage is fine

CPU Usage is fine

Average response time is NOT OK :slight_smile:

I cannot find a chart of WebSocket connections.

I have now found some logs from the Docker container, and I can see that exceptions like the one below are raised by Tornado.

Could that be the cause?

For example

2019-10-10T07:58:12.985172340Z
2019-10-10T07:58:12.985227741Z You can now view your Streamlit app in your browser.
2019-10-10T07:58:12.985240841Z
2019-10-10T07:58:12.985250742Z URL: http://0.0.0.0:80
2019-10-10T07:58:12.985260542Z

2019-10-10T13:08:58.001240095Z Future exception was never retrieved
2019-10-10T13:08:58.001303796Z future:
2019-10-10T13:08:58.001320297Z Traceback (most recent call last):
2019-10-10T13:08:58.001330897Z File "/usr/local/lib/python3.7/site-packages/tornado/websocket.py", line 874, in wrapper
2019-10-10T13:08:58.001345797Z yield fut
2019-10-10T13:08:58.001355898Z File "/usr/local/lib/python3.7/site-packages/tornado/gen.py", line 1133, in run
2019-10-10T13:08:58.001365798Z value = future.result()
2019-10-10T13:08:58.001375498Z tornado.iostream.StreamClosedError: Stream is closed
2019-10-10T13:08:58.001385399Z
2019-10-10T13:08:58.001395099Z During handling of the above exception, another exception occurred:
2019-10-10T13:08:58.001404899Z
2019-10-10T13:08:58.001414299Z Traceback (most recent call last):
2019-10-10T13:08:58.001424100Z File "/usr/local/lib/python3.7/site-packages/tornado/gen.py", line 1141, in run
2019-10-10T13:08:58.001434400Z yielded = self.gen.throw(*exc_info)
2019-10-10T13:08:58.001443900Z File "/usr/local/lib/python3.7/site-packages/tornado/websocket.py", line 876, in wrapper
2019-10-10T13:08:58.001453800Z raise WebSocketClosedError()
2019-10-10T13:08:58.001463401Z tornado.websocket.WebSocketClosedError

I also found this process list

It seems the CPU is running at 100%. I just don’t get that, because there is not really any traffic to the container, and the small things the app does are just showing some markdown, a table and a chart. Then there is an NLP app that takes 1 second to run on my local PC.

But every process entry from 15:30 UTC to 17:30 UTC looks like that, with 100% CPU.

I also found this chart with CPU Time. I don’t know how to interpret it, though.

and the CPU usage

But again, something is going wrong according to the logs

2019-10-10T15:47:58.335547527Z Future exception was never retrieved
2019-10-10T15:47:58.335609329Z future:
2019-10-10T15:47:58.335626229Z Traceback (most recent call last):
2019-10-10T15:47:58.335636929Z File "/usr/local/lib/python3.7/site-packages/tornado/websocket.py", line 874, in wrapper
2019-10-10T15:47:58.335647430Z yield fut
2019-10-10T15:47:58.335657430Z File "/usr/local/lib/python3.7/site-packages/tornado/gen.py", line 1133, in run
2019-10-10T15:47:58.335667730Z value = future.result()
2019-10-10T15:47:58.335677630Z tornado.iostream.StreamClosedError: Stream is closed
2019-10-10T15:47:58.335687631Z
2019-10-10T15:47:58.335697131Z During handling of the above exception, another exception occurred:
2019-10-10T15:47:58.335706931Z
2019-10-10T15:47:58.335716431Z Traceback (most recent call last):
2019-10-10T15:47:58.335725932Z File "/usr/local/lib/python3.7/site-packages/tornado/gen.py", line 1141, in run
2019-10-10T15:47:58.335735932Z yielded = self.gen.throw(*exc_info)
2019-10-10T15:47:58.335745632Z File "/usr/local/lib/python3.7/site-packages/tornado/websocket.py", line 876, in wrapper
2019-10-10T15:47:58.335755832Z raise WebSocketClosedError()
2019-10-10T15:47:58.335765733Z tornado.websocket.WebSocketClosedError

2019-10-10T15:48:37.278627607Z Future exception was never retrieved
2019-10-10T15:48:37.278677208Z future:
2019-10-10T15:48:37.278701209Z Traceback (most recent call last):
2019-10-10T15:48:37.278712109Z File "/usr/local/lib/python3.7/site-packages/tornado/websocket.py", line 874, in wrapper
2019-10-10T15:48:37.278722209Z yield fut
2019-10-10T15:48:37.278731810Z File "/usr/local/lib/python3.7/site-packages/tornado/gen.py", line 1133, in run
2019-10-10T15:48:37.278741810Z value = future.result()
2019-10-10T15:48:37.278751310Z tornado.iostream.StreamClosedError: Stream is closed
2019-10-10T15:48:37.278760710Z
2019-10-10T15:48:37.278769811Z During handling of the above exception, another exception occurred:
2019-10-10T15:48:37.278779111Z
2019-10-10T15:48:37.278788111Z Traceback (most recent call last):
2019-10-10T15:48:37.278797111Z File "/usr/local/lib/python3.7/site-packages/tornado/gen.py", line 1141, in run
2019-10-10T15:48:37.278806612Z yielded = self.gen.throw(*exc_info)
2019-10-10T15:48:37.278815712Z File "/usr/local/lib/python3.7/site-packages/tornado/websocket.py", line 876, in wrapper
2019-10-10T15:48:37.278827412Z raise WebSocketClosedError()
2019-10-10T15:48:37.278836612Z tornado.websocket.WebSocketClosedError

2019-10-10T15:49:56.350452137Z Future exception was never retrieved
2019-10-10T15:49:56.350500139Z future:
2019-10-10T15:49:56.350513939Z Traceback (most recent call last):
2019-10-10T15:49:56.350524239Z File "/usr/local/lib/python3.7/site-packages/tornado/websocket.py", line 874, in wrapper
2019-10-10T15:49:56.350534539Z yield fut
2019-10-10T15:49:56.350544040Z File "/usr/local/lib/python3.7/site-packages/tornado/gen.py", line 1133, in run
2019-10-10T15:49:56.350554240Z value = future.result()
2019-10-10T15:49:56.350563740Z tornado.iostream.StreamClosedError: Stream is closed
2019-10-10T15:49:56.350573540Z
2019-10-10T15:49:56.350583041Z During handling of the above exception, another exception occurred:
2019-10-10T15:49:56.350592741Z
2019-10-10T15:49:56.350601941Z Traceback (most recent call last):
2019-10-10T15:49:56.350611341Z File "/usr/local/lib/python3.7/site-packages/tornado/gen.py", line 1141, in run
2019-10-10T15:49:56.350621042Z yielded = self.gen.throw(*exc_info)
2019-10-10T15:49:56.350630542Z File "/usr/local/lib/python3.7/site-packages/tornado/websocket.py", line 876, in wrapper
2019-10-10T15:49:56.350640242Z raise WebSocketClosedError()
2019-10-10T15:49:56.350649642Z tornado.websocket.WebSocketClosedError

2019-10-10T16:06:03.641717003Z Future exception was never retrieved
2019-10-10T16:06:03.641765704Z future:
2019-10-10T16:06:03.641779005Z Traceback (most recent call last):
2019-10-10T16:06:03.641805405Z File "/usr/local/lib/python3.7/site-packages/tornado/websocket.py", line 874, in wrapper
2019-10-10T16:06:03.641816706Z yield fut
2019-10-10T16:06:03.641826106Z File "/usr/local/lib/python3.7/site-packages/tornado/gen.py", line 1133, in run
2019-10-10T16:06:03.641835906Z value = future.result()
2019-10-10T16:06:03.641845306Z tornado.iostream.StreamClosedError: Stream is closed
2019-10-10T16:06:03.641854707Z
2019-10-10T16:06:03.641863907Z During handling of the above exception, another exception occurred:
2019-10-10T16:06:03.641873307Z
2019-10-10T16:06:03.641882307Z Traceback (most recent call last):
2019-10-10T16:06:03.641891608Z File "/usr/local/lib/python3.7/site-packages/tornado/gen.py", line 1141, in run
2019-10-10T16:06:03.641901008Z yielded = self.gen.throw(*exc_info)
2019-10-10T16:06:03.641910008Z File "/usr/local/lib/python3.7/site-packages/tornado/websocket.py", line 876, in wrapper
2019-10-10T16:06:03.641919408Z raise WebSocketClosedError()
2019-10-10T16:06:03.642059212Z tornado.websocket.WebSocketClosedError
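
To get a feel for how often the exception occurs, the log timestamps can be tallied per hour. The following is a small sketch that uses only the first line of each of the five tracebacks above as sample input; the function name `errors_per_hour` is made up for illustration, and a real run would read the lines from the container log file instead.

```python
from collections import Counter

MARKER = "Future exception was never retrieved"


def errors_per_hour(log_lines, marker=MARKER):
    """Count log lines containing `marker`, grouped by the hour in their timestamp."""
    counts = Counter()
    for line in log_lines:
        # Each line starts with an ISO timestamp, then a space, then the message.
        timestamp, _, message = line.partition(" ")
        if marker in message:
            counts[timestamp[:13]] += 1  # keep "YYYY-MM-DDTHH"
    return counts


sample = [
    "2019-10-10T13:08:58.001240095Z Future exception was never retrieved",
    "2019-10-10T15:47:58.335547527Z Future exception was never retrieved",
    "2019-10-10T15:48:37.278627607Z Future exception was never retrieved",
    "2019-10-10T15:49:56.350452137Z Future exception was never retrieved",
    "2019-10-10T16:06:03.641717003Z Future exception was never retrieved",
]

for hour, count in sorted(errors_per_hour(sample).items()):
    print(hour, count)
```

Lining these counts up against the response-time chart would show whether the errors cluster around the slow periods or are just noise from browsers closing tabs.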

I’ve run the Docker container locally for 30 minutes with one connection open in Chrome. Right after the app starts in Docker, it takes a fraction of a second to navigate from Home to the Vision page.

After the Docker container has run for 30 minutes, it takes more than 8 seconds to navigate from Home to the Vision page.

Looking at the Task Manager, I don’t see any CPU usage explosion. It’s around 20% (of my laptop’s CPU) all the time.

Ooooh that’s really interesting.

I created a bug so someone here can investigate. Can you double-check that the repro steps listed in the bug are correct?

FYI @thiago. I’ve added my comments to the repro steps.

I’ve created an issue on the awesome-streamlit repo

FYI @thiago

Problems have now been solved and the performance is awesome!

Check it out at https://awesome-streamlit.org/

That’s amazing!

Can you share what you did to fix it? Was there an issue with the app’s code, or a Streamlit issue, or something in between?

I made a lot of changes, but I believe the essential parts are:

  • I turned Always On on in Azure
  • I created a script that runs inside my docker container to ping awesome-streamlit.org every 5 minutes
  • I removed the custom importlib.reload functionality I was using (for development) to reload deeply nested modules. See the issue here https://github.com/streamlit/streamlit/issues/366
  • I set folderWatchBlacklist = [''] in the config.toml file.
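
The keep-alive script from the list above could look roughly like the following. It is a minimal sketch rather than the script actually used: the function names, the 10-second timeout and the injectable `opener` and `sleep` parameters (added so the loop can be tried without network access) are assumptions made for illustration.

```python
import time
import urllib.request


def ping(url, timeout=10, opener=urllib.request.urlopen):
    """Request `url` once; return the HTTP status, or None if the request failed."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status
    except OSError:  # URLError, HTTPError and socket timeouts are OSError subclasses
        return None


def keep_alive(url, interval_seconds=300, rounds=None, sleep=time.sleep,
               opener=urllib.request.urlopen):
    """Ping `url` every `interval_seconds`; loop forever when `rounds` is None."""
    completed = 0
    while rounds is None or completed < rounds:
        status = ping(url, opener=opener)
        print(f"ping {url} -> {status}")
        completed += 1
        sleep(interval_seconds)
    return completed


# Example (runs until interrupted):
# keep_alive("https://awesome-streamlit.org/")
```

A failed request is reported as None instead of raising, so a temporary outage does not stop the loop.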

In order to investigate the issue I had to add detailed logging to my Docker container, especially timestamping every line written to the log and logging top resource usage every minute. I have a request for a timestamp in the loggingFormatter here: https://github.com/streamlit/streamlit/issues/447.
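
A rough sketch of that kind of logging, using only the Python standard library, is shown below. It assumes that "top" refers to the `top` command being available in the container (run here in batch mode as `top -b -n 1`); the function names, the 15-line cap and the injectable `command` and `sleep` parameters are assumptions for illustration and not the setup that was actually used.

```python
import logging
import subprocess
import time


def make_logger(name="container-monitor", stream=None):
    """Return a logger that prefixes every line with a timestamp."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate output if called twice
        handler = logging.StreamHandler(stream)  # stream=None means stderr
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    logger.propagate = False
    return logger


def log_command_output(logger, command=("top", "-b", "-n", "1"), max_lines=15):
    """Run `command` once and log the first `max_lines` lines of its output."""
    result = subprocess.run(command, capture_output=True, text=True, timeout=30)
    for line in result.stdout.splitlines()[:max_lines]:
        logger.info("%s", line)


def monitor(logger, interval_seconds=60, rounds=None, sleep=time.sleep, **kwargs):
    """Log resource usage every `interval_seconds`; loop forever if `rounds` is None."""
    completed = 0
    while rounds is None or completed < rounds:
        log_command_output(logger, **kwargs)
        completed += 1
        sleep(interval_seconds)
    return completed


# Example (runs until interrupted):
# monitor(make_logger())
```

Because every line carries a timestamp, the resource snapshots can be lined up with the app's own log lines when looking for the moment the slowdown starts.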

I believe item 1 and either item 3 or 4 did the trick, but I have not verified this by reproducing the problem. I’m just happy that it works after so many investigations.

I’ve recorded the full investigation here
