Frustrated: Streamlit frontend disconnects from apps with long-running processes

I have a Streamlit app (built in Docker, deployed to Heroku) that formulates an optimization (a linear program) and then sends the schedule to a Gurobi server. However, the process takes a while, so for a 20-year optimization at the hourly level I chunk it into years (8,760 data points per year, × 20 years).

Basically, once the inputs are set by the user and saved as state variables à la Joel Grus’ Game State hacks, the process is:

  1. Formulate inputs and constraints to the optimization for 2021 →
  2. send to Gurobi →
  3. get back the optimal solution from Gurobi as a dataframe →
  4. display on the Streamlit app: “Optimal Solution for 2021 Complete! Moving on to 2022…”
  5. Repeat for the next iteration, for 20 iterations.
  6. Concatenate all 20 dataframes from the process and display some summary results plus download and “Send to database” buttons for the results.
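In code, the year-by-year loop looks roughly like this. This is a minimal sketch with placeholder names — `solve_year` stands in for the real formulate-and-send-to-Gurobi round trip, and plain dicts stand in for the DataFrames the real app uses:

```python
# Sketch of the year-by-year loop described above. All names here are
# placeholders: solve_year() stands in for "formulate LP, send to Gurobi,
# get back results"; the real app collects pandas DataFrames instead.

HOURS_PER_YEAR = 8760

def solve_year(year):
    """Stand-in for the formulate -> Gurobi -> results round trip (~1 min)."""
    return [{"year": year, "hour": h, "value": 0.0} for h in range(HOURS_PER_YEAR)]

all_results = []
for year in range(2021, 2021 + 20):
    yearly = solve_year(year)
    all_results.extend(yearly)  # real code: collect DataFrames, pd.concat at the end
    # st.write(f"Optimal Solution for {year} Complete! Moving on to {year + 1}...")

print(len(all_results))  # 175200 rows = 8760 hours x 20 years
```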

Each iteration takes about a minute. However, I can only make it to about the 10th iteration (sometimes fewer, sometimes more) before the Streamlit frontend “disconnects” from the app’s running loop and resets all of my state variables, as if I had just reloaded the page. The background process still continues (it keeps looping through the iterations - it doesn’t know Streamlit got disconnected), but there is no way to reconnect to it. Without Streamlit, this process runs fine locally with scripts, but with Streamlit the frontend is so unreliable as to be unusable for this app. Nothing is more frustrating than getting nearly to the end of a 20-minute process only to have it reset, forcing the user to start over.

I’m really not sure what to try at this point, other than building something much more robust, like a Flask app that sends these jobs to a job container, which then sends back the result when done. Running the optimization on the same computer as the frontend seems like a fool’s errand.
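For what it’s worth, that decoupling idea doesn’t have to be a full Flask-plus-job-container build to start with. A minimal sketch of the pattern — a background executor doing the work while the frontend only polls status — looks like this (all names here are illustrative, not from the real app):

```python
# Sketch: run the long solve in a background worker and have the frontend
# poll its status, so a dropped connection doesn't kill the job.
import time
from concurrent.futures import ThreadPoolExecutor

def long_solve(n_years):
    """Stand-in for the 20-year optimization (~1 minute per year in reality)."""
    for _ in range(n_years):
        time.sleep(0.01)  # real code: formulate LP, call Gurobi
    return "all years solved"

executor = ThreadPoolExecutor(max_workers=1)
future = executor.submit(long_solve, 20)

# The frontend only checks status; a reconnecting page could look up the
# same job by ID instead of restarting the whole computation.
while not future.done():
    time.sleep(0.05)

print(future.result())  # "all years solved"
```

The same shape works with a real job queue (Celery, RQ, etc.) once the worker lives in its own container: submit, store a job ID, poll, fetch results.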

Anyone have tips to keep a streamlit frontend connected to a long-running process without resetting?

Hi @Kladar

That sounds super frustrating. Sorry about that :slightly_frowning_face:

In my experience those symptoms sound a lot like an unexpected HTTP timeout on the host side. The only thing that gives me pause here is that the default HTTP timeout is ~30s, so usually that’s how long your app would take between disconnect/reconnect cycles. But in your case it sounds like these cycles are much longer…

Either way, to help debug it would be great to get a couple of things:

  1. If you look at your server logs, do you see any errors?
  2. On the browser-side, if you open your browser’s dev tools, do you see anything either on the JS console or in the Network tab?

If you don’t know what to look for, or just want a second pair of eyes, feel free to post the logs from (1) and a HAR file from (2) here so I can help debug.

3 Likes

Thanks for the response, and sorry about the late reply. I’ve recreated the error and made a screen recording, running the app locally to avoid the confounding variable of the Heroku-based deployment, which I played up too much in the initial post. It’s actually more likely to fail running locally, though I’m not sure why. For a 20-year run (~10 minutes), the app resets itself about 60% of the time somewhere along in the process (I had to record a couple of times to show a failure, as the first couple of examples made it through the analysis successfully). Link to recording

I have a HAR file for a later failure (not the one in the video) but the forum won’t let me attach it. Should I convert it to some other file type? We don’t need the Heroku logs since I ran it locally. And now that I think about it, it could be a memory thing - since Python is pretty bad at memory management and Chrome is a RAM hog, perhaps running it several times successfully caused it to fail? Though I Ctrl+C stopped the Streamlit app and started it again each time, and the memory usage on my computer seemed pretty static. I’m at a loss.

@thiago

Hey @Kladar , thanks for the video!

After watching the video I tried reproducing the issue locally using a toy example, but I’m just not able to :confused:

This is the toy example I tried:

import streamlit as st
import time
import datetime

"""
# Long-running app example
"""

start_time = datetime.datetime.now()

for i in range(20):
  "---"
  "Loop number:", i
  "Elapsed time:", datetime.datetime.now() - start_time

  # Tried with
  # time.sleep(10)
  # ...but can't repro

  # Tried with
  # time.sleep(60)
  # ...but can't repro

At least this rules out a few hypotheses I had about websocket keepalive failures…

To help debug:

  1. Can you put the HAR file in this private Google Drive folder I shared with you?
  2. Can you repro the bug with streamlit run --logger.level=debug script_name.py 2> bug.log and upload the log from bug.log to Google Drive?
  3. Can you try using a Chrome profile that has no Chrome extensions installed?
1 Like

Thanks for the help. I’ve uploaded the .HAR file and the bug.log from the most recent run to that folder (at least I believe it’s that folder - let me know if you can’t see them). I ran it again locally in a chrome profile with no extensions and it failed at about the same point as in the video.

I believe the failure happens around line 2977 of the log. Around line 2958 you can see “optimization is solved with status 2”, which is good - that means the year completed (Year 1 of 20) and the results are being saved as a dataframe (see the pandas warning about setting a value on a slice of a df, blah blah, etc.). Then on line 2977 we get “Server state: State.ONE_OR_MORE_BROWSERS_CONNECTED → State.NO_BROWSERS_CONNECTED”, which I believe is our issue.

I will attempt to run it to completion and post what a successful log file looks like as well, as there is a lot going on in that log.

(Note, neither here nor there, but I also upgraded to Streamlit 0.80.0 before that run, and the log files attest to me being excited about using secrets! :upside_down_face:)

Thanks, this is super helpful!

I believe the failure happens around line 2977 of the log.

Yup, that’s right.

The logs around that line make me think that the problem is on the browser side. We basically see the websocket disconnecting the same way it would if the browser had terminated the connection normally. Very odd!

Unfortunately, the data in the HAR file starts a few seconds after the websocket first connects, which means the websocket connection (and, more importantly, its disconnection!) isn’t present in the data :frowning_face:

Can you try capturing a HAR file using the method below?

  1. Start your Streamlit server streamlit run foo.py
  2. Open a browser at localhost:8501
  3. Open the dev tools in the browser > Network > “preserve log”
    (screenshot: enabling the “preserve log” checkbox in the Network tab)
  4. Reload the browser page with Ctrl-R (or Cmd-R)
  5. Now save the HAR file and upload to our shared Drive link

Also: can you copy/paste into a file whatever you see in the Console tab in the dev tools?

If none of this bears fruit, our next best step is to set up a video call and see what’s up!

1 Like

Okay, @thiago, I’ve hopefully captured everything in a couple of newly uploaded .HAR files from similarly failed runs.
There is one recorded from just after the app started and one from a full refresh; both gave similar errors when the app reset itself.

In failed_run_attempt2, there is a 101 code at ~line 71, following 304s at lines 61 and 70, all of which seemed to be generated just as the app reset itself.

The file failed_run_from_full_refresh is similar, but starts a bit “further back,” from a Cmd-R refresh point. We see the 304 and 101 code combo around line ~131. Hopefully that helps disentangle what is normal process vs. what is failure.

And here is the full dev tools console capture for failed_run_attempt2 (also uploaded to the shared drive). Thanks for working through this with me!

Ok, I’m still stumped. I’ll reach out by email and try to set up a live debug with you.

(Will post any findings here for anyone who’s watching!)

2 Likes

Awesome, thanks for the help.

Update: we still don’t know what’s going on, but we were able to find a possible solution.

In Streamlit’s server/server.py, change this:

    "websocket_ping_interval": 20,  # Ping every 20s to keep WS alive.

to this:

    "websocket_ping_interval": 1,  # Ping every 1s to keep WS alive.

In @Kladar’s case, this appears to solve the problem, and I don’t expect it to have any noticeable negative impact.
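For anyone wondering why a more frequent ping would matter: if something on the path (a proxy, the OS, the browser) silently drops connections that look idle, the pings act as a keepalive, and a short interval also tolerates lost pings. A toy back-of-the-envelope check — the 30s idle timeout here is a hypothetical number, not something we measured:

```python
def gap_after_drops(ping_interval_s, consecutive_drops):
    """Seconds between two keepalive pings that actually arrive, when
    `consecutive_drops` pings in between are lost."""
    return ping_interval_s * (consecutive_drops + 1)

IDLE_TIMEOUT = 30  # hypothetical idle timeout somewhere on the path

# With a 20s interval, losing a single ping already exceeds the timeout:
print(gap_after_drops(20, 1))  # 40 -> over 30s, connection looks dead
# With a 1s interval, even 20 lost pings in a row stay well under it:
print(gap_after_drops(1, 20))  # 21 -> under 30s, connection survives
```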

I’ll talk to our eng team to see if it makes sense to make this the default, as it’s very possible that I’m missing something obvious :laughing:

4 Likes

@thiago curious for status on this?

I see the 0.82 garbage collection improvements, which are much appreciated, though I assume that is not the fix here.

@thiago fortnightly bump

Hello @thiago !
I’m having exactly the same issue as @Kladar. I have a model that takes almost an hour to run, and sometimes the frontend just disconnects from the backend process: the console shows it’s still running, but the front is reset to default. Is there a friendlier solution for this already?

Thanks in advance!

Hey @Kladar and @avila196 - I’m taking a look at this now. It’s obviously a trivial change; the question is, will it have other consequences that we’re not considering?

I’ll do a bit more research - but the plan right now is to merge this change in the next 2 weeks.

1 Like

Update here?

Hey @Kladar - I made the change in Merge pull request #3464 from tconkling/tim/DecreaseWebSocketPing · streamlit/streamlit@20a636f · GitHub

And it landed in Streamlit release 0.84.0

2 Likes

Amazing, thanks! I will check it out.