BLOCKING BUG - memory and ephemeral storage leak issue in a Kubernetes deployment

Hi team,

I would love your help: I'm having a hard time deploying a Streamlit + Plotly dashboard on Kubernetes and keeping it stable and performant (even though the data is small).

Summary:

  • Issue: Memory and ephemeral storage leaks are causing pod evictions and restarts in a Kubernetes deployment.
  • Symptoms: Memory usage is stable at around 250 MB but spikes suddenly to 10 GB after multiple user sessions, triggering pod restarts.
  • Current Workaround: Running multiple replicas (a ReplicaSet) to keep the dashboard available; this is not a long-term solution.
  • Logs: There are no specific error logs, only indications of pod eviction and restarts.

Deployment Details

  • Caching Strategy: Using cache_resource to share heavy objects across sessions, with max_entries set to 2.
  • Memory Profiling: Added memory profiling across the application, but I have been unable to reproduce the issue locally.
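As a side note on what max_entries does: it bounds the cache roughly like a fixed-size LRU cache, so older entries are evicted once the limit is reached. A plain-Python analogy (this is an analogy, not Streamlit's actual implementation):

```python
from functools import lru_cache

# Analogy: max_entries=2 on st.cache_resource/st.cache_data bounds the
# cache much like an LRU cache with maxsize=2 -- the oldest entry is
# evicted once the limit is hit, so cached memory stays bounded.

@lru_cache(maxsize=2)  # analogous to max_entries=2
def load_heavy_object(key: str) -> list:
    # Stand-in for an expensive load (e.g. a large dataframe).
    return [key] * 1_000_000

load_heavy_object("a")
load_heavy_object("b")
load_heavy_object("c")  # evicts "a"; the cache never holds more than 2
print(load_heavy_object.cache_info().currsize)  # -> 2
```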

Streamlit Version: 1.30.0
Poetry dependency management, pyproject.toml:

[tool.poetry.dependencies]
python = ">=3.9,<3.9.7 || >3.9.7,<4.0"
click = "^8.1.7"
python-box = "^7.1.1"
pyyaml = "^6.0.1"
tqdm = "^4.66.1"
plotly = "^5.18.0"
numpy = "^1.26.3"
pandas = "^2.1.4"
snowflake-connector-python = "^3.6.0"
streamlit = "^1.30.0"
streamlit-authenticator = "^0.2.3"
aiohttp = "^3.9.1"
msal = "^1.26.0"
requests = "^2.31.0"
python-pptx = "^0.6.23"
matplotlib = "^3.8.2"
numerize = "^0.12"
streamlit-aggrid = "0.3.4.post3"
google-auth = "^2.26.2"
google-cloud = "^0.34.0"
python-dotenv = "^1.0.0"
google-cloud-bigquery = "^3.16.0"
db-dtypes = "^1.2.0"
memory-profiler = "^0.61.0"
openpyxl = "3.1.2"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"

Hey @Eden_B,

We have a blog post here that goes over how to find the cause of a memory leak using tracemalloc – have you tried that?
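In case it helps, the core of the tracemalloc approach is a snapshot diff: take a snapshot before and after the suspected leak and compare them by source line. A minimal, self-contained sketch (the list allocation is just a stand-in for a script rerun):

```python
import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()

# Stand-in for work done during a script rerun that allocates memory.
leaked = [bytes(1024) for _ in range(10_000)]

after = tracemalloc.take_snapshot()

# Diff the snapshots, grouped by the source line that allocated the memory.
stats = after.compare_to(before, "lineno")
for stat in stats[:5]:
    print(stat)  # largest allocation deltas first
```

If a line keeps growing across reruns, that is a good leak candidate.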

Hi @Caroline, I went through this document. I see the behavior of a memory leak in step 1, but the traces do not seem to indicate an issue. I see references to buttons and widgets that are not released, but I doubt these lingering objects are the source of the pod evictions. Here's an example of the output after clicking through five of the pages:

After 5 runs the following traces were collected.

{
  "/Users/brownian/Library/Caches/pypoetry/virtualenvs/dashboards-wd3bjAKn-py3.9/lib/python3.9/site-packages/streamlit/config.py:131": 5,
  "/Users/brownian/Library/Caches/pypoetry/virtualenvs/dashboards-wd3bjAKn-py3.9/lib/python3.9/site-packages/streamlit/cursor.py:106": 5,
  "/Users/brownian/Library/Caches/pypoetry/virtualenvs/dashboards-wd3bjAKn-py3.9/lib/python3.9/site-packages/streamlit/elements/widgets/button.py:476": 5,
  "/Users/brownian/Library/Caches/pypoetry/virtualenvs/dashboards-wd3bjAKn-py3.9/lib/python3.9/site-packages/streamlit/elements/widgets/button.py:485": 5,
  "/Users/brownian/Library/Caches/pypoetry/virtualenvs/dashboards-wd3bjAKn-py3.9/lib/python3.9/site-packages/streamlit/elements/widgets/button.py:486": 5,
  "/Users/brownian/Library/Caches/pypoetry/virtualenvs/dashboards-wd3bjAKn-py3.9/lib/python3.9/site-packages/streamlit/elements/widgets/button.py:607": 5,
  "/Users/brownian/Library/Caches/pypoetry/virtualenvs/dashboards-wd3bjAKn-py3.9/lib/python3.9/site-packages/streamlit/runtime/media_file_manager.py:225": 5,
  "/Users/brownian/Library/Caches/pypoetry/virtualenvs/dashboards-wd3bjAKn-py3.9/lib/python3.9/site-packages/streamlit/runtime/media_file_manager.py:45": 5,
  "/Users/brownian/Library/Caches/pypoetry/virtualenvs/dashboards-wd3bjAKn-py3.9/lib/python3.9/site-packages/streamlit/runtime/runtime.py:639": 5,
  "/Users/brownian/Library/Caches/pypoetry/virtualenvs/dashboards-wd3bjAKn-py3.9/lib/python3.9/site-packages/streamlit/runtime/state/widgets.py:152": 5,
  "/Users/brownian/Library/Caches/pypoetry/virtualenvs/dashboards-wd3bjAKn-py3.9/lib/python3.9/site-packages/streamlit/runtime/state/widgets.py:153": 5,
  "/Users/brownian/Library/Caches/pypoetry/virtualenvs/dashboards-wd3bjAKn-py3.9/lib/python3.9/site-packages/streamlit/watcher/polling_path_watcher.py:83": 5
}

For my understanding, what does Streamlit use ephemeral/disk storage for? Are cached and pickled objects stored on disk or in memory? I am caching a fairly large amount of data transformations; however, my max_entries is always set to 1 to avoid collecting and saving duplicates.

@Ian_B To lower the memory usage, there are two quick things you could try:

  1. Use this wheel file, which fixes a memory leak. The fix landed last week and will be released with 1.32.
  2. Deactivate the backend storage of forward messages in config.toml (requires at least 1.30):
    [global] 
    storeCachedForwardMessagesInMemory = false
    
    This is a bit of an old artifact that most likely doesn't serve any purpose anymore, and it's a bit of a problem if there is a spike in user sessions.

These two changes might help with the issue, but there are other memory inefficiencies we are currently investigating. To give you more specific help, it would be great if you could tell us which of the following features your app uses:
st.file_uploader, st.image, st.video, st.audio, st.pyplot, st.download_button, long-running sessions using st.rerun, large dataframes/charts, or any of the caching decorators?

For my understanding, what does Streamlit use ephemeral/disk storage for? Are cached and pickled objects stored on disk or in memory?

I think everything in a normal Streamlit setup is stored in memory. However, you can configure certain aspects to be stored on disk (e.g. via st.cache_data(persist="disk")). But you are probably not doing that, are you?
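If you ever do enable disk persistence and want a rough sense of the on-disk footprint, the persisted size is roughly the pickled size of the cached value, which you can estimate yourself (the dict below is a hypothetical stand-in for your cached transformation result):

```python
import pickle

# Hypothetical cached value -- substitute whatever your transformation
# actually returns (e.g. a large dataframe).
cached_value = {"rows": list(range(100_000))}

# The on-disk footprint of a persisted cache entry is roughly the size
# of the pickled payload.
payload = pickle.dumps(cached_value, protocol=pickle.HIGHEST_PROTOCOL)
print(f"pickled size: {len(payload) / 1024:.0f} KiB")
```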


Also, if you want to take a deeper look into memory usage, there are two quite helpful libraries: guppy3 and objgraph.

E.g. you could add this to your app, which gives you quite good ways to do memory debugging at runtime:

import gc
import random
import resource

import objgraph
import streamlit as st

if st.button("Show heap"):
    import psutil
    from guppy import hpy
    gc.collect()
    heap = hpy().heap()
    # This shows the actual heap currently used by the app:
    st.text(heap)
    process = psutil.Process()
    # This shows the reserved memory, but this doesn't mean it uses this amount of memory currently:
    st.write("RSS memory (bytes):", process.memory_info().rss)
    # This shows the peak size used during app lifetime:
    st.write("Max RSS memory (bytes):", resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)

if st.button("Show most common object types"):
    gc.collect()
    # Shows the top 100 object types that are currently in the memory:
    st.dataframe(objgraph.most_common_types(100))

if st.button("Show object type growth"):
    # growth since the last execution of this code:
    gc.collect()
    st.dataframe(objgraph.growth())

obj_type = st.text_input("Object type", value=None)
if st.button("Explore type"):
    st.write("Leaking obj from type", objgraph.count(obj_type, objgraph.get_leaking_objects()))
    # Get the backref chain for a random object from this type:
    st.write("Backref chain")
    st.write(objgraph.find_backref_chain(
        random.choice(objgraph.by_type(obj_type)), 
        objgraph.is_proper_module))

object_rank = st.number_input("The n-largest object", min_value=0, value=0)
if st.button("Show n-largest object path"):
    from guppy import hpy
    heap = hpy().heap()
    obj = heap.byid[object_rank]
    st.write(f"Object {object_rank}: ", "Path:", obj.sp, "Info:", obj.stat)