How to properly optimize CPU and memory usage

imaginary · February 5, 2021, 4:52pm

I built an app using streamlit, and it was ridiculously easy to do the initial setup. However, I have been scrambling to optimize the application so that it can be used by many users at once while deployed on a fairly small EC2 instance on AWS (2 CPUs and 2 GB of memory).

Loading the data

I’ve found that I need to cache my basic load_data() function in order to prevent the app from reading the data each time there’s a request.

The resulting dataset is around 20MB after optimizing dtypes. However, a single user session results in about 300MB of memory usage, and each additional session increases it further, by varying amounts above 20MB.
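One way to see how much memory a single step allocates is Python's built-in tracemalloc module. The following is only a minimal sketch: build_rows is a hypothetical stand-in for load_data(), and allocations made outside Python's allocator (for example by some C extensions) may not show up in the numbers.

```python
import tracemalloc


def measure_peak_bytes(func, *args, **kwargs):
    """Run func and return (result, peak bytes traced while it ran)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def build_rows(n):
    # Hypothetical stand-in for load_data(): builds n small records.
    return [{"id": i, "score": i * 0.5} for i in range(n)]


rows, peak = measure_peak_bytes(build_rows, 10_000)
print(f"{len(rows)} rows, peak of about {peak / 1e6:.1f} MB traced")
```

Wrapping the load, filter, and plot steps one at a time this way can help narrow down which of them accounts for the jump from 20MB to 300MB.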

Filtering the data

I have stopped caching a filtering function which returns a fairly large object, in order to reduce memory usage. The function doesn’t take too long to compute, so I don’t mind the slight slowdown.

Displaying the data

Once I have my data and have filtered it, I generate 4 plots. By this point, the dataset can contain anywhere from 100k to just 2 data points depending on the filtering done. I am using matplotlib to generate the plots.

The issue

It feels like once 3 people start using the app, my server can’t handle the load and streamlit completely crashes (usually spitting out a “segmentation fault (core dumped)” error). CPU usage goes to 100% and memory usage increases significantly even with just a few users (sometimes up to 90%).

Proposed solutions

Here are some things I have in mind in order to optimize the performance, please do let me know if these are good ideas or there are better ways to optimize the app:

Store data in a DB and simply cache the connection object and fetch the data according to the filters
Stop using matplotlib and use something that has JavaScript render the plots in the browser
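The first idea could look roughly like the sketch below. It uses the standard library's sqlite3 and functools.lru_cache as stand-ins for a real database and for Streamlit's caching; the table name, columns, and rows are all hypothetical and not taken from the linked repository.

```python
import sqlite3
from functools import lru_cache


@lru_cache(maxsize=1)
def get_connection(db_path=":memory:"):
    # Created once per process and reused afterwards.
    # check_same_thread=False allows use from other threads; concurrent
    # writes would still need to be serialized by the caller.
    return sqlite3.connect(db_path, check_same_thread=False)


def setup_demo(conn):
    # Hypothetical schema and rows, for illustration only.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results "
        "(institution TEXT, season TEXT, decision TEXT, gpa REAL)"
    )
    conn.execute("DELETE FROM results")
    conn.executemany(
        "INSERT INTO results VALUES (?, ?, ?, ?)",
        [
            ("A University", "F21", "Accepted", 3.8),
            ("B University", "F21", "Rejected", 3.4),
            ("C University", "F20", "Accepted", 3.6),
            ("D University", "F21", "Accepted", 3.9),
        ],
    )
    conn.commit()


def fetch_filtered(conn, season, decision):
    # Parameterized query: only the matching rows are loaded into Python.
    cur = conn.execute(
        "SELECT institution, gpa FROM results "
        "WHERE season = ? AND decision = ? ORDER BY institution",
        (season, decision),
    )
    return cur.fetchall()


conn = get_connection()
setup_demo(conn)
print(fetch_filtered(conn, "F21", "Accepted"))
```

The point of this design is that the full dataset never sits in each session's memory; only the filtered rows do.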

Link to the code: GitHub - jjdelvalle/grad_stats: GradCafe stats generator

TL;DR if my server can’t handle a 20MB dataframe and matplotlib, would it be a better idea to just store the dataset in a DB and fetch results from streamlit?

randyzwitch · February 8, 2021, 5:25pm

Hi @imaginary, welcome to the Streamlit community!

I think the two quoted ideas above are good ideas to try, and you could also try to use pyarrow and memory mapping to avoid having to cache the data load.
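To illustrate the memory-mapping idea without depending on pyarrow, here is a minimal sketch using only the standard library's mmap module as a stand-in. The file name and the raw float layout are assumptions made for the example; pyarrow has its own file formats and API, which are not shown here.

```python
import mmap
import os
import tempfile
from array import array


def write_column(path, values):
    # Store the values as raw 8-byte floats, with no header.
    with open(path, "wb") as f:
        array("d", values).tofile(f)


def column_sum(path):
    # Map the file read-only. The OS pages data in on demand instead of
    # the process reading the whole file into its own memory up front.
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Both views are released before the mapping is closed.
            with memoryview(mm) as raw, raw.cast("d") as view:
                return sum(view)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "scores.bin")  # hypothetical file name
    write_column(path, [1.5, 2.5, 4.0])
    total = column_sum(path)

print(total)
```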

Ultimately, a problem like this one takes a bit of trial and error…good luck!

Best,
Randy

imaginary · April 21, 2021, 10:53pm

As a quick update (in case anyone runs into this thread via Google):

Making sure you’re using the Renderer lock got rid of the problem for me completely. Memory usage was still pretty high, but now the app wasn’t completely crashing. It even sustained a mild reddit hug of death.
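The post does not show the locking code. The "Renderer lock" presumably refers to a lock object exposed by matplotlib's Agg backend; the sketch below uses a plain threading.Lock instead, so it does not rely on any particular matplotlib version. The general idea is the same: Streamlit runs each session's script in its own thread, matplotlib is not thread-safe, so only one thread at a time is allowed into the plotting code. render_plot is a hypothetical stand-in for that code.

```python
import threading

# One lock for the whole process, shared by every session thread.
_plot_lock = threading.Lock()


def render_plot(points, log):
    # Hypothetical stand-in for plotting code that is not thread-safe.
    with _plot_lock:
        log.append("start")
        total = sum(points)
        log.append("end")
    return total


def simulate_sessions(n_sessions=4):
    # Each thread plays the role of one user session.
    log = []
    threads = [
        threading.Thread(target=render_plot, args=([1, 2, 3], log))
        for _ in range(n_sessions)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


print(simulate_sessions())
```

Because every "start" is followed by its own "end" before another thread may enter, the plotting sections never interleave.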

For the long term though:

I adapted my code to use plotly instead of matplotlib to take some of the edge off and just have the browser do most of the work. Plus you get to make the plots interactive.
I run garbage collection after each run of the script.

These measures lowered my memory usage from ~50% to ~12.5%

BeyondMyself · April 22, 2021, 8:49am

How do you delete the garbage after each run of the script?

imaginary · April 22, 2021, 2:21pm

Basically just how you usually would in Python:

import gc

# Force a full collection, e.g. as the last statement of the script
gc.collect()

However, there are some caveats. Some functionality may be compromised by this, which is why streamlit doesn’t do it automatically for now, and why it might not be beneficial for everyone. The issue might be resolved soon, as the developers are aware of it.
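As a small illustration of what an explicit collection does, here is a sketch with a hypothetical Node class. Objects caught in a reference cycle are not freed by reference counting alone; they stay around until the cyclic collector runs, which gc.collect() forces immediately.

```python
import gc
import weakref


class Node:
    """Hypothetical object that ends up in a reference cycle."""

    def __init__(self):
        self.partner = None


def run_once():
    a, b = Node(), Node()
    a.partner, b.partner = b, a  # a and b now keep each other alive
    probe = weakref.ref(a)       # lets us check whether a still exists
    del a, b                     # names are gone, the cycle remains
    gc.collect()                 # force a full collection
    return probe() is None       # True once the cycle has been freed


print(run_once())
```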
