How to properly optimize CPU and memory usage

I built an app using Streamlit, and the initial setup was ridiculously easy. However, I have been scrambling to optimize the application so that many users can use it at once while it is deployed on a fairly small EC2 instance on AWS (2 CPUs and 2 GB of memory).

Loading the data

I’ve found that I need to cache my basic load_data() function in order to prevent the app from reading the data each time there’s a request.

The resulting dataset is around 20MB after optimizing dtypes. However, a single user session results in about 300MB of memory usage, and each further session increases usage by varying amounts above 20MB.

Filtering the data

To reduce memory usage, I have stopped caching a filtering function that returns a fairly large object. The function doesn’t take long to compute, so I don’t mind the slight slowdown.

Displaying the data

Once I have loaded and filtered my data, I generate 4 plots. By this point, the dataset can contain anywhere from just 2 to 100k data points depending on the filtering done. I am using matplotlib to generate the plots.

The issue

It feels like once 3 people start using the app, my server can’t handle the load and Streamlit crashes completely, usually spitting out a “segmentation fault (core dumped)” error. CPU usage goes to 100% and memory usage increases significantly (sometimes up to 90%), even with just a few users.

Proposed solutions

Here are some things I have in mind to optimize performance. Please do let me know if these are good ideas or if there are better ways to optimize the app:

  • Store the data in a DB, cache only the connection object, and fetch the data according to the filters
  • Stop using matplotlib and use something that has JavaScript render the plots in the browser
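
To make the first idea concrete, here is a minimal sketch of fetching only the filtered rows from a database instead of holding the whole dataframe in memory. It uses Python's built-in sqlite3 with an in-memory database; the table name, the columns, the sample rows, and the helpers build_demo_db and fetch_filtered are all made up for illustration and are not taken from the linked repo.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a few made-up rows (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE results (institution TEXT, major TEXT, decision TEXT, gpa REAL)"
    )
    conn.executemany(
        "INSERT INTO results VALUES (?, ?, ?, ?)",
        [
            ("Alpha U", "CS", "Accepted", 3.8),
            ("Beta U", "CS", "Rejected", 3.5),
            ("Gamma U", "Physics", "Accepted", 3.9),
            ("Delta U", "CS", "Accepted", 3.6),
        ],
    )
    conn.commit()
    return conn


def fetch_filtered(conn: sqlite3.Connection, major: str, decision: str) -> list:
    """Let the database do the filtering so only matching rows reach Python."""
    cursor = conn.execute(
        # Parameterized query: filter values are passed separately from the SQL text.
        "SELECT institution, gpa FROM results "
        "WHERE major = ? AND decision = ? ORDER BY institution",
        (major, decision),
    )
    return cursor.fetchall()


conn = build_demo_db()
rows = fetch_filtered(conn, "CS", "Accepted")
print(rows)
```

In a real deployment the connection would point at a file or a database server, and the connection object is the only thing that would need to be cached; each session then only holds the rows its filters selected.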

Link to the code: GitHub - jjdelvalle/grad_stats: GradCafe stats generator

TL;DR if my server can’t handle a 20MB dataframe and matplotlib, would it be a better idea to just store the dataset in a DB and fetch results from streamlit?

Hi @imaginary, welcome to the Streamlit community!

I think the two quoted ideas above are good ideas to try, and you could also try to use pyarrow and memory mapping to avoid having to cache the data load.

https://towardsdatascience.com/apache-arrow-read-dataframe-with-zero-memory-69634092b1a

Ultimately, a problem like this one takes a bit of trial and error…good luck!

Best,
Randy

As a quick update (in case anyone runs into this thread because of Google):

Making sure you’re using the Renderer lock got rid of the problem for me completely. Memory usage was still pretty high, but the app no longer crashed. It even survived a mild reddit hug of death.
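
For anyone wondering what the lock does: Streamlit runs each session's script in its own thread within one process, and matplotlib is not thread-safe, so two sessions drawing at the same time can step on each other. The sketch below shows the general pattern with a plain threading.Lock as a stand-in for the renderer lock mentioned above; render_serialized and fake_draw are hypothetical names, and fake_draw only simulates a plotting call.

```python
import threading
import time

# One lock shared by every thread in the process (stand-in for the renderer lock).
_render_lock = threading.Lock()


def render_serialized(draw, *args, **kwargs):
    """Run draw(...) while holding the shared lock, so only one thread draws at a time."""
    with _render_lock:
        return draw(*args, **kwargs)


# Counters used only by this demo to observe how many draws overlap.
active = 0
max_active = 0


def fake_draw(label):
    """Pretend to render a plot; records the highest number of simultaneous calls."""
    global active, max_active
    active += 1
    max_active = max(max_active, active)
    time.sleep(0.01)  # simulate rendering work
    active -= 1
    return f"plot for {label}"


# Four "sessions" try to draw at once; the lock makes them take turns.
threads = [
    threading.Thread(target=render_serialized, args=(fake_draw, i)) for i in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(max_active)
```

In an actual app, the code inside the lock would be the figure creation plus the call that hands the figure to Streamlit, rather than fake_draw.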

For the long term though:

  • I adapted my code to use plotly instead of matplotlib to take some of the edge off and just have the browser do most of the work. Plus you get to make the plots interactive.
  • Do garbage collection after each run of the script

These measures lowered my memory usage from ~50% to ~12.5%.

How do I delete the garbage after each run of the script?

Basically just how you usually would in python:

import gc
gc.collect()  # force a full collection, e.g. as the last line of the script

However, there are some caveats. Some functionality may be compromised by this, which is why Streamlit doesn’t do it automatically for now and why it might not be beneficial for everyone. The issue might be resolved soon, as the developers are aware of it.
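
As a rough illustration of what the gc.collect() call above actually reclaims: Python frees most objects as soon as their reference count hits zero, and the collector only matters for objects stuck in reference cycles. The sketch below builds such a cycle and collects it; Node and make_cycle are hypothetical names used only for this demo, and the automatic collector is switched off just to keep the demo deterministic.

```python
import gc


class Node:
    """Tiny object that can point at another Node."""

    def __init__(self):
        self.other = None


def make_cycle():
    """Create two objects that reference each other, then drop all outside references."""
    a, b = Node(), Node()
    a.other, b.other = b, a  # reference cycle: refcounts never reach zero on their own


gc.disable()   # pause automatic collection so the demo controls when it happens
gc.collect()   # start from a clean slate
make_cycle()   # the two Nodes are now unreachable but still allocated
freed = gc.collect()  # returns the number of unreachable objects it found
gc.enable()

print(freed)
```

The exact number printed depends on the Python version, but it is nonzero because the cycle could not be freed by reference counting alone.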