How to properly optimize CPU and memory usage

I built an app using Streamlit, and it was ridiculously easy to do the initial setup. However, I have been scrambling to optimize the application so that it can be used by many users at once while deployed on a fairly small EC2 instance on AWS (2 CPUs and 2 GB of memory).

Loading the data

I’ve found that I need to cache my basic load_data() function to keep the app from re-reading the data on every request.

The resulting dataset is around 20 MB after optimizing dtypes. However, a single user session results in about 300 MB of memory usage, and each additional session increases usage by varying amounts well above 20 MB.
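For anyone reading along, the kind of dtype optimization mentioned above can be sketched like this — the column names and data are made up for illustration, but the pattern (downcast numerics, convert low-cardinality strings to categoricals) is what typically shrinks a frame several-fold:

```python
import numpy as np
import pandas as pd

def shrink(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast numeric columns and convert low-cardinality strings to categoricals."""
    out = df.copy()
    for col in out.select_dtypes(include="integer"):
        out[col] = pd.to_numeric(out[col], downcast="integer")
    for col in out.select_dtypes(include="float"):
        out[col] = pd.to_numeric(out[col], downcast="float")
    for col in out.select_dtypes(include="object"):
        # Only convert when there are few distinct values relative to rows.
        if out[col].nunique() <= len(out) // 2:
            out[col] = out[col].astype("category")
    return out

# Hypothetical columns standing in for the real dataset.
df = pd.DataFrame({
    "year": np.random.randint(2000, 2020, 10_000),
    "decision": np.random.choice(["Accepted", "Rejected"], 10_000),
})
before = df.memory_usage(deep=True).sum()
after = shrink(df).memory_usage(deep=True).sum()
```

Note that the 300 MB per session is usually not the frame itself but copies of it made during filtering and plotting, so shrinking the base frame shrinks every downstream copy too.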

Filtering the data

I have stopped caching a filtering function that returns a fairly large object, in order to reduce memory usage. The function doesn’t take long to compute, so I don’t mind the slight slowdown this causes.
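A middle ground worth knowing about: Streamlit’s cache decorator accepts limits, so a cached filter doesn’t have to grow without bound. A sketch, with a hypothetical "year" column:

```python
import pandas as pd

# Instead of disabling caching entirely, st.cache_data can be bounded, e.g.
#
#     @st.cache_data(max_entries=20, ttl=600)
#     def filter_data(df, year_min, year_max):
#         ...
#
# which keeps at most 20 filtered results alive at once and expires them
# after 10 minutes. Below is the plain, uncached version; the "year"
# column is hypothetical.
def filter_data(df: pd.DataFrame, year_min: int, year_max: int) -> pd.DataFrame:
    return df[df["year"].between(year_min, year_max)]
```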

Displaying the data

Once I have loaded and filtered the data, I generate 4 plots using matplotlib. By this point, the dataset can contain anywhere from 2 to 100k data points depending on the filtering done.
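One thing worth checking here: matplotlib keeps every figure alive until it is explicitly closed, so figures created on each rerun can quietly accumulate per session. A sketch of one plot with the cleanup step — the column is a placeholder:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, as on a server
import matplotlib.pyplot as plt
import pandas as pd

def render_plot(df: pd.DataFrame):
    # One of the four plots; the "gpa" column is a placeholder.
    fig, ax = plt.subplots()
    ax.hist(df["gpa"])
    return fig

df = pd.DataFrame({"gpa": [3.1, 3.5, 3.9, 2.8]})
fig = render_plot(df)
# In the app: st.pyplot(fig), then release the figure's memory so it
# doesn't accumulate across reruns:
plt.close(fig)
```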

The issue

It feels like once 3 people start using the app, my server can’t handle the load and Streamlit crashes completely (usually spitting out a “segmentation fault (core dumped)” error). CPU usage goes to 100% and memory usage climbs significantly (sometimes up to 90%) even with just a few users.

Proposed solutions

Here are some things I have in mind to optimize performance. Please let me know whether these are good ideas, or if there are better ways to optimize the app:

  • Store data in a DB and simply cache the connection object and fetch the data according to the filters
  • Stop using matplotlib and use a JavaScript-based plotting library so that rendering happens in the user’s browser
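The first idea can be sketched with sqlite3 — the table and column names below are hypothetical, and in the app the connection getter would be wrapped in Streamlit’s st.cache_resource so each process opens one shared connection:

```python
import sqlite3

import pandas as pd

# In the Streamlit app, decorate this with @st.cache_resource so one
# connection is shared across reruns instead of opened each time.
def get_connection(path: str = ":memory:") -> sqlite3.Connection:
    return sqlite3.connect(path, check_same_thread=False)

def fetch_filtered(conn: sqlite3.Connection, decision: str) -> pd.DataFrame:
    # The filter runs in SQL, so only matching rows are materialized in
    # Python. Table and column names are hypothetical.
    return pd.read_sql(
        "SELECT * FROM results WHERE decision = ?", conn, params=(decision,)
    )

# Demo with an in-memory database standing in for the real one.
conn = get_connection()
conn.execute("CREATE TABLE results (year INTEGER, decision TEXT)")
conn.executemany(
    "INSERT INTO results VALUES (?, ?)",
    [(2019, "Accepted"), (2020, "Rejected"), (2020, "Accepted")],
)
df = fetch_filtered(conn, "Accepted")
```

The win is that the full dataset never enters Python memory per session — only the filtered slice does.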

Link to the code: GitHub - jjdelvalle/grad_stats: GradCafe stats generator

TL;DR: if my server can’t handle a 20 MB dataframe and matplotlib, would it be a better idea to just store the dataset in a DB and fetch results from Streamlit?

Hi @imaginary, welcome to the Streamlit community!

I think the two ideas you listed are good ones to try, and you could also look at pyarrow with memory mapping to avoid having to cache the data load at all.

https://towardsdatascience.com/apache-arrow-read-dataframe-with-zero-memory-69634092b1a

Ultimately, a problem like this one takes a bit of trial and error…good luck!

Best,
Randy
