I built an app using Streamlit, and the initial setup was ridiculously easy. However, I have been scrambling to optimize the application so that many users can use it at once while it's deployed on a fairly small EC2 instance on AWS (2 CPUs and 2 GB of memory).
Loading the data
I’ve found that I need to cache my basic load_data() function to prevent the app from re-reading the data on every request.
The resulting dataset is around 20 MB after optimizing dtypes. However, a single user session results in about 300 MB of memory usage, and each further session increases usage by varying amounts above 20 MB.
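For reference, here is a minimal sketch of the kind of caching I mean (not my exact code: the CSV path and dtype tweaks are placeholders, and I'm assuming a recent Streamlit where st.cache_data exists). Worth noting: st.cache_data hands each caller a copy of the cached value, so every session still pays for its own copy of the frame, which may be part of why memory grows per session:

```python
import pandas as pd
import streamlit as st

@st.cache_data(ttl=3600)  # cached once per process; each call gets a copy
def load_data() -> pd.DataFrame:
    df = pd.read_csv("data/grad_cafe.csv")  # placeholder path
    # Shrink the in-memory footprint: downcast integers and use
    # categories for repetitive string columns
    for col in df.select_dtypes(include="integer").columns:
        df[col] = pd.to_numeric(df[col], downcast="integer")
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].astype("category")
    return df
```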
Filtering the data
To save memory, I have stopped caching a filtering function that returns a fairly large object. The function doesn't take long to compute, so I don't mind the slight slowdown this causes.
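Roughly what the uncached filter looks like (a sketch; the degree and year columns are stand-ins for my real columns):

```python
import pandas as pd

def filter_data(df: pd.DataFrame, degree: str, years: tuple[int, int]) -> pd.DataFrame:
    # Deliberately not cached: the result can be large, and recomputing a
    # boolean mask is cheap compared to holding a copy per filter combination.
    mask = (df["degree"] == degree) & df["year"].between(years[0], years[1])
    return df.loc[mask]
```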
Displaying the data
Once I have loaded and filtered the data, I generate four plots using matplotlib. Depending on the filtering, the dataset at this point can contain anywhere from just 2 to 100k data points.
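One thing that seems to matter for memory in a long-running server is closing matplotlib figures explicitly, since pyplot keeps every figure registered until it's closed. A sketch of how a single plot could be rendered (the "gpa" column is a placeholder):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; no GUI resources on the server
import matplotlib.pyplot as plt
import pandas as pd
import streamlit as st

def render_histogram(filtered: pd.DataFrame) -> None:
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.hist(filtered["gpa"].dropna(), bins=30)  # "gpa" is a placeholder column
    ax.set_xlabel("GPA")
    ax.set_ylabel("Count")
    st.pyplot(fig)
    plt.close(fig)  # free the figure so pyplot doesn't accumulate them
```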
Once about 3 people start using the app, the server can't handle the load and Streamlit crashes completely (usually spitting out a “segmentation fault (core dumped)” error). CPU usage goes to 100%, and memory usage increases significantly even with just a few users (sometimes up to 90%).
Here are some things I have in mind to optimize performance. Please let me know if these are good ideas or if there are better ways to optimize the app:
- Store the data in a DB, cache only the connection object, and fetch the data according to the filters (see the sketch below)
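Here is a sketch of what I'm imagining for the DB approach, using SQLite as a stand-in (the table and column names are hypothetical; st.cache_resource shares one object across sessions instead of copying it the way st.cache_data does):

```python
import sqlite3
import pandas as pd
import streamlit as st

@st.cache_resource  # one shared connection per process, never copied
def get_connection() -> sqlite3.Connection:
    # check_same_thread=False so Streamlit's worker threads can share it
    return sqlite3.connect("grad_cafe.db", check_same_thread=False)

def fetch(degree: str, year_from: int, year_to: int) -> pd.DataFrame:
    # Push the filtering into SQL so only matching rows enter Python memory
    query = "SELECT * FROM results WHERE degree = ? AND year BETWEEN ? AND ?"
    return pd.read_sql_query(query, get_connection(), params=(degree, year_from, year_to))
```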
Link to the code: GitHub - jjdelvalle/grad_stats: GradCafe stats generator