I built an app using Streamlit, and the initial setup was ridiculously easy. However, I have been scrambling to optimize the application so that many users can use it at once while it's deployed on a fairly small EC2 instance on AWS (2 CPUs and 2 GB of memory).
Loading the data
I’ve found that I need to cache my basic load_data() function to prevent the app from re-reading the data on every request.
The resulting dataset is around 20 MB after optimizing dtypes. However, a single user session results in about 300 MB of memory usage, and each further session increases usage by varying amounts above 20 MB.
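For reference, here is a minimal sketch of the kind of caching I mean (not my exact code: the CSV path and dtype tweaks are placeholders, and I'm assuming a recent Streamlit where st.cache_data exists). Worth noting: st.cache_data hands each caller a copy of the cached value, so every session still pays for its own copy of the frame, which may be part of why memory grows per session:

```python
import pandas as pd
import streamlit as st

@st.cache_data(ttl=3600)  # cached once per process; each call gets a copy
def load_data() -> pd.DataFrame:
    df = pd.read_csv("data/grad_cafe.csv")  # placeholder path
    # Shrink the in-memory footprint: downcast integers and use
    # categories for repetitive string columns
    for col in df.select_dtypes(include="integer").columns:
        df[col] = pd.to_numeric(df[col], downcast="integer")
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].astype("category")
    return df
```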
Filtering the data
To save memory, I have stopped caching a filtering function that returns a fairly large object. The function doesn't take long to compute, so I don't mind the slight slowdown this causes.
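Roughly what the uncached filter looks like (a sketch; the degree and year columns are stand-ins for my real columns):

```python
import pandas as pd

def filter_data(df: pd.DataFrame, degree: str, years: tuple[int, int]) -> pd.DataFrame:
    # Deliberately not cached: the result can be large, and recomputing a
    # boolean mask is cheap compared to holding a copy per filter combination.
    mask = (df["degree"] == degree) & df["year"].between(years[0], years[1])
    return df.loc[mask]
```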
Displaying the data
Once I have loaded and filtered the data, I generate four plots using matplotlib. Depending on the filtering, the dataset at this point can contain anywhere from just 2 to 100k data points.
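One thing that seems to matter for memory in a long-running server is closing matplotlib figures explicitly, since pyplot keeps every figure registered until it's closed. A sketch of how a single plot could be rendered (the "gpa" column is a placeholder):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; no GUI resources on the server
import matplotlib.pyplot as plt
import pandas as pd
import streamlit as st

def render_histogram(filtered: pd.DataFrame) -> None:
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.hist(filtered["gpa"].dropna(), bins=30)  # "gpa" is a placeholder column
    ax.set_xlabel("GPA")
    ax.set_ylabel("Count")
    st.pyplot(fig)
    plt.close(fig)  # free the figure so pyplot doesn't accumulate them
```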
Once about 3 people start using the app, the server can't handle the load and Streamlit crashes completely (usually spitting out a “segmentation fault (core dumped)” error). CPU usage goes to 100%, and memory usage increases significantly even with just a few users (sometimes up to 90%).
Here are some things I have in mind to optimize performance. Please let me know if these are good ideas or if there are better ways to optimize the app:
- Store the data in a DB, cache only the connection object, and fetch the data according to the filters (see the sketch below)
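Here is a sketch of what I'm imagining for the DB approach, using SQLite as a stand-in (the table and column names are hypothetical; st.cache_resource shares one object across sessions instead of copying it the way st.cache_data does):

```python
import sqlite3
import pandas as pd
import streamlit as st

@st.cache_resource  # one shared connection per process, never copied
def get_connection() -> sqlite3.Connection:
    # check_same_thread=False so Streamlit's worker threads can share it
    return sqlite3.connect("grad_cafe.db", check_same_thread=False)

def fetch(degree: str, year_from: int, year_to: int) -> pd.DataFrame:
    # Push the filtering into SQL so only matching rows enter Python memory
    query = "SELECT * FROM results WHERE degree = ? AND year BETWEEN ? AND ?"
    return pd.read_sql_query(query, get_connection(), params=(degree, year_from, year_to))
```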
Link to the code: GitHub - jjdelvalle/grad_stats: GradCafe stats generator