Caching expensive database query with concurrent users

Hello everyone,

The setting

I have an application where Streamlit is querying a database for lots of data on a daily basis and caching these queries. Executing the query and reading data into memory takes ~20 minutes.

The issue

Now, let’s imagine that the cache just expired and 20 users simultaneously choose to load the app (could also be one clueless user trying to refresh the page again and again)… Since it takes 20 minutes to complete the function call which is cached, and thus 20 minutes to “populate the cache”, 20 identical queries will be made, and instead of running this function once, it will be run 20 times, which causes memory-overflow. I want to avoid this, but I’m not sure what the best approach is…

Possible solutions

  1. In the ideal scenario, Streamlit’s caching functionality would check whether this function call is currently in process of being cached, and wait for this to be complete. Would the new caching primitive st.experimental_singleton() potentially solve this?

  2. As soon as the function is hit, an entry is made in a dedicated table in the database indicating that this function call is currently being cached. The function should check this table in the database before executing, and if it finds that the value is currently being cached, it should provide an info message telling that data is currently being loaded (and then call st.stop()) and that the user should refresh the app to check if the data has been fully loaded.
    When the function completes, the entry in the database indicating whether the cache is being populated should then be set as “completed” or something similar. Issues with this approach is that it’s some work to implement, and there needs to be some mechanisms in place if the app dies while in the middle of the function call, which would leave the entry in the database as forever “loading into cache”.

  3. Other good ideas very welcome :nerd_face:

Thanks a lot for any help :slight_smile:

3 Likes

Great

@Kareem_Rasheed_babat
If you still have a need protecting a heavy computation from multiple users, we have currently implemented a “hack” for this in form of modifying streamlit/caching/cache_utils.py

The file in the gist is modified from streamlit v 1.6.0, and the modified section is from line 100-167.
We copy this file into the streamlit lib during a docker build.
If you use this for anything important i would suggest writing a test that checks the streamlit version number.

1 Like

Hi there @PeterT ,

Did you find a solution to this? I have the same challenge and wanted to know if your possible solutions (1) and (2) are finally feasible.

Also curious to see if anyone else has successfully dealt with concurrent users/open tabs (@manepal thanks for sharing yours - trying to avoid modifying the cache_utils.py file but will anyways review)

@marduk - Peter used to work at the same company as I am.
We currently use the modified cache_utils in our setup by simply copying it in to our container.
We have written a test to ensure we check for compatibility when we update the streamlit version.
The patch cache_utils works on versions1.6.0 to 1.9.0, we have not tested earlier versions of this.
We had another version of this on version ~ 0.89, before we updated to 1.6.0 and the new cache functions.
Hope this message helps :slight_smile:

1 Like

Thanks so much @manepal for clarifying, appreciate it. I will give it a shot in the next few days (might reach out with questions, hope that’s ok :slight_smile: ).

This is a great fix for a very real problem. Does streamlit plan on adopting this solution in a next version?