Invalidating and rebuilding caches during the night without user interaction

Hi, we currently run a tool based on Streamlit v1.28 and Python 3.11 on our company’s local OpenShift platform. The tool lets users work with a large dataset that is cached with st.cache_data. Once the data is cached, performance is fine, but the initial load takes several minutes (and unfortunately, loading only part of the data is not an option).

Once a day the cache is invalidated (ttl is set to ‘24h’) in order to reload the dataset as it will most likely have changed significantly on the db and users need that latest state. Currently the rebuilding of the cache happens each day when the first user interacts with the tool, which results in them having to wait for a few minutes until they can use it.

Now my question: is it possible to configure Streamlit so that it triggers that cache invalidation and, importantly, the reloading of the cached data, say during the night, without any user having to interact with the UI? If not, does anybody have a best-practice solution for achieving this some other way? Is it possible, for example, to fire a corresponding API call against the Streamlit server’s URL?

Thank you

Hi @hitbyfrozenfire,

Thanks for posting!

I think, theoretically, you can use a cron job in OpenShift to run the data-loading script after the cache expires at midnight. This could be an interesting idea to try. If you can share some dummy code that mirrors your current implementation, that would be great for us to hack around a solution as well.

Thanks for your reply @tonykip

Locally (i.e. with the Streamlit server at http://localhost:8501) I tried sending a GET request to the server with requests.get. That didn’t work, as Streamlit pages are created dynamically in the browser (by JavaScript or TypeScript, I assume).

Then I tried emulating a user accessing the app by loading the page in a “headless browser” in the background, using the selenium package combined with Chrome/chromedriver (version numbers must match exactly) or Firefox/geckodriver (which seems to be a bit more tolerant of version mismatches). This worked: the caches were invalidated and the data was reloaded :)

The python code for this is rather simple, here for Firefox (Chrome works analogously):

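from selenium import webdriver
import time

# '[path_to_geckodriver]' is a placeholder for the local geckodriver binary
service = webdriver.FirefoxService(executable_path='[path_to_geckodriver]')
options = webdriver.FirefoxOptions()
options.add_argument('-headless')  # run Firefox without a visible window
with webdriver.Firefox(service=service, options=options) as driver:
    driver.get('http://localhost:8501')  # open the app like a user would
    time.sleep(3) # time required depends on how long data load takes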

Maybe the time.sleep could be replaced by a command that actually checks whether the page has fully loaded, but I haven’t figured that out yet. I also haven’t been able to build a Docker image or try this on OpenShift yet. From this experience, though, any such call fired at the Streamlit server by a tool that can handle dynamically created pages seems to do the trick, at least as a workaround hack. I guess something like pyppeteer or playwright should also work, but I haven’t tried them.
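
As a rough sketch of how the fixed time.sleep might be replaced by a readiness check, the loop below opens the app and polls until an element matching a CSS selector shows up. The helper name warm_cache, the ready_selector parameter and the timeout values are assumptions for illustration only; the selector would have to match something the app renders only after the data has been loaded. Selenium also ships WebDriverWait for this kind of explicit wait, which may well be the more idiomatic choice.

```python
import time


def warm_cache(driver, url, ready_selector, timeout=600.0, poll=2.0):
    """Open the app and wait until an element matching ready_selector exists.

    `driver` is a Selenium WebDriver (or any object with the same `get` and
    `find_elements` methods). `ready_selector` is a hypothetical CSS selector
    for an element the app only renders once the data is loaded.
    Returns True if the element appeared before `timeout` seconds, else False.
    """
    driver.get(url)  # opening the page is what makes Streamlit run the script
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # "css selector" is the locator strategy behind Selenium's By.CSS_SELECTOR
        if driver.find_elements("css selector", ready_selector):
            return True
        time.sleep(poll)  # wait a little before checking again
    return False
```

Inside the `with webdriver.Firefox(...)` block from the snippet above, this could be called as warm_cache(driver, 'http://localhost:8501', '[selector]') in place of the time.sleep line, with the timeout set comfortably above the usual load time.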


Oh cool. I think you’re much further along in getting to a better implementation than I am at the moment.

Let me know if you find an optimal way.