Caching update/reset at a particular time interval

Hi,

I am currently utilizing Streamlit to connect to serverless compute for data access. I have implemented caching to store data for a day, which functions as expected. However, the backend data is updated at 3:00 AM daily, while the Streamlit app reflects these updates much later, around 9:00 AM, due to the daily caching mechanism retaining older data.

Is there a method to maintain the daily caching while ensuring the cached data refreshes every day at 3:30 AM? I prefer not to use a manual button or option for clearing cache/updating the data.

This is my code:

Hi @blackary, Can you please help me with this?

Hi @HIMANSHU21, I’m sorry, I haven’t been able to come up with a good way to get this to work. Are the queries slow enough that only caching them for an hour would lead to a poor user experience?

If so, my general recommendation would be to set up a pipeline that caches the data into a dedicated pre-processed table, so that querying for the data is faster. Also, at least with Snowpark, the fewer rows you return from the query, the better. So, if you’re doing something like:

  • Fetch 1M rows
  • Group them into 1k rows with Pandas

Then I would highly recommend merging the grouping into the query itself, so that the query only has to create 1k objects in Python instead of 1M.
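For example, with made-up table/column names and conn being any DB-API connection, pushing the group-by into SQL looks like this:

    import pandas as pd

    # Before: fetch ~1M raw rows, then group them in pandas
    # df = pd.read_sql("SELECT region, amount FROM sales", conn)
    # summary = df.groupby("region", as_index=False)["amount"].sum()

    # After: let the warehouse do the grouping, so only ~1k rows come back
    summary = pd.read_sql(
        """
        SELECT region, SUM(amount) AS total_amount
        FROM sales
        GROUP BY region
        """,
        conn,
    )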

My general advice for cases where queries are the bottleneck is here: How to improve Streamlit app loading speed
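(For reference, the one-hour-caching idea above only requires changing the ttl argument; the function name here is just a placeholder:)

    @st.cache_data(ttl=3600)  # keep query results for at most one hour
    def run_query(sql_text):
        ...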

Hi @blackary ,

Thank you for your response. The query execution time is approximately 20-25 seconds. I am utilizing Databricks to manage data and tables, connecting via serverless compute, which requires 3-4 seconds to initialize. Our application relies on 5-6 different tables to generate multiple pages, each dependent on distinct datasets. Moreover, we need all the data every time because the dashboard is based on historical data. This is what I am doing:

######################################################################

# Import the required libraries
import streamlit as st
import os
from databricks import sql

# Create a connection to the Databricks SQL warehouse, cached for 1 day
@st.cache_resource(ttl=86400)
def get_connection():

    # Read the connection settings from environment variables
    server_hostname = os.getenv('server_hostname')
    http_path = os.getenv('http_path')
    access_token = os.getenv('access_token')

    if not server_hostname or not http_path or not access_token:
        st.error("server_hostname, http_path, or access_token is not set.")
        return None

    try:
        connection = sql.connect(
            server_hostname=server_hostname,
            http_path=http_path,
            access_token=access_token
        )
        return connection
    except Exception as e:
        st.error(f"An unexpected error occurred: {e}")
        return None
			
# Fetch a table from the Databricks SQL warehouse, with results cached for 1 day
@st.cache_data(ttl=86400)
def query_data(table_name):
    connection = get_connection()
    if connection is None:
        return None

    catalog_name = os.getenv('catalog_name')
    gold_table = os.getenv('gold_table')

    # Build the query from the passed table_name argument
    query = f"SELECT * FROM {catalog_name}.{gold_table}.{table_name}"

    # Execute the query and convert the result to a pandas DataFrame,
    # closing the cursor when done
    with connection.cursor() as cursor:
        cursor.execute(query)
        df = cursor.fetchall_arrow().to_pandas()

    return df

# Run a query once per session and store the result in session state
def fetch_data(key, query_name, error_message="Failed to fetch data"):
    try:
        if key not in st.session_state:
            st.session_state[key] = query_data(query_name)
    except Exception as e:
        st.error(f"{error_message}: {e}")
        st.session_state[key] = None

# Load each gold table into session state
for i in range(1, 6):
    fetch_data(
        key=f'data_table{i}',
        query_name=f'gold_table{i}'
    )

########################################################

Please help me: how can I optimize this?

Also, to resolve the original issue, I implemented something like this:
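In essence (a simplified sketch, with illustrative names): compute a token that rolls over once per day at 3:30 AM and pass it to the cached function, so its cache key changes the first time the app runs after 3:30 AM:

    import datetime

    def daily_refresh_token(hour=3, minute=30):
        # Value that changes once per day at the boundary time (03:30 here);
        # before the boundary we are still in the previous day's window.
        now = datetime.datetime.now()
        boundary = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
        window = now.date() if now >= boundary else now.date() - datetime.timedelta(days=1)
        return window.isoformat()

    @st.cache_data(ttl=86400)
    def query_data(table_name, refresh_token):
        # refresh_token is part of the cache key: when it rolls over at
        # 3:30 AM, Streamlit re-runs the query instead of serving the
        # cached result.
        ...  # same body as the query_data function above

    df = query_data('gold_table1', refresh_token=daily_refresh_token())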

Let me know if you have some comments/suggestions.

Sorry, it’s difficult to optimize without knowing more about how you’re using the data. My only general advice is this: unless you’re literally displaying the entire table of data to the user (and I mean the entire table, every column, every row), there’s likely an opportunity to optimize by limiting the amount of data returned (group-bys, selecting only certain columns, etc.).
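For instance, reusing the variables from your query_data function with hypothetical column names:

    # Pull only the columns the page actually renders, and let the
    # warehouse aggregate instead of doing SELECT *
    query = f"""
        SELECT order_date, region, SUM(amount) AS total_amount
        FROM {catalog_name}.{gold_table}.{table_name}
        GROUP BY order_date, region
    """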

You certainly know the data and your requirements better than I do, and I don’t have any Streamlit-specific advice beyond the blog post I shared above.