Database example to replace BI tool

I want to test out replacing a BI tool with Streamlit. I am currently querying a Postgres database, capturing the data in a pandas dataframe, transforming the data, and visualising time series and histograms with Altair. There are multiple opportunities for caching. I haven’t yet found an example covering this use case; can anyone recommend one? Thanks in advance.


This is a very similar use case to a dashboard application we have built for internal use. We use BigQuery as the backend database. The main parameterized query supports several different report pages, and it is the dataframe returned for a particular parameter set that gets cached.

There are other opportunities for caching within the report analyses, but we generally don’t use them. I tried it once, and layering cached subsets of cached data on top of the other state management is a great way to tie yourself into logical knots. It is easier, and usually almost as fast, to keep just a few pools of data. BigQuery itself also caches query results for 24 hours, so it is often just not worth getting too sophisticated.

The rule of thumb I use goes roughly like this:

  1. Information that is only used to rerun a report: use out-of-the-box Streamlit.
  2. Information that needs to be transferred between reports (for example filter settings or drill-downs): use SessionState or one of its variants.
  3. Data that is accessed across a collection of pages: use caching, sometimes mutable, sometimes not; it depends on the specific use case.

Recently we started incorporating another dataset from an on-prem SQL Server instance. It is only updated daily, and it turns out to be easier to manage a persistent file cache of the dataframe explicitly (we use pandas.DataFrame.to_pickle() with gzip compression). That is fast to load with minimal memory overhead.
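A minimal sketch of that kind of explicit file cache (the file name and the `load_daily_frame` / `fetch` names are made up for illustration, not our production code):

```python
from pathlib import Path
import pandas as pd

def load_daily_frame(fetch, cache_path: Path = Path("daily_extract.pkl.gz")) -> pd.DataFrame:
    """Return the pickled frame if it exists; otherwise fetch once and persist it.

    Gzip compression is inferred by pandas from the .gz suffix.
    """
    if cache_path.exists():
        return pd.read_pickle(cache_path)
    df = fetch()              # e.g. a function that runs the SQL Server query
    df.to_pickle(cache_path)  # compressed on disk, fast to reload
    return df
```

In practice you would also delete or overwrite the file when the daily refresh lands; that housekeeping is omitted here.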


@knorthover, can you please share an example of how to properly cache the pandas dataframe containing the BigQuery data? I am getting my dataframe from BigQuery as well, using the code below, but when I add the @st.cache decorator before my function definition, I get an error about hashing.

df = (
    bqclient.query(query_string)
    .result()
    .to_dataframe(bqstorage_client=bqstorageclient)
)
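A common cause of that hashing error is st.cache trying to hash the BigQuery client object. The legacy st.cache decorator accepts a hash_funcs mapping that tells Streamlit how to treat a type it can’t hash; mapping the client type to a constant is the usual workaround. A sketch (the import guard is only so the snippet runs outside a Streamlit/BigQuery environment; `run_query` is a hypothetical name):

```python
import pandas as pd

try:
    import streamlit as st
    from google.cloud import bigquery
    # Treat the unhashable client as a constant so st.cache skips it.
    cache = st.cache(hash_funcs={bigquery.Client: lambda _: None})
except Exception:  # no Streamlit/BigQuery installed: fall back to no caching
    def cache(func):
        return func

@cache
def run_query(client, query_string: str) -> pd.DataFrame:
    """Run the query and cache the resulting dataframe, keyed on the query text."""
    return client.query(query_string).result().to_dataframe()
```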

@Greg
You didn’t show your actual function or specify what the error was.

The caching approach we settled on has several layers. The essential point is that we want to control when the cache refreshes, as we are running an online dashboard with real-time background updates fed into BigQuery. We do that by wrapping the factory function that actually composes and runs the query in another function. The underlying factory function is never cached; that responsibility belongs solely to the wrapper. The crucial parameter is “at_time”, a datetime string truncated to whatever precision is convenient for the application; in our case 5 minutes works well. You’ll notice that “at_time” isn’t passed through to the underlying loader, so its only effect is to control when the cache is refreshed on page reruns. The other parameters are used to construct the query.

@st.cache
def get_recordings_data(start_date: str = None, end_date: str = None,
                        phrases: str = None, number: str = None, at_time: str = None,
                        project: str = None, dataset: str = None, table: str = None
                        ) -> pd.DataFrame:
    """Wrapper for :func:`load_metadata_for_period` for Streamlit caching.

    start_date and end_date should be ISO format strings in UTC.
    """
    data = loaders.load_metadata_for_period(start_date, end_date,
                                            phrases=phrases, number=number,
                                            project=project, dataset=dataset,
                                            table=table)
    return data
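The 5-minute truncation of “at_time” isn’t shown above; one way to produce it looks roughly like this (a sketch, `cache_key_time` is a made-up name, not the exact code we use):

```python
from datetime import datetime, timezone

def cache_key_time(minutes: int = 5) -> str:
    """Truncate the current UTC time to a fixed bucket.

    Every rerun inside the same bucket yields the same string, so the
    cached query result is reused; when the bucket rolls over, the changed
    argument forces a fresh query.
    """
    now = datetime.now(timezone.utc)
    floored = now.replace(minute=now.minute - now.minute % minutes,
                          second=0, microsecond=0)
    return floored.isoformat()
```

So a call such as `get_recordings_data(..., at_time=cache_key_time())` would refresh the cache at most once per five minutes.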

This wrapper function is then called by the page code, which usually applies domain-specific filters to the returned dataframe. The aim is that most filtering operations, when a user is fine-tuning their view of the data, are done with pandas within the Streamlit page run; the actual BigQuery query is used to fetch the broadest reasonable superset for the page to work with.
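That in-page filtering on top of the cached superset is plain pandas. For instance (function and column names are made up for illustration):

```python
from typing import Optional
import pandas as pd

def filter_view(df: pd.DataFrame, phrase: Optional[str] = None,
                start: Optional[str] = None,
                end: Optional[str] = None) -> pd.DataFrame:
    """Narrow the cached superset in memory; no further round trip to BigQuery."""
    out = df
    if phrase is not None:
        out = out[out["phrase"] == phrase]
    if start is not None:
        out = out[out["recorded_at"] >= start]
    if end is not None:
        out = out[out["recorded_at"] < end]
    return out
```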

The application generally doesn’t mutate the cached values; if we need to, we usually just copy a working subset to a temporary local dataframe. In special cases, such as exceptionally large result dataframes that take a long time to hash, we’ll set the “allow_output_mutation” flag on the cache decorator to True. But we haven’t found that really necessary on our typical result sets, which are around 10,000 to 500,000 rows.
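Copying a working subset instead of mutating the cached frame can look like this (a sketch; the helper name is made up):

```python
import pandas as pd

def working_copy(cached: pd.DataFrame, mask) -> pd.DataFrame:
    """Take an explicit copy so in-place edits never touch the cached object."""
    return cached.loc[mask].copy()
```

The explicit `.copy()` is the point: edits to the returned frame can’t trigger Streamlit’s cached-object-mutation warning.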

We’ve been using this approach since October 2019, and it’s been stable and effective across Streamlit releases.


For exchanging data with a database, here is a complete example:

@knorthover, my function looks like this:

def load_gbq_data(date):
    query_string = """ some functional gbq query """
    df = (
        bqclient.query(query_string)
        .result()
        .to_dataframe()
    )
    now = datetime.now()
    return df, now

For background, my GBQ data is refreshed daily, so I’d like to simply query the GBQ data once when the app is opened, then use cached data for the rest of the interactions with the app that day.

So, when I load my data, I simply pass today’s date, which of course isn’t changing often while using the app (unless I use it around midnight 🙂).

todays_date = date.today()
df, now = load_gbq_data(todays_date)

However, the caching doesn’t seem to be working: the page refresh time when I move a slider filter suggests a re-query, and when I display the `now` timestamp it also changes. Any ideas?

Nevermind! I forgot to add the little bit of Streamlit magic…

@st.cache

Works perfectly now. Thanks for the help, Kevin and @randyzwitch! 😂

Greg

It is always going to try to update the cache because you are collecting datetime.now() inside the function and returning it as part of the result, which is also hashed. now() can never return the same value twice, so the output hash can never be invariant, and Streamlit’s hashing has to cope with that kind of issue. You should move all the code associated with now() outside the function you are caching.
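In other words, keep the timestamp out of the cached return value. A sketch of the restructured version (the query body is a stand-in; the decorator is commented out only so the snippet stands alone outside a Streamlit app):

```python
from datetime import datetime
import pandas as pd

# @st.cache  # goes back on inside the Streamlit app
def load_gbq_data(day: str) -> pd.DataFrame:
    """Return only the dataframe, so the cached value is stable for a given day."""
    # stands in for: bqclient.query(query_string).result().to_dataframe()
    return pd.DataFrame({"day": [day]})

def page_run(day: str):
    df = load_gbq_data(day)        # cache hit for repeated reruns on the same day
    fetched_at = datetime.now()    # timestamp computed outside the cached function
    return df, fetched_at
```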

Streamlit is a wonderful tool, but the price is that you have to be really careful whenever you do anything that requires the application to remember state.

Kevin

Kevin Northover
Delta Dental of NJ
