How to tell which variables are being recomputed

Iā€™ve written a small app in Streamlit, in only a few lines, to explore logs. Iā€™m pulling the logs from a database and then have a few fields to filter and explore. Itā€™s working great, but a little bit slow; Iā€™m afraid itā€™s re-running the database queries unnecessarily, but I donā€™t understand how Streamlit works well enough to know what is being re-run each time.

Is there some kind of debug mode that will tell me exactly what is being recomputed on each re-run? Something like that would be very useful!

Thanks!

Below is my code for this toy app. I have st.cache() over the functions that fetch data, but the app is still slow to display the logs even when the data has been displayed before (there arenā€™t many logs, so volume shouldnā€™t be the issue). My guess is that it is querying the DB unnecessarilyā€¦

import streamlit as st
import pandas as pd
import psycopg2

con = psycopg2.connect(dbname='dbname',
                       host='fhost',
                       port='0000', user='user', password='pw')
cur = con.cursor()


@st.cache()
def load_names():
    cur.execute("SELECT DISTINCT name FROM dev_all.ds_logs")
    return cur.fetchall()


@st.cache()
def load_data(app):
    cur.execute("SELECT * FROM dev_all.ds_logs WHERE name = '{}'".format(app))
    return cur.fetchall()


st.title('Logs Explorer')
apps = load_names()
app = st.selectbox("Select App", [str(app[0]) for app in apps])
data = load_data(app)
logs = pd.DataFrame.from_records(data)
st.write(logs)

Hi @timforr! Thatā€™s a great question. Weā€™ve been thinking about adding something like that to Streamlit for some time now, but we never created an actual feature request for it ā€” until now!

In the meantime, until that feature materializes, I wrote a little class that can help with your app: https://gist.github.com/tvst/ad39fc29d69a933141c7a4564287cbf2

To use it, save it as timeit.py, create a t = timeit.TimeIt() object, then sprinkle t.tick() calls all over your app:
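For reference, here is a minimal sketch of what such a timer class might look like (the actual gist may differ; this version prints timings rather than writing them to the app):

```python
import time


class TimeIt:
    """Minimal elapsed-time tracker: each tick() reports the time
    since the previous tick. Sketch only; the real gist may differ."""

    def __init__(self):
        self._last = time.time()

    def tick(self, message):
        now = time.time()
        elapsed = now - self._last
        self._last = now
        # The gist presumably writes to the Streamlit app;
        # here we just print to stdout.
        print("%s: %.3fs" % (message, elapsed))
        return elapsed
```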

import streamlit as st
import pandas as pd
import psycopg2
import timeit

t = timeit.TimeIt()
con = psycopg2.connect(dbname='dbname',
                       host='fhost',
                       port='0000', user='user', password='pw')
t.tick('connected')
cur = con.cursor()
t.tick('got cursor')


@st.cache()
def load_names():
    cur.execute("SELECT DISTINCT name FROM dev_all.ds_logs")
    return cur.fetchall()


@st.cache()
def load_data(app):
    cur.execute("SELECT * FROM dev_all.ds_logs WHERE name = '{}'".format(app))
    return cur.fetchall()


st.title('Logs Explorer')
apps = load_names()
t.tick('loaded names')
app = st.selectbox("Select App", [str(app[0]) for app in apps])
data = load_data(app)
t.tick('loaded data')
logs = pd.DataFrame.from_records(data)
t.tick('got logs')
st.write(logs)
t.tick('wrote logs')

Each t.tick("some message") will add the message to your app along with the time that elapsed since the previous tick.

Let me know if this helps!

Thanks for submitting the issue and writing the profiler!

Iā€™m still trying to understand the logic of Streamlitā€™s re-computation strategy. It appears to be re-connecting to the database each time, and re-running the cached functions each time. Ideally I would want everything cached and not re-run, since it is not necessary.

Should I be structuring the application differently to make caching work? Like wrapping the connection and cursor code in a function with no arguments and caching that function as well? Thanks!

Oh, I see. Iā€™ll answer each question below:

It appears to be re-connecting to the database each time

That part of your code isnā€™t cached, right?

con = psycopg2.connect(dbname='dbname',
                       host='fhost',
                       port='0000', user='user', password='pw')

and re-running the functions that were cached each time. Ideally I would want everything cached and not re-run, since it is not necessary.

Streamlit reruns a cached function when either:

  1. The function body was edited
    ā€“ or ā€“
  2. The body of any (local) function used by your function changed
    ā€“ or ā€“
  3. It is called with input arguments it hasnā€™t seen yet.
    ā€“ or ā€“
  4. Any other variable used by your functions changed.

So in your case, you probably have to cache the con and cur objects too.
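Rule 3 is ordinary memoization. Setting aside the code-hashing parts (rules 1, 2 and 4), a rough plain-Python analogy of how a cached function is only recomputed for unseen arguments might look like this (names here are illustrative, not Streamlit internals):

```python
import functools


def memoize(func):
    """Toy analogy for rule 3 of st.cache: recompute only when the
    function is called with arguments it hasn't seen yet."""
    cache = {}
    calls = {"count": 0}  # track real executions, for illustration

    @functools.wraps(func)
    def wrapper(*args):
        if args not in cache:          # unseen arguments -> recompute
            calls["count"] += 1
            cache[args] = func(*args)
        return cache[args]             # otherwise serve the cached value

    wrapper.calls = calls
    return wrapper


@memoize
def load_data(app):
    return "rows for " + app


load_data("app1")   # computed
load_data("app1")   # served from cache, not recomputed
load_data("app2")   # new argument -> computed again
```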

Thanks Thiago, I now understand when a function will be re-run; it makes sense.

You are correct, the two cached functions were being cached properly and not rerunning. My mistake.

The connection establishment is being re-run each time, which was making the app unresponsive (about one second of lag per action). Originally, I tried wrapping the connection establishment in a function and caching it, but I ran into errors because the connection and cursor objects cannot be hashed. I tried adding the ignore_hash=True argument, but still got an error.

import streamlit as st
import pandas as pd
import psycopg2


@st.cache()
def get_cursor():
    con = psycopg2.connect(dbname='dbname',
                           host='host',
                           port='0000', user='user', password='pass')
    return con.cursor()


@st.cache()
def load_names(cur):
    cur.execute("SELECT DISTINCT name FROM dev_all.ds_logs")
    return cur.fetchall()


@st.cache()
def load_data(cur, app):
    cur.execute("SELECT * FROM dev_all.ds_logs WHERE name = '{}'".format(app))
    return cur.fetchall()


cur = get_cursor()
st.title('Logs Explorer')
apps = load_names(cur)
app = st.selectbox("Select App", [str(app[0]) for app in apps])
data = load_data(cur, app)
logs = pd.DataFrame.from_records(data)
st.write(logs)

Error:

Streamlit cannot hash an object of type <class 'psycopg2.extensions.connection'>.

**More information:**  to prevent unexpected behavior, Streamlit tries to detect mutations in cached objects so it can alert the user if needed. However, something went wrong while performing this check.

Please [file a bug](https://github.com/streamlit/streamlit/issues/new/choose).

To stop this warning from showing in the meantime, try one of the following:

* **Preferred:**  modify your code to avoid using this type of object.
* Or add the argument  `ignore_cache=True`  to the  `st.cache`  decorator.

I did try ignore_cache=True as well, but it didnā€™t even recognize the argument (pretty sure that wouldnā€™t fix the issue anyway; perhaps the error message intended to suggest ignore_hash, not ignore_cache?).

I donā€™t see a way to avoid using a connection object, and if I leave it global, every action lags for about one second while it reconnects.

Thanks for your help. I hope that resolving this issue can help others, because Streamlit is an awesome concept and Iā€™m excited to use it on a bunch of projects!

edit: I wonder if it would be useful to have a way to explicitly tell streamlit not to re-run certain variables or lines of code.

Hey @timforr

Apologies for the delayed response, but weā€™ve been thinking about your use-case over here and have some updates for you :smiley:

First, youā€™re totally right that ignore_hash doesnā€™t work for your use case ā€” sorry for the confusion! For a second I thought ignore_hash would ignore input hashes, but actually it only ignores output ones (and thereā€™s a good reason for that; LMK if you want to hear it). To make this whole thing clearer, we have since renamed ignore_hash to allow_output_mutation.

Second, to actually solve your problem weā€™re working on three things right now:

  1. Better error messages for st.cache, i.e. making our errors actually point to the correct part of the code :smile:. See PR #490 and issue #487
  2. Better fallbacks for objects we donā€™t know how to hash. (I donā€™t have a link for this one yet; weā€™re brainstorming this in Google Docs.)
  3. A nice escape hatch you can use to tell Streamlit how to hash objects it doesnā€™t handle properly. See #551

(1) and (3) should be landing on develop in a matter of days, and Iā€™m hoping they will be released in a week or two.


In the meantime, thereā€™s a nasty hack you can use to persist your database connection without using st.cache.

In Streamlit, when your script is re-executed we actually persist all Python modules whose source files havenā€™t changed. This means you can dump objects you want to persist into a module and use it in all reruns of your script.

For example:
(Note: this code is untested!)

import streamlit as st
import pandas as pd
import psycopg2

# Hack to share the connection object globally
# by stashing it in a variable "global_con"
# on the "st" module >_<
if not hasattr(st, 'global_con'):
    st.global_con = psycopg2.connect(
        dbname='dbname',
        host='host',
        port='0000',
        user='user',
        password='pass')

# Grab the shared "global_con" object
con = st.global_con

@st.cache()
def load_names():
    cur = con.cursor()
    cur.execute("SELECT DISTINCT name FROM dev_all.ds_logs")
    return cur.fetchall()


@st.cache()
def load_data(app):
    cur = con.cursor()
    # Let the driver bind the parameter instead of formatting it
    # into the SQL string, which would allow SQL injection.
    cur.execute("SELECT * FROM dev_all.ds_logs WHERE name = %s", (app,))
    return cur.fetchall()

st.title('Logs Explorer')
apps = load_names()
app = st.selectbox("Select App", [str(app[0]) for app in apps])
data = load_data(app)
logs = pd.DataFrame.from_records(data)
st.write(logs)

In the process I also removed the cursor object from the argument list of the cached functions, since passing it around could lead to weird behavior and funny race conditions in multi-user scenarios.
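One more aside: building the SQL string with str.format, as in some of the snippets in this thread, opens the door to SQL injection; DB-API drivers let you pass query parameters separately and escape them for you. A minimal sketch of the idea using the stdlib sqlite3 driver and a toy table (psycopg2 uses %s placeholders instead of ?):

```python
import sqlite3

# Toy in-memory table standing in for dev_all.ds_logs.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ds_logs (name TEXT, msg TEXT)")
con.execute("INSERT INTO ds_logs VALUES ('app1', 'started')")


def load_data(app):
    # The driver binds and escapes the value; no string formatting.
    cur = con.execute("SELECT * FROM ds_logs WHERE name = ?", (app,))
    return cur.fetchall()


rows = load_data("app1")                      # normal lookup
attack = load_data("app1' OR '1'='1")         # injection attempt finds nothing
```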
