I keep getting the same error returned when I use st.cache before a function doing a complex query & pandas manipulation. The original function queries an API for JSON data twice, and then a SQL database, puts them in pandas, does a few manipulations, and then joins all 3 into 1 dataframe. Several of the columns contain lists, which seems like it might be the problem… But when I separated out the SQL query into its own function (that I'm also attempting to cache), it now seems to error on the SQL function. I've looked all over for documentation to resolve this and am having trouble finding the solution. Screenshot attached of the error.
Hello @ccoggins, welcome to the forums!
The documentation for Streamlit has just been updated with a page on advanced caching, most notably a bit on caching a database connection.
It's hard to say without your original code, but my guess would be that the code you are trying to cache which involves the SQL function opens a thread, and this thread cannot be hashed by Streamlit by default. You can configure the way Streamlit deals with caching this thread, though. As a first try, does it work with @st.cache(allow_output_mutation=True)?
Thanks! I just got to sit down with the new documentation, and I updated a few things that seem to have been beneficial to my Streamlit dashboard. I had stumbled onto allow_output_mutation=True prior to my post, and it didn't fix the issue. What I did do was separate the SQL engine connection into its own function and cache it with allow_output_mutation=True, and then passed that into the function that holds the SQL query and pandas manipulations. Even after that, the function containing the SQL query and pandas manipulations still returned the same error above.
What I also tried is turning off caching for this function, and the entire dashboard worked fine. I also tried passing in some of the arguments mentioned in the new documentation, and it didn't resolve anything (some of the concepts are new to me, so it's likely I was doing them wrong). When I tried hash_funcs={DBConnection: id}, it returned the error NameError: name 'DBConnection' is not defined. After changing DBConnection to my engine connection object, it returns the same error I originally posted about. It's probably a rookie mistake; I can't tell if DBConnection is a variable/object I should pass in or an argument Streamlit looks for.
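(Editor's note, for anyone hitting the same wall: the key in a hash_funcs dict is a class object you import yourself, not a name Streamlit resolves for you. A minimal stdlib sketch of the idea, using sqlite3 as a stand-in for whatever client library you actually use:)

```python
import sqlite3

# hash_funcs maps a class (the imported type itself) to a function that
# turns instances of it into something hashable -- here, their id().
hash_funcs = {sqlite3.Connection: id}

conn = sqlite3.connect(":memory:")

# Streamlit-style lookup: find the entry matching the object's type.
stand_in = hash_funcs[type(conn)](conn)
print(isinstance(stand_in, int))  # True: id() returns an int
```

So `DBConnection` in the docs is a placeholder for the concrete connection class of your library, which has to be importable in your script.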
Here's the function I'm having trouble caching, for context:
def get_songs(engine):
    sql = """
    SELECT *
    FROM "public"."table"
    LIMIT 10;
    """
    df = pd.read_sql(sql, con=engine)
    df.id = df.id.astype(str)
    df.col5 = [','.join(map(str, i)) for i in df.col5]
    df.col5 = df.col5.replace({'text1': 'text1 edit', 'text2': 'text2 edit'}, regex=True)
    # col2, col3, and col4 are list objects; col5 is a string
    df['col6'] = (df.col2 + df.col3 + df.col4 + df.col5.apply(lambda x: [x])).map(list).apply(lambda x: sorted(x))
    df = df[['id', 'col6']]
    return df
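(Editor's note: the col6 construction above can be checked in isolation on a toy frame; the data below is hypothetical, mirroring the column shapes described in the comment.)

```python
import pandas as pd

# Toy frame: col2/col3/col4 hold lists, col5 holds a string.
df = pd.DataFrame({
    "id": [1],
    "col2": [["a"]],
    "col3": [["b"]],
    "col4": [["c"]],
    "col5": ["d"],
})

# Adding Series of lists concatenates the lists element-wise;
# wrapping col5 in a one-element list lets it join the concatenation.
df["col6"] = (df.col2 + df.col3 + df.col4 + df.col5.apply(lambda x: [x])).map(list).apply(sorted)
print(df["col6"].iloc[0])  # ['a', 'b', 'c', 'd']
```

Note that list-valued columns like col6 are unhashable, which is one reason Streamlit's default hasher can choke on a dataframe like this.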
Just to be sure, I did a minimal example with sqlite3 so we can debug together! Here is what I understood from your problem:
load_sqlite.py
import sqlite3

conn = sqlite3.connect("test.db")
conn.execute(
    """CREATE TABLE COMPANY
    (ID INT PRIMARY KEY NOT NULL,
    NAME TEXT NOT NULL,
    AGE INT NOT NULL,
    ADDRESS CHAR(50),
    SALARY REAL);"""
)
conn.execute(
    "INSERT INTO COMPANY (ID,NAME,AGE,ADDRESS,SALARY) \
    VALUES (1, 'Paul', 32, 'California', 20000.00 )"
)
conn.execute(
    "INSERT INTO COMPANY (ID,NAME,AGE,ADDRESS,SALARY) \
    VALUES (2, 'Allen', 25, 'Texas', 15000.00 )"
)
conn.execute(
    "INSERT INTO COMPANY (ID,NAME,AGE,ADDRESS,SALARY) \
    VALUES (3, 'Teddy', 23, 'Norway', 20000.00 )"
)
conn.execute(
    "INSERT INTO COMPANY (ID,NAME,AGE,ADDRESS,SALARY) \
    VALUES (4, 'Mark', 25, 'Rich-Mond ', 65000.00 )"
)
conn.commit()
print("Records created successfully")
conn.close()
You can run python load_sqlite.py to create a local test.db which contains a table.
app.py
import sqlite3
from sqlite3 import Connection

import pandas as pd
import streamlit as st


@st.cache(allow_output_mutation=True)
def get_connection(path):
    """Put the connection in cache to reuse if path does not change."""
    return sqlite3.connect(path)


@st.cache(hash_funcs={Connection: id})
def get_data(engine):
    """If the Connection.id is the same as before, use the cached dataframe"""
    sql = """
    SELECT * FROM COMPANY;
    """
    df = pd.read_sql(sql, con=engine)
    return df


engine = get_connection("test.db")
data = get_data(engine)
st.write(data)
Run with streamlit run app.py.
- On NameError: name 'DBConnection' is not defined: you need to import it in your Python script; it's an object/class. In my app.py, if I remove from sqlite3 import Connection, I get NameError: name 'Connection' is not defined.
- The @st.cache(hash_funcs={Connection: id}) means: Streamlit, each time you come across a Connection object in the cached method, when you compute the cache key, use the id value of the Connection, and don't try to use the rest of the Connection object as a cache key, because it might change dynamically given other factors, like if there's a Thread object within the Connection that we don't know how to hash. Freely replace Connection by DBConnection in your case.
- You need to look for this DBConnection depending on the SQL client library you are using under SQLAlchemy. I only looked for from psycopg2.engine import DBConnection for PostgreSQL and could not find it. What SQL client + SQL DB are you using?
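(Editor's note: to make the hash_funcs={Connection: id} idea concrete, here is a plain-Python sketch, with no Streamlit, of what such a mapping buys you when building a cache key. This is an illustration of the mechanism, not Streamlit's actual implementation: the unhashable connection is replaced by its id, so repeated calls with the same connection produce the same key.)

```python
import sqlite3

# type -> function producing a hashable stand-in for instances of that type
hash_funcs = {sqlite3.Connection: id}

def cache_key(*args):
    """Build a cache key, substituting objects covered by hash_funcs."""
    parts = []
    for arg in args:
        for typ, fn in hash_funcs.items():
            if isinstance(arg, typ):
                parts.append(fn(arg))  # e.g. id(connection)
                break
        else:
            parts.append(hash(arg))
    return tuple(parts)

conn = sqlite3.connect(":memory:")
# Same connection + same query -> identical key -> cache hit
print(cache_key(conn, "SELECT 1;") == cache_key(conn, "SELECT 1;"))  # True
```

A different connection object would yield a different id and hence a different key, which is exactly the invalidation behavior you want.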
Wow, @andfanilo. This is such an amazing post, and that example you gave perfectly captures the new hash_funcs workflow for database connections.
@ccoggins: The suggestion from @andfanilo is exactly correct. Please let us know if hash_funcs={Connection: id} works for you. If so, we will consider making this the default behavior for sqlite3.Connection.
If this suggestion doesn't work, please know that the next version of Streamlit will include additional information so that we can diagnose and fix this problem.
Thank you for using Streamlit. We're excited to get this use-case working for you!
Sorry for the delayed response! @andfanilo your explanation of hash functions is great.
Being fully transparent, I don't remember when I did it, but I did get it working. It's still a little over my head so I could have it wrong… but I believe I'm mirroring your implementation using sqlalchemy. Here's a toy version of my implementation:
import pandas as pd
import sqlalchemy
import streamlit as st


@st.cache(allow_output_mutation=True)
def connect_engine(url):
    engine = sqlalchemy.create_engine(url)
    return engine


@st.cache(hash_funcs={sqlalchemy.engine.base.Engine: id})
def get_data(engine):
    sql = """
    SELECT * FROM table;
    """
    df = pd.read_sql(sql, con=engine)
    return df


connection_url = "db_connection_string"
engine = connect_engine(connection_url)
data = get_data(engine)
st.write(data)
Great, glad you got it to work!
Good, now I know I can reuse that example everywhere!
Great, @ccoggins! This is a very common use-case and we'd like to simplify it further in future releases. In the meantime though, that's the exact right approach, and I'm so glad this worked for you.
Thanks for the kind words, @andfanilo!
Hey @ccoggins,
Wanted to give you a quick update that we created a new GitHub issue to update the default behavior Adrien was mentioning and create a hash_func for type _thread._local.
This has been bothering me for so long. Thanks for helping me solve my problem!
Thanks so much for complementing the super helpful @andfanilo example with the sqlalchemy variant; between the two of you, you really helped me move beyond the hash challenge of the db connection as a Streamlit newbie.
Well, almost a year later, let me mark this as the solution. Happy Streamlit-ing™
Just chiming in with another data point. In my case for sqlite3, it works after I replace
@st.cache(hash_funcs={DBConnection: id})
with
@st.cache(hash_funcs={sqlite3.Connection: id})
@ccoggins, you saved my life! This unhashable error was killing me. My company, which shall remain nameless to save it from public humiliation, is using ColdFusion to develop its web apps and I simply can't abide by it! Streamlit is my only hope and you have helped keep hope alive, my friend!
For psycopg2 you can use
@st.cache(hash_funcs={psycopg2.extensions.connection: id})
Thanks! That worked for me as well
For psycopg2 you can use:
import streamlit as st
import psycopg2
import pandas as pd


# Initialize connection.
# Uses st.experimental_singleton to only run once.
@st.experimental_singleton()
def get_connection():
    return psycopg2.connect(**st.secrets["postgres"])


@st.cache(hash_funcs={psycopg2.extensions.connection: id})
def get_data(engine):
    """If the Connection.id is the same as before, use the cached dataframe"""
    query = 'select "Email" from emails_only;'
    data = pd.read_sql(query, engine)
    return data


def main():
    engine = get_connection()
    data = get_data(engine)
    st.write(data)


if __name__ == "__main__":
    main()
Hope it helps.
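(Editor's note: for intuition, st.experimental_singleton behaves like a memoized factory that always hands back the same object. A rough stdlib analogue of that behavior, not Streamlit's actual implementation:)

```python
import functools

@functools.lru_cache(maxsize=None)
def get_resource():
    # First call creates the resource; every later call returns
    # the exact same cached object.
    return object()

print(get_resource() is get_resource())  # True
```

This is why it suits connections: the expensive connect happens once, and every rerun of the script reuses the same live object.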