New experimental primitives for caching

Help us test the latest evolution of st.cache!

Posted in Announcement, September 22, 2021

Part of what makes Streamlit such a joy to use is its unique execution model: your code just executes from top to bottom like a simple script. No need to think about models, views, controllers, or anything of the sort. And what ties the whole thing together is a powerful primitive called st.cache. This is a decorator that allows you to skip long computations whenever your code re-executes.

However, over time we found that st.cache was the source of much confusion in the community. Our users would often be faced with cryptic errors like InternalHashError or UnhashableTypeError and dizzying solutions involving the likes of hash_funcs and allow_output_mutation.

So we set out to fix this!

Problems with st.cache

First, we decided to understand how st.cache was being used in the wild. A detailed analysis of open-source Streamlit apps indicated that st.cache was serving the following use-cases:

  1. Storing computation results given different kinds of inputs. In Computer Science literature, this is called memoization.
  2. Initializing an object exactly once, and reusing that same instance on each rerun for the Streamlit server's lifetime. This is called the singleton pattern.
  3. Storing global state to be shared and modified across multiple Streamlit sessions (and, since Streamlit is threaded, you need to pay special attention to thread-safety).
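For readers less familiar with these terms, the first two patterns can be sketched in a few lines of plain Python (illustrative only, not Streamlit code):

```python
import functools

# Use-case 1: memoization -- cache results keyed by the inputs.
@functools.lru_cache(maxsize=None)
def square(n):
    return n * n

# Use-case 2: the singleton pattern -- create the object once,
# then reuse the very same instance on every later call.
_connection = None

def get_connection():
    global _connection
    if _connection is None:
        _connection = object()  # stand-in for e.g. a database connection
    return _connection
```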

This led us to wonder whether st.cache's complexity could be a product of it trying to cover too many use-cases under a single unified API.

To test this hypothesis, today we are introducing two specialized Streamlit commands covering the most common use-cases above (memoization and singletons). We have used these commands ourselves to replace st.cache in several Streamlit apps, and we're finding them truly amazing.

We'd like to invite all of you in our amazing community to try out these two commands and tell us what you think.

Solution: st.experimental_memo and st.experimental_singleton

Let's examine how these primitives work.

st.experimental_memo

Use this to store the results of expensive computations that can be "cached" or "memoized" in the traditional sense. It has almost exactly the same API as the existing st.cache, so you can often just swap one for the other:

import streamlit as st

@st.experimental_memo
def factorial(n):
    if n < 1:
        return 1
    return n * factorial(n - 1)

f10 = factorial(10)
f9 = factorial(9)  # Returns instantly: it was cached while computing factorial(10)!

Properties:

  • Unlike st.cache, this returns cached items by value, not by reference. This means that you no longer have to worry about accidentally mutating the items stored in the cache. Behind the scenes, this is done by using Python's pickle module to serialize and deserialize cached values.
  • Although this uses a custom hashing solution for generating cache keys (like st.cache), it does not use hash_funcs as an escape hatch for unhashable parameters. Instead, we let you ignore unhashable parameters (e.g. database connections) by simply prefixing them with an underscore.

For example:

import pandas as pd
import streamlit as st

# RNA is this app's SQLAlchemy model class, defined elsewhere.

@st.experimental_memo
def get_page(_sessionmaker, page_size, page):
    """Retrieve rows from the RNA database, and cache them.

    Parameters
    ----------
    _sessionmaker : a SQLAlchemy session factory. Because this arg name is
                    prefixed with "_", it won't be hashed.
    page_size : the number of rows in a page of results
    page : the page number to retrieve

    Returns
    -------
    pandas.DataFrame
        A DataFrame containing the retrieved rows. Mutating it won't affect
        the cache.
    """
    with _sessionmaker() as session:
        query = (
            session
                .query(RNA.id, RNA.seq_short, RNA.seq_long, RNA.len, RNA.upi)
                .order_by(RNA.id)
                .offset(page_size * page)
                .limit(page_size)
        )

        return pd.read_sql(query.statement, query.session.bind)

For more information, check out our documentation on hash_funcs.
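The by-value behavior described above can be mimicked in plain Python with pickle. This is only a sketch of the idea, not Streamlit's actual implementation:

```python
import pickle

_cache = {}

def memo_by_value(func):
    """Memoize func, but return a fresh copy of the cached value on every call."""
    def wrapper(*args):
        key = (func.__name__, args)
        if key not in _cache:
            _cache[key] = pickle.dumps(func(*args))  # serialize once
        return pickle.loads(_cache[key])             # every caller gets a copy
    return wrapper

@memo_by_value
def make_list(n):
    return list(range(n))

a = make_list(3)
a.append(99)       # mutating the returned copy...
b = make_list(3)   # ...does not corrupt the cached value
```

Because each call deserializes a fresh copy, callers can freely mutate the result without affecting later cache hits.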

st.experimental_singleton

This is a key-value store that's shared across all sessions of a Streamlit app. This is great for storing heavyweight singleton objects across sessions (like TensorFlow/Torch/Keras sessions and/or database connections).

import streamlit as st
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

@st.experimental_singleton
def get_db_sessionmaker():
    # This is for illustration purposes only
    DB_URL = "your-db-url"
    engine = create_engine(DB_URL)
    return sessionmaker(engine)

dbsm = get_db_sessionmaker()

How this compares to st.cache:

  • Like st.cache, this returns items by reference.
  • You can return any object type.
  • Unlike st.cache, this decorator does not have additional logic to check whether you are unexpectedly mutating the cached object. That logic was slow and produced confusing error messages. So, instead, we're hoping that by calling this decorator "singleton", we're nudging you toward the correct behavior.
  • You don't have to worry about hash_funcs! Instead, just prefix your arguments with an underscore to ignore them.
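To make the by-reference and thread-safety points concrete, here is a minimal plain-Python sketch of a singleton decorator (illustrative only, not Streamlit's implementation):

```python
import threading

_instances = {}
_lock = threading.Lock()

def singleton(func):
    """Call func at most once; every caller gets the very same object back."""
    def wrapper():
        with _lock:  # Streamlit apps are threaded, so guard concurrent creation
            if func not in _instances:
                _instances[func] = func()
        return _instances[func]
    return wrapper

@singleton
def get_shared_counter():
    return {"count": 0}  # shared by reference across all callers

c1 = get_shared_counter()
c2 = get_shared_counter()  # same object, not a copy
```

Because every caller holds a reference to the same object, a mutation made in one session is visible in all the others, which is exactly why you must think about thread safety when mutating singletons.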
|                                 | st.cache                                                          | st.memo | st.singleton                                     |
|---------------------------------|-------------------------------------------------------------------|---------|--------------------------------------------------|
| Returned items are stored by... | reference                                                         | value   | reference                                        |
| Follows computation graph       | yes                                                               | no      | no                                               |
| Supports mutating cached items  | Yes, with a special flag. But you must worry about thread safety. | no      | Yes, but you must worry about thread safety.     |

When should I use st.experimental_memo vs st.experimental_singleton?

We recommend using the following rule of thumb for these primitives:

  • Use st.experimental_singleton for storing non-serializable objects, like TF sessions and/or DB connections, that are created once and used multiple times.
  • Use st.experimental_memo for storing the results of repeated computations on serializable objects: dataframes, data objects, etc.
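A quick heuristic that follows from this rule of thumb: if a value survives pickling, it is a candidate for st.experimental_memo; otherwise, reach for st.experimental_singleton. A small sketch:

```python
import pickle
import threading

def is_serializable(obj):
    """Rough check: values cached by st.experimental_memo must survive pickling."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False

is_serializable({"rows": [1, 2, 3]})  # True  -> data, use st.experimental_memo
is_serializable(threading.Lock())     # False -> stateful resource, use st.experimental_singleton
```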

Reminder about our experimental process

The commands we're introducing today are experimental, and therefore governed by our experimental API process. This means, among other things:

  1. We reserve the right to change these APIs at any time. Indeed, that's the whole point of the experiment. 😉
  2. To make this clear, the names of these new commands start with "experimental_".
  3. If/when these commands graduate to our stable API, the "experimental_" prefix will be removed.

Wrapping up

These specialized memoization and singleton commands represent a big step in Streamlit's evolution, with the potential to entirely replace st.cache at some point in 2022. So please help us out by testing these commands in real apps and leaving comments in the Streamlit forums.

As usual, you can upgrade using the following command:

pip install --upgrade streamlit

Looking forward to hearing from all of you. Come by the forum or Twitter to share all the cool things you make! 🎈


This is a companion discussion topic for the original entry at https://blog.streamlit.io/new-experimental-primitives-for-caching/

What legends you guys are! The singleton API came just in time to launch an internal app I built at work!
Awesome work guys, works like a charm!


so much love for the st.experimental_singleton!! performance gain like :rocket: speed. Many thanks

Hi, I’m an author of streamlit-webrtc.

I took a glance at these new caching mechanisms, but I think they cannot meet my current needs.

I’m using st.cache to keep the object identity returned from a factory function for identical input arguments over reruns.

My actual code is here: streamlit-webrtc/factory.py at e351066c5887e7d7ebeadda06c7d12c95b4bac24 · whitphx/streamlit-webrtc · GitHub

Example pseudo code is below:


obj_a = ... # Assume that the identity of this object is not changed over reruns until some event occurs. This is done by creating obj_a through a function wrapped by @st.cache or @st.singleton, or storing the object in the session_state.

obj_b = factory(obj_a) # !! How to memoize the factory function? !!

print(id(obj_b)) # I want to keep this identity of obj_b unchanged over reruns as long as id(obj_a) does not change.

st.memo cannot be used because it uses pickle(), so it does not seem to preserve the returned object's identity. In my case, additionally, obj_b is an instance of a C-extension class, so it cannot be pickled.

st.singleton cannot be used either, because it cannot refresh the output identity when the input identity changes.

In other words, I want a Python version of ReactJS's React.memo(), which accepts an arbitrary dependency list and returns an identical object.
To do it, I'm using st.cache with an ugly hack in streamlit-webrtc.
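For illustration, a dependency-keyed cache like the one described might be roughly sketched in plain Python as follows (all names here are placeholders, not a proposed API):

```python
_identity_cache = {}

def memo_by_identity(factory, dep):
    """Return the same obj_b as long as the identity of dep is unchanged.

    A rough sketch of a React.memo-style dependency cache. `factory` and
    `dep` are hypothetical placeholders; no pickling is involved, so the
    cached object's identity is preserved across calls.
    """
    key = (factory, id(dep))
    if key not in _identity_cache:
        _identity_cache.clear()  # the dependency changed: drop the stale instance
        _identity_cache[key] = factory(dep)
    return _identity_cache[key]

class Wrapper:  # stand-in for an unpicklable C-extension object
    def __init__(self, inner):
        self.inner = inner

obj_a = object()
obj_b1 = memo_by_identity(Wrapper, obj_a)
obj_b2 = memo_by_identity(Wrapper, obj_a)  # same dependency identity: cache hit
```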


Can you create a picklable proxy object for the obj that's not picklable, and hold the proxy in session state so it survives reruns? The proxy can be a guid which is used as a key to retrieve the actual obj value. When obj_a changes you generate a new guid, store it in session state, and pass it to the factory function, which will give you a new obj_b, which you also store using the guid as a key. Wherever you want obj_b you need to resolve its value via a get(). Just an idea :slight_smile:


Thank you very much for your suggestion!
I think it works;
however, using a proxy object is essentially the same as the hack I'm using in my current implementation with st.cache and its hash_funcs.

I hope it will be realized with a built-in solution :slight_smile:

Pretty cool that you are still improving the caching functionality, it’s one of the best features of streamlit in my opinion :smiley:

Have you ever thought about some kind of client side caching as well? I regularly work with pretty big dataframes and potentially there is some performance to gain there as well.
