Streamlit crashes when using LevelDB

Hello everyone,

I need to use LevelDB with Streamlit, via the plyvel wrapper.

LevelDB supports multithreaded access, but it does not support multiprocessing.
It is extremely fast and quite commonly used to store large datasets.

Is there a way to make it work inside streamlit?

I tried to use the cache mechanism, but it doesn’t change the result:

@st.cache
def get_db(dataset_root):
    # LevelDB is my own wrapper class around plyvel
    db = LevelDB.get_instance(dataset_root)
    return db

Hi @luca,

I’m trying to recreate your issue but I’m unable to.

Could you provide a full code example that I can run?

The following works fine for me:

import streamlit as st
import plyvel

db = plyvel.DB('/tmp/testdb/', create_if_missing=True)

db.put(b'key', b'value')

st.write(db.get(b'key'))

Hi @Jonathan_Rhone,

Thank you for the reply!

The problem appears when either of the following happens:

  • There are interactive widgets
  • There are multiple users

Here is a minimal example that reproduces it: just click the checkbox while the progress bar is still filling up:

import streamlit as st
import plyvel

db = plyvel.DB('/tmp/testdb/', create_if_missing=True)

db.put(b'key', b'value')

st.checkbox('make it crash')

num = 100000
p = st.progress(0)
for x in range(num):
    # the rerun triggered by clicking the checkbox crashes the app here
    a = db.get(b'key')
    p.progress(int(x/num * 100))

I investigated the problem a bit; I fear the cause is that LevelDB doesn’t support multiprocessing, and Streamlit probably uses separate processes to manage multiple users and interactive widgets.

Meanwhile, I’m building a wrapper around plyvel that can access the DB either directly or through a REST API on a local backend server, to avoid the locking problems.
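For reference, here is a rough sketch of the idea (the class shape, the use_rest flag, and the /get endpoint are placeholders, not the final API):

import plyvel
import requests

class LevelDB:
    """Sketch: access the DB directly, or through a local REST backend
    that owns the single plyvel handle (endpoint names are placeholders)."""

    def __init__(self, dataset_root, use_rest=False,
                 server_url='http://localhost:8000'):
        self.use_rest = use_rest
        if use_rest:
            # the backend process is the only one holding the LevelDB lock
            self.server_url = server_url
        else:
            self.db = plyvel.DB(dataset_root, create_if_missing=True)

    def get(self, key: bytes) -> bytes:
        if self.use_rest:
            r = requests.get(self.server_url + '/get',
                             params={'key': key.hex()})
            r.raise_for_status()
            return bytes.fromhex(r.json()['value'])
        return self.db.get(key)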

Let me know if there are more efficient solutions!


It seems like the current process isn’t properly killed when the widget interaction starts a new one.


Hey @luca,

Thanks for the snippet!

I’ve resolved the issue using st.cache with hash_funcs:

import streamlit as st
import plyvel
import time

# hash the DB handle by its id so st.cache can store and reuse it
@st.cache(hash_funcs={plyvel._plyvel.DB: id})
def get_db():
    return plyvel.DB('/tmp/testdb/', create_if_missing=True)

db = get_db()

db.put(b'key', b'value')

st.checkbox('make it not crash :)')

num = 20

p = st.progress(0)

for x in range(num+1):
    time.sleep(.1)
    a = db.get(b'key')
    p.progress(int(x/num * 100))

If you encounter any further issues please reach out!

Hi @Jonathan_Rhone, thank you very much!
I confirm that the code works as expected!


Although I’m not sure I fully understand how the cache mechanism works. I thought that since the function doesn’t have any parameters it would be called only once, the first time.

Is it doing some internal check to see if the returned object is mutated, and in that case returning a new object?

The same code even works with my wrapper, if I set hash_funcs={LevelDB: id} or allow_output_mutation=True :slight_smile:

Hi @luca,

I thought that since the function doesn’t have any parameters it would be called only once, the first time

Sorry, I’m not sure I understand what you mean here. Are you referring to the get_db function? It will be called on the first run of the report, after which we’ll return the plyvel.DB() connection from the cache whenever it’s called again.
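A quick way to see this for yourself (just a sketch): the function body only executes on the first run, so the value below stays constant across reruns and widget interactions:

import time
import streamlit as st

@st.cache
def created_at():
    # runs only once; later calls return the cached value
    return time.time()

st.write(created_at())  # same timestamp on every rerun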

Is it doing some internal check to see if the returned object is mutated, and in case return a new object?

Previous versions of Streamlit did this. As of v0.53.0 we still run this internal check, but we display a warning and return the cached version of the object instead of re-running the function and returning a new object.

https://github.com/streamlit/streamlit/blob/0.53.0/lib/streamlit/caching.py#L286
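To make the mutation check concrete, here is a minimal sketch of what triggers the warning:

import streamlit as st

@st.cache
def get_items():
    return [1, 2, 3]

items = get_items()
items.append(4)  # mutates the cached object in place

# On the next run the output hash no longer matches what was stored, so
# Streamlit warns that the return value was mutated and hands back the
# cached object instead of re-running get_items().
st.write(items)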

The same code even works with my wrapper, if I set hash_funcs={LevelDB: id} or allow_output_mutation=True :slight_smile:

I believe we disable hashing of the output when you set allow_output_mutation to True, which in this case removes the need to use hash_funcs for hashing the plyvel.DB instance. However, if you wanted to pass this db instance to another cached function as an input parameter, or to use the instance in the body of a cached function (rather than as its return value), you would need hash_funcs, since allow_output_mutation would not help in those scenarios; see the sketch below. I would stick with hash_funcs either way, as allow_output_mutation would be a hack here rather than its primary use case :slight_smile:

https://github.com/streamlit/streamlit/blob/0.53.0/lib/streamlit/caching.py#L373
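For example, passing the connection into a second cached function as an input parameter needs hash_funcs on that function too (a sketch; lookup is just an illustrative name):

import streamlit as st
import plyvel

@st.cache(hash_funcs={plyvel._plyvel.DB: id})
def get_db():
    return plyvel.DB('/tmp/testdb/', create_if_missing=True)

# The db parameter must be hashable as well, so this cached function also
# needs a hash_funcs entry; allow_output_mutation on get_db wouldn't help.
@st.cache(hash_funcs={plyvel._plyvel.DB: id})
def lookup(db, key):
    return db.get(key)

db = get_db()
db.put(b'key', b'value')
st.write(lookup(db, b'key'))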


Hi,

Thank you very much for your kind reply! :blush:


Hey guys,

I think I have a similar problem: I am trying to use multiprocessing with Streamlit. I’ve had no trouble using multiprocessing and Streamlit together in the past, but I can’t get it to work when I have a hashed database connection in my app.

Running my app serially works fine. Running multiprocessing without the database connection also works fine.

Also, there is no database connection inside the multiprocessing code; all of it is outside of the pool.

Database Connection:

import sqlite3
from sqlite3 import Connection

import streamlit as st

@st.cache(hash_funcs={Connection: id})
def get_connection():
    """
    Put the connection in cache to reuse if the path does not change between Streamlit reruns.
    NB: https://stackoverflow.com/questions/48218065/programmingerror-sqlite-objects-created-in-a-thread-can-only-be-used-in-that-sa
    """
    return sqlite3.connect("./database/solar_projects.db", check_same_thread=False)

Multiprocessing:

import multiprocessing as mp

st.write('multiprocessing')

# run_autolayout and scenarios are defined elsewhere in the app
p = mp.Pool(processes=2, maxtasksperchild=1)
results = p.map(run_autolayout, scenarios)
p.close()
p.join()