OnScriptRunnerEvent: ScriptRunnerEvent.SCRIPT_STARTED takes FOREVER

Hey,

For some reason, from the moment I click a select box, it takes at least 2 minutes for the app to react.
As you can see in the log, all the “massive” calculations are cached.
I can’t figure out why it takes so long until the script hits the first cache key.

Is there a way to debug it?

From the debug log:
OnScriptRunnerEvent: ScriptRunnerEvent.SCRIPT_STARTED
Cache key: dd36fd91f1a2844c667016071d224d9d-b0b1175378fdba24717396d42c9f5294
Memory cache HIT: <class 'pandas.core.frame.DataFrame'>
Cache hit: <function load_data at 0x115f73050>
Cache key: a413e7841eca8e0e1d04e16ec29312b3-efdd566dc08b76530338dc3d5c2d5dec
Memory cache HIT: <class 'pandas.core.frame.DataFrame'>
Cache hit: <function load_csv_data at 0x1180c1f80>
Cache key: e50cb8a7c15c7e25a60754561669df01-b0b1175378fdba24717396d42c9f5294
Memory cache HIT: <class 'pandas.core.frame.DataFrame'>
Cache hit: <function load_data at 0x115f73050>
Cache key: a413e7841eca8e0e1d04e16ec29312b3-efdd566dc08b76530338dc3d5c2d5dec
Memory cache HIT: <class 'pandas.core.frame.DataFrame'>
Cache hit: <function load_csv_data at 0x1180c1f80>
OnScriptRunnerEvent: ScriptRunnerEvent.SCRIPT_STOPPED_WITH_SUCCESS
OnScriptRunnerEvent: ScriptRunnerEvent.SHUTDOWN

Can you share any more of the code? It’s hard to tell what might be slow without knowing what’s running.

Otherwise, probably the best thing would be to add some logging in your code to see where the time is being spent.
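A minimal sketch of what that could look like (the timed helper below is hypothetical, not part of Streamlit; wrap it around the already-cached function so the hashing overhead is included in the measurement):

import functools
import time

def timed(fn):
    # Hypothetical helper: print wall-clock time for every call to fn.
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            print(f"{fn.__name__} took {time.perf_counter() - start:.1f}s")
    return wrapper

# e.g. load_data = timed(load_data), with load_data already decorated by @st.cache,
# so the timing covers cache lookup and hashing as well as the actual work.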

I did some profiling, but I don’t have anything to compare it with. Nearly all of the ~99 s of cumulative time sits in Streamlit’s hashing.py:

Ordered by: internal time

ncalls tottime percall cumtime percall filename:lineno(function)
56124599/55616905 11.828 0.000 13.042 0.000 {built-in method builtins.isinstance}
1022236/6 11.270 0.000 99.279 16.547 hashing.py:320(_to_bytes)
2510528/6 8.686 0.000 99.279 16.547 hashing.py:260(to_bytes)
12115469 8.633 0.000 8.633 0.000 type_util.py:50(get_fqn)
10503468 6.492 0.000 20.519 0.000 type_util.py:23(is_type)
2510528 5.789 0.000 30.028 0.000 hashing.py:193(_key)
11525699 5.476 0.000 13.581 0.000 type_util.py:57(get_fqn_type)
4162555 4.857 0.000 9.275 0.000 hashing.py:199(is_simple)
3136288 3.815 0.000 10.151 0.000 hashing.py:143(current)
3136288 3.239 0.000 4.022 0.000 weakref.py:432(get)
1691021 2.144 0.000 9.752 0.000 inspect.py:285(isroutine)
2210580/6 1.924 0.000 99.280 16.547 hashing.py:305(update)
2097659/2097219 1.721 0.000 1.847 0.000 {built-in method builtins.hasattr}
1598795 1.719 0.000 4.750 0.000 inspect.py:90(ismethoddescriptor)
3136363 1.676 0.000 2.315 0.000 threading.py:1225(current_thread)
721923 1.526 0.000 1.526 0.000 {built-in method _hashlib.new}
3382047 1.324 0.000 2.108 0.000 inspect.py:276(isbuiltin)
1022236 1.067 0.000 5.410 0.000 hashing.py:159(_is_magicmock)
2210585 1.031 0.000 1.031 0.000 {method 'update' of '_hashlib.HASH' objects}
721920 1.011 0.000 1.011 0.000 {method 'digest' of '_hashlib.HASH' objects}

Hello @amitg1,

As I see it in the debug log from your first post, there are multiple calls to load_data and load_csv_data, which appear to manipulate some DataFrames. To help you more, we’d need to know what kind of DataFrames you are manipulating, the order in which you call your different functions in your source code, how you are caching them, and which function runs the longest, which you could check by printing timestamps at the beginning and end of each function, for example.

It may be that some columns in the DataFrame take a long time to hash in the cache system, or are being hashed multiple times for no reason, which is why hashing.py is the most time-consuming entry in your second debug log. But it’s hard to prove without proper knowledge of the code you are running 🙂
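If the hashing itself is the bottleneck, the (legacy) st.cache API lets you override how a given type is hashed via hash_funcs. A sketch, assuming it is safe in your app to identify a DataFrame by its shape and column names alone:

import pandas as pd
import streamlit as st

def cheap_df_hash(df: pd.DataFrame):
    # Assumption: two DataFrames with the same shape and columns are the same data.
    # Much cheaper than hashing the full contents, but can cause stale cache hits.
    return (df.shape, tuple(df.columns))

@st.cache(hash_funcs={pd.DataFrame: cheap_df_hash})
def load_csv_data(path: str) -> pd.DataFrame:
    return pd.read_csv(path)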

Would you be able to share a small reproducible example so we’re able to provide better advice?
Your Streamlit version, OS, web browser, and Python version may also prove useful later on, but for now, knowledge of what you are trying to achieve is more important for us!

Regards,
Fanilo


So,

it seems it takes about 30 seconds to generate the cache in memory:
Creating new mem_cache (key=d2f33ded8cdb4a109f84a184dc8127de, max_entries=inf, ttl=inf)
and another 30 seconds to insert into it once my query is finished.
Later on, when I want to use the same cache:
Cache key: 60b91bb6469e4ba8cac0dadbeecee70d-d2f33ded8cdb4a109f84a184dc8127de
Memory cache HIT: <class 'pandas.core.frame.DataFrame'>
Cache hit: <function load_data at 0x11af193b0>

But after the cache hit, it takes Streamlit about 30 seconds to show the DataFrame on screen.
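One thing worth ruling out (an assumption, not something visible in the log): the legacy st.cache re-hashes the returned object on every access to detect mutations, so a big DataFrame can cost roughly the same time on a cache hit as on a miss. allow_output_mutation=True skips that check; a sketch:

import pandas as pd
import streamlit as st

# Skips hashing the return value on each access; in exchange, Streamlit will no
# longer warn you if you mutate the cached DataFrame between reruns.
@st.cache(allow_output_mutation=True)
def load_data(sql: str) -> pd.DataFrame:
    return pd.read_sql_query(sql=sql, con=Session.get_bind())  # Session as in the app below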

This is the function with the issue:

@st.cache
def load_data(sql: str) -> pd.DataFrame:
    """
    wrapper around pd.read_sql_query for caching and getting a session bind
    @param sql:
    @return: pd.DataFrame
    """
    with st.spinner('Loading Data...'):
        df = pd.read_sql_query(sql=sql, con=Session.get_bind())
        return df

Usually, the sql input str is SQLAlchemy-generated:

Query([Model]).statement (which compiles to a SELECT * from a table with one row)

where Model is:

import os

from sqlalchemy import Column, String, Integer, Date, Sequence, DateTime, func
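from sqlalchemy.ext.declarative import declarative_base

# Assumption: Base is the project's declarative base, defined roughly like this.
Base = declarative_base()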

class Model(Base):
    __tablename__ = f"{os.getenv('environment')}_dim_value_tests_models"
    __table_args__ = {'schema': 'deml'}
    NAME = "Model"

    model_id = Column(Integer, Sequence(f"{os.getenv('environment')}_model_id",
                                        metadata=Base.metadata), primary_key=True, autoincrement=True)
    version = Column(String(10))
    model_type = Column(String(20))
    train_date = Column(Date)
    model_owner = Column(String(30))
    comment = Column(String(30))
    model_owner_email = Column(String(30))
    insert_ts = Column(DateTime, default=func.now())

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
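If I read this right, load_data is actually receiving the Select object itself rather than a plain string (pd.read_sql_query accepts both), despite the str type hint. A sketch of the call path, using the Model above:

from sqlalchemy.orm import Query

stmt = Query([Model]).statement  # a sqlalchemy.sql.selectable.Select, not a str
df = load_data(stmt)             # Streamlit must hash the Select to build the cache key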

Did anyone have a chance to check it out?

Okay!
Got some progress.
It seems the sqlalchemy.sql.selectable.Select object is causing all the mess.
Currently testing:

@st.cache(hash_funcs={sqlalchemy.sql.selectable.Select: hash})
def load_data(sql: sqlalchemy.sql.selectable.Select) -> pd.DataFrame:
    """
    wrapper around pd.read_sql_query for caching and getting a session bind
    @param sql: sql query
    @return: pd.DataFrame
    """
    logger.info("Running sql query...")

    df = pd.read_sql_query(sql=sql, con=Session.get_bind())
    return df

So…
the app is running a lot faster, but I’m not getting cache hits while using the above function.
I also tested:
@st.cache(hash_funcs={sqlalchemy.sql.selectable.Select: hash, Engine: lambda _: None})
Maybe something with the engine hash?

HELP


Hi @amitg1, I drafted something here so you can try it too.

See you there!
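For readers landing here later: a likely reason for the cache misses is that hash() on a Select falls back to object identity, so a statement rebuilt on every rerun gets a fresh hash and never matches the cached key. One alternative is hashing the compiled SQL text instead; a sketch, assuming your statements compile cleanly with inlined literals:

import pandas as pd
import sqlalchemy
import streamlit as st

def hash_select(stmt) -> str:
    # Key the cache on the rendered SQL (with bound values inlined) so that
    # equivalent statements built on different reruns hash the same.
    return str(stmt.compile(compile_kwargs={"literal_binds": True}))

@st.cache(hash_funcs={sqlalchemy.sql.selectable.Select: hash_select})
def load_data(sql: sqlalchemy.sql.selectable.Select) -> pd.DataFrame:
    return pd.read_sql_query(sql=sql, con=Session.get_bind())  # Session as above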
