App over its resource limits

Hello community,

This is my first topic on the forum! Let me first thank you for your responses to other questions, which made my path to deploying apps in Streamlit much easier.

Let me also thank you for the amazing work behind this platform, which makes it possible for those of us who focus on development that isn't web related to have a place to publish our work.

I'm currently in a data science bootcamp, and we had a challenge involving machine learning models. I uploaded the models to GitHub using joblib compression to reduce their size, but they are still very big files.

I wanted to make predictions faster, so I wrapped all those files (some CSVs and the models) with st.cache, but the app still goes over the resource limits.

Here’s where I am now:

@st.cache(allow_output_mutation=True, ttl=24*3600)
def load_ci_madrid_barcelona():
    return city_instance_mb.load_model_joblib(os.path.join(abs_path, "..", "resources", "models", "model_madrid_barcelona.gz"))

@st.cache(allow_output_mutation=True, ttl=24*3600)
def load_ci_london():
    return city_instance_london.load_model_joblib(os.path.join(abs_path, "..", "resources", "models", "model_london.gz"))

@st.cache(allow_output_mutation=True, ttl=24*3600)
def load_madrid_csv():
    return os.path.join(abs_path, "..", "resources", "datasets", "madrid.csv")

@st.cache(allow_output_mutation=True, ttl=24*3600)
def load_barcelona_csv():
    return os.path.join(abs_path, "..", "resources", "datasets", "barcelona.csv")

@st.cache(allow_output_mutation=True, ttl=24*3600)
def load_london_csv():
    return os.path.join(abs_path, "..", "resources", "datasets", "london.csv")

@st.cache(allow_output_mutation=True, ttl=24*3600)
def create_instance_mb():
    d_csvs, d_names = dict(), dict()
    d_csvs["csvs1"] = [madrid, barcelona]
    d_names["names1"] = ["madrid", "barcelona"]

    return ac.airbnb(d_csvs["csvs1"], d_names["names1"], "csv")

@st.cache(allow_output_mutation=True, ttl=60)
def create_instance_london():
    d_csvs, d_names = dict(), dict()
    d_csvs["csvs2"] = [london]
    d_names["names2"] = ["london"]

    return ac.airbnb(d_csvs["csvs2"], d_names["names2"], "csv")

madrid = load_madrid_csv()
barcelona = load_barcelona_csv()
london = load_london_csv()

city_instance_mb = create_instance_mb()
city_instance_london = create_instance_london()

model_madrid_barcelona = load_ci_madrid_barcelona()
model_london = load_ci_london()

I tried different TTLs, such as 60 or 600, but neither seemed to work properly. I'm wondering if maybe I'm not understanding how st.cache works.

All the information about the project is public here, in case you need to check the entire code.

I couldn't find another topic where this is solved, and I hope I'm respecting all the forum rules.

Thanks in advance for reading this topic.

I usually recommend not mutating things in the cache. (It means the next user will not be starting from the same point.) You can use session state instead.

import streamlit as st
import pandas as pd

if 'df1' not in st.session_state:
    st.session_state['df1'] = pd.read_csv('my_big_data_file1.csv')

df1 = st.session_state['df1']

There is a newer, more efficient caching function, st.experimental_memo, but as I mentioned, since your code currently allows mutations, I'd have to look at the rest of it to see whether that makes sense. If the read_csv call itself is taking a long time, you could do both:

import streamlit as st
import pandas as pd

def get_df1():
    return pd.read_csv('my_big_data_file1.csv')

if 'df1' not in st.session_state:
    st.session_state['df1'] = get_df1()

df1 = st.session_state['df1']

Hi @mathcatsand,

Thanks for your suggestion.

We have tried both options, st.experimental_memo and st.session_state, and we're still hitting the same issue. As soon as two people open the app at the same time, boom.

Here's the full code, in case you can take a look at it.

There is a 1GB resource limit on Streamlit Cloud (it’s a free service after all). Have you run it locally and looked at your actual memory usage? I had looked at your file sizes and thought it should be okay with one user, but yeah, I’m not sure how many more you can add to that concurrently since all the data will be held in memory for each user (and in the cache).
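A quick way to get a rough local picture is to check each loaded frame's in-memory footprint with pandas (a sketch; the columns here are made up to stand in for one of your datasets):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for one of the app's loaded datasets.
df = pd.DataFrame({
    "price": np.random.default_rng(0).random(100_000),
    "city": ["madrid"] * 100_000,
})

# deep=True counts the actual bytes of object (string) columns too,
# not just the 8-byte pointers.
mb = df.memory_usage(deep=True).sum() / 1_000_000
print(f"~{mb:.1f} MB in memory")
```

Object-dtype string columns are often the biggest chunk; converting them to `category` dtype can shrink them substantially.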

Is it possible to store the data using AWS instead of using cache?

Maybe this way we can make it possible to be available for more users.

Locally it works perfectly.

We'll be publishing different apps in the following months and we need to make them available for 15 to 30 users, not necessarily using them at the same time.

What are the possible alternatives for storing the data and using it in Streamlit?

Thanks in advance for each one of your responses.

You can certainly read data from a remote source instead of copying the data files into Streamlit Cloud with your code, but doing that would get you at most one more user, I’d think. I’m guessing it’s memory usage that’s the issue more than storage and a remote file source doesn’t really change that. Have you mapped out your memory consumption to see your biggest chunks so there’s something to focus on for efficiency?
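As a sketch of the remote-source idea (the bucket URL below is hypothetical; pandas can read straight from http(s) URLs, and from `s3://` paths if the s3fs package is installed):

```python
import pandas as pd

# Hypothetical base URL for a public bucket; replace with your own storage.
DATA_BASE_URL = "https://my-bucket.s3.amazonaws.com/datasets"

def remote_csv_url(city: str) -> str:
    """Build the URL of a city's CSV in remote storage."""
    return f"{DATA_BASE_URL}/{city}.csv"

def load_city(city: str) -> pd.DataFrame:
    # pandas reads directly from http(s) URLs; for a private bucket,
    # use an "s3://bucket/key" path with s3fs and credentials instead.
    return pd.read_csv(remote_csv_url(city))
```

Keep in mind this only changes where the bytes come from; once read, the DataFrame occupies the same amount of memory.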

I haven't combed through every line, but one thing you will want to check is your caching: you don't want to cache every little step along the way, but rather gather as much as possible into a single cached result. If you are caching anything large that's only used to feed into another cached function, you might be able to trim some of it.
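For instance, instead of separately caching each path, each CSV, and each instance, the whole loading pipeline could be collapsed into one function and only that function decorated (a sketch; `build_city_instance` is a hypothetical name, and the constructor call is a placeholder for your own `ac.airbnb(...)` code):

```python
import pandas as pd

# One cached unit instead of many: in the app, decorate only this function,
# e.g. with @st.cache(ttl=24*3600) or @st.experimental_memo(ttl=24*3600).
def build_city_instance(csv_paths, names):
    """Read all CSVs and assemble the combined object in a single call."""
    frames = {name: pd.read_csv(path) for name, path in zip(names, csv_paths)}
    # ... pass `frames` into your ac.airbnb(...) constructor here and
    # return the finished instance, so only the end result is cached ...
    return frames
```

That way the intermediate CSV paths and raw frames never sit in the cache on their own.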

Also, there is a newer version of st-folium with some efficiency gains, so you might try version 0.8.0 instead of 0.7.0 (just a side thought).

This topic was automatically closed 365 days after the last reply. New replies are no longer allowed.