Slow update of data and image loading

Summary

I have created an app that loads a CSV file hosted on GitHub and five images hosted on ImageKit. Every day the CSV file is updated with new data and new images are uploaded to ImageKit.

Steps to reproduce

After new data is added to the CSV stored on GitHub, the Streamlit app does not load the latest version of the file. Sometimes stopping and restarting the app helps. Second, the image loading is very slow (5 images, about 400 kB).

Code snippet:

import datetime
from datetime import date

import numpy as np
import pandas as pd
import streamlit as st
import streamlit.components.v1 as components

st.set_page_config(layout="wide")

st.title('Meteor Scattering')


col1, col2 = st.columns([2,2], gap="medium")



DATE_COLUMN = 'date/time'
#DATA_URL = ('https://s3-us-west-2.amazonaws.com/'
            #'streamlit-demo-data/uber-raw-data-sep14.csv.gz')
DATA_URL = ("https://raw.githubusercontent.com/snowformatics/SuperMeteor/master/supermeteor/test.csv")
DATA_URL_top5 = ("https://raw.githubusercontent.com/snowformatics/SuperMeteor/master/supermeteor/image_out.csv")

#DATA_URL = ("test.csv")

today = date.today()

f = '%Y-%m-%d'
f2 = '%H%M%S'
f3 = f + f2
pd.set_option('display.max_columns', None)


def get_top5_meteors(data):
    data['h'] = pd.to_numeric(data['h'], errors='coerce')
    data['w'] = pd.to_numeric(data['w'], errors='coerce')

    # Group the DataFrame by 'date'
    grouped_df = data.groupby('date')

    # Extract the five largest objects per date, ranked by 'w' (with 'h' as tiebreaker)
    largest_objects_per_day = grouped_df.apply(lambda x: x.nlargest(5, ['w', 'h'])).reset_index(drop=True)
    return largest_objects_per_day

@st.cache_data
def load_data(nrows):  # note: nrows is accepted but never used
    data = pd.read_csv(DATA_URL, delimiter='\t', dtype={'time': str})

    data['date'] = data["timestemp"].str.slice(stop=10)
    data['date'] = pd.to_datetime(data['date'], format=f)
    data['time'] = pd.to_datetime(data['time'], format=f2)
    data['date'] = data['date'].astype(str)
    data['time'] = data['time'].astype(str)
    data['time'] = data['time'].str.slice(10)

    data[DATE_COLUMN] = pd.to_datetime(data['date'].astype(str) +
                                          data['time'].astype(str))

    return data


def ChangeWidgetFontSize(wgt_txt, wch_font_size='12px'):
    # Hack: inject JS into the parent page and resize any element whose
    # visible text exactly matches wgt_txt.
    htmlstr = """<script>var elements = window.parent.document.querySelectorAll('*'), i;
                    for (i = 0; i < elements.length; ++i) { if (elements[i].innerText == |wgt_txt|) 
                        { elements[i].style.fontSize='""" + wch_font_size + """';} } </script>  """

    htmlstr = htmlstr.replace('|wgt_txt|', "'" + wgt_txt + "'")
    components.html(f"{htmlstr}", height=0, width=0)


#data_load_state = st.text('Loading data...')
data = load_data(10000)
largest_objects_per_day = get_top5_meteors(data)

#print (largest_objects_per_day)

with col1:
    date_input1 = st.date_input(
        "Choose a date",
        today - datetime.timedelta(days=1))  # default to yesterday; today.day - 1 would break on the 1st of a month
    # Filter by date
    st.subheader('Number of meteors by hour')

    data2 = data[data['date'] == str(date_input1)]
    df_stats2 = data2[['h', 'w']].describe(include="all").transpose()
    if st.checkbox('Show raw data', key=2):
        st.subheader('Raw data')
        st.write(data2)
    if st.checkbox('Show statistics', key=3):
        st.subheader('Statistics')
        st.write(df_stats2)
    hist_values1 = np.histogram(data2[DATE_COLUMN].dt.hour, bins=24, range=(0,24))[0]
    st.bar_chart(hist_values1)

with col2:
    st.subheader('Top 5 Meteor images')
    top_meteors = pd.read_csv(DATA_URL_top5, header=None)
    top_meteors.columns = ['url']
    top_meteor_list = []
    for index, row in data2.iterrows():
        id_all = row['image_file'][0:25]
        for index1, row1 in top_meteors.iterrows():
            id_top5 = row1['url'].split('/')[4][0:25]
            if id_all == id_top5:
                if id_top5 not in top_meteor_list:
                    top_meteor_list.append(id_top5)
                    st.image(row1['url'],width=600)

ChangeWidgetFontSize('Show raw data', '22px')
ChangeWidgetFontSize('Show statistics', '22px')
ChangeWidgetFontSize('Choose a date', '22px')


Expected behavior:

The app should show the latest data as soon as the CSV file on GitHub gets updated.

Actual behavior:
Data and images are not updated.

Debug info

  • Streamlit version: 1.24.0
  • Python version: 3.9
  • Environment: Conda
  • OS: Windows 10
  • Browser version: Chrome 114.0.5735.90

Notes

Data are available in the CSV file up to 28.6.2023, but the app only shows data up to 26.6.2023. Select a date on or before 26.6.2023 to see some data.
Thanks

Hi @snowformatics,

Thanks for posting and welcome to the Streamlit Community Forum!

One thing that stands out from your code is the use of st.cache_data in your load_data function.

st.cache_data caches the return value keyed on the function's arguments. Since you call load_data with the same argument (10000) on every rerun, it keeps returning what is cached rather than fetching new data from GitHub. Here’s where you’re calling this:

#data_load_state = st.text('Loading data...')
data = load_data(10000)
largest_objects_per_day = get_top5_meteors(data)

Since you want the app to fetch fresh data every day, you can use the ttl param in st.cache_data like so:

@st.cache_data(ttl="24h")
def load_data(nrows):
    data = pd.read_csv(DATA_URL, delimiter='\t', dtype={'time':str})
    data['date'] = data["timestemp"].str.slice(stop=10)
    data['date'] = pd.to_datetime(data['date'], format=f)
    data['time'] = pd.to_datetime(data['time'], format=f2)
    data['date'] = data['date'].astype(str)
    data['time'] = data['time'].astype(str)
    data['time'] = data['time'].str.slice(10)

    data[DATE_COLUMN] = pd.to_datetime(data['date'].astype(str) +
                                          data['time'].astype(str))

    return data

This way, the cached data expires every 24 hours and is fetched afresh. Just make sure the time of the initial fetch lines up with when you want the cache to expire and the data to be refetched.
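As a side note, if you ever need to force a refresh on demand (for example, right after you push a new CSV), functions decorated with st.cache_data also expose a clear() method. A minimal sketch, assuming the load_data function above:

# Bust the cache manually: the next load_data() call re-downloads the CSV
# instead of waiting for the 24-hour ttl to expire.
if st.button("Refresh data now"):
    load_data.clear()       # clears cached entries for load_data only
    # st.cache_data.clear() would clear every @st.cache_data function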

Let me know if this is of any help.

Hi @tonykip ,

thanks for your reply and sorry for my late response. I am getting this error if I try to run your code:

   link.expires = time + self.__ttl
TypeError: unsupported operand type(s) for +: 'float' and 'str'

If I just use simple caching, it responds faster:

@st.cache_data
def load_data():

Any idea why the image loading is so slow? I will try to reduce the image resolution.

Hi @snowformatics,

Try setting the ttl value using a timedelta instead of a string (the TypeError suggests your Streamlit version doesn’t accept string values for ttl), e.g. ttl=datetime.timedelta(hours=24):

@st.cache_data(ttl=datetime.timedelta(hours=24))
def load_data(nrows):
    data = pd.read_csv(DATA_URL, delimiter='\t', dtype={'time':str})
    data['date'] = data["timestemp"].str.slice(stop=10)
    data['date'] = pd.to_datetime(data['date'], format=f)
    data['time'] = pd.to_datetime(data['time'], format=f2)
    data['date'] = data['date'].astype(str)
    data['time'] = data['time'].astype(str)
    data['time'] = data['time'].str.slice(10)

    data[DATE_COLUMN] = pd.to_datetime(data['date'].astype(str) +
                                          data['time'].astype(str))

    return data

Let me know if you see any significant improvement in image loading by reducing the resolution.
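Also, since the images are hosted on ImageKit, you may not need to re-upload smaller files at all: ImageKit supports URL-based transformations, so you can request a resized version by adding a tr parameter to the image URL. A rough sketch (double-check the exact syntax against the ImageKit docs):

# Ask ImageKit for a 600 px wide rendition instead of the full-size original;
# the resize happens on their side, so the download is much smaller.
def resized(url, width=600):
    sep = '&' if '?' in url else '?'
    return f"{url}{sep}tr=w-{width}"   # e.g. ...meteor.jpg?tr=w-600

st.image(resized(row1['url']), width=600)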

You should also consider optimizing the dataframe operations in this part of your code:

for index, row in data2.iterrows():
    id_all = row['image_file'][0:25]
    for index1, row1 in top_meteors.iterrows():
        id_top5 = row1['url'].split('/')[4][0:25]
        if id_all == id_top5:
            if id_top5 not in top_meteor_list:
                top_meteor_list.append(id_top5)
                st.image(row1['url'],width=600)

The nested loop can be replaced with vectorized pandas operations, which will run much faster; see the sketch below.
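A minimal sketch of one way to do that, assuming the same 25-character ID-prefix convention your loop uses (the id column name is just for illustration):

# Build the set of 25-character IDs in today's data once, up front.
ids_today = set(data2['image_file'].str[:25])

# Derive the matching ID for each top-5 URL with vectorized string ops.
top_meteors['id'] = top_meteors['url'].str.split('/').str[4].str[:25]

# Keep each matching image once, then display.
matches = top_meteors[top_meteors['id'].isin(ids_today)].drop_duplicates('id')
for url in matches['url']:
    st.image(url, width=600)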

Thanks @tonykip! It’s working now and is nicely responsive.
