Memory consumption issues

Hi all.

I am starting out with this great tool, but I am having memory problems when running my script. The script seems to run without any problem, but it immediately occupies the whole RAM of the computer plus the swap (on Ubuntu). It opens the app in the browser and displays the first few lines with basic inline information, but at the moment it tries to display the plots it crashes. To see whether it was a problem related to large data sets (the full data set contains ~1 million lines), I started reading smaller and smaller subsets of the data, but the problem remains exactly the same.

Below you will find the script that I am running.

import pandas as pd
import os
import numpy as np
import datetime
import glob
import streamlit as st
import matplotlib
import matplotlib.pyplot as plt
import plotly.graph_objs as go
import plotly.express as px

st.title('APP Oriente')

@st.cache
def load_data():
    files_only = glob.glob('*.csv')
    data = pd.DataFrame()
    for i in files_only[0:100]:
        df = pd.read_csv(i)
        data = data.append(df)
    index = np.arange(0, len(data))
    data = data.set_index(index)
    to_drop = ['Bar Trend', 'Next Record',
       'Inside Temperature', 'Inside Humidity',
       '10 Min Avg Wind Speed', 'Extra Temperatures', 'Soil Temperatures', 'Leaf Temperatures',
       'Extra Humidties', 'Storm Rain', 'Start Date of current Storm',
       'Month Rain', 'Year Rain', 'Day ET', 'Month ET', 'Year ET',
       'Soil Moistures', 'Leaf Wetnesses', 'Inside Alarms', 'Rain Alarms',
       'Outside Alarms', 'Extra Temp/Hum Alarms', 'Soil & Leaf Alarms',
       'Transmitter Battery Status', 'Console Battery Voltage',
       'Forecast Icons', 'Forecast Rule number', 'Time of Sunrise',
       'Time of Sunset', '<LF> = 0x0A', '<CR> = 0x0D', 'CRC']
    data = data.drop(to_drop, axis=1)  # assign the result; a bare drop() does not modify the dataframe
    data['Tiempo Sistema'] = pd.to_datetime(data['Tiempo Sistema']) - pd.to_timedelta(5, unit='h')
    data['Rain Rate'] = data['Rain Rate']*0.2/10.  # in units of cm/hour
    data['Barometer'] = data['Barometer']/1000. + 760  # this calculation still needs to be corrected
    data['Outside Temperature'] = (data['Outside Temperature']/10. - 32.) * (5.0/9.0)
    data.loc[(data['Outside Temperature'] > 50) | (data['Outside Temperature'] < -15)] = np.nan
    return data

dt = load_data()

if st.checkbox('Mostrar datos'):
   '## Data',dt[0:30]

'## Fecha/hora',dt['Tiempo Sistema'].iloc[-1]
'## Temperatura',round(dt['Outside Temperature'].iloc[-1],2)
'## Presión',dt['Barometer'].iloc[-1]

st.header("Plotly Temperatura")

time_temp_plt = go.Scatter(x=dt['Tiempo Sistema'], y=dt['Outside Temperature'], mode = 'markers')

dt_plt = [time_temp_plt]
layout = go.Layout(xaxis_title="Date [MM YY]", yaxis_title="Temperature [C]")
fig = go.Figure(data=dt_plt, layout=layout)

st.header("Plotly Presión")

time_temp_plt = go.Scatter(x=dt['Tiempo Sistema'], y=dt['Barometer'], mode = 'markers')

dt_plt = [time_temp_plt]
layout = go.Layout(xaxis_title="Date [MM YY]", yaxis_title="Presión [hpa]")
fig = go.Figure(data=dt_plt, layout=layout)

st.write(fig)

Any thoughts?

Thanks in advance.

Hi @esiilvavilla, could you edit your post and use markdown so that indentation is maintained for your code sample? This is possible by using three backticks above and below your code, ex. ```

Could you also provide a sample of the data so that we can try to reproduce your issue? Looks like you have a number of csv files, could you provide a couple of rows from one of the files?

Hi @Jonathan_Rhone.

The post is edited, and you can download a sample of the data with this URL: shorturl.at/nopFP.

Let me know if everything works fine.

Thanks

Thanks for the updates and the sample! And welcome to the forum by the way :wave:

I was able to run your app using the sample data you provided.

Is it not working for you using just the sample data? How about with just one file?

Could you also provide the following info?

  • Streamlit version: (get it with $ streamlit version)
  • Python version: (get it with $ python --version)
  • Using Conda? PipEnv? PyEnv? Pex?
  • OS version:
  • Browser version:

Thanks for your time and work, @Jonathan_Rhone.

OK, with the test data it seems to run. I didn't check with that particular sample; I just passed you a random set of files.

Before continuing, let me give you the information you asked for:

  • Streamlit version: Streamlit, version 0.56.0
  • Python version: Python 3.6.9
  • Using Conda
  • OS version: Ubuntu 18.04 LTS
  • Browser version: Version 80.0.3987.162 (Official Build) (64-bit)

Now, I have a couple of questions:

  1. Did you see any increase in your memory consumption? Because I saw it here. I am asking because this is just a subsample of the whole data set, and it already takes a bit of time for the browser to load the information.
  2. Why, if we are reading the same data set, are the values of temperature, date/time and pressure different? (The plot seems the same, though.)
  3. Could it be that the plots themselves are taking a big chunk of memory? These are super simple plots of X vs Y, read from a plain dataframe, nothing fancy at all.

This is my screenshot:

Regarding question 2, it looks like there are some discrepancies between the sample code you posted and your current script. The one you shared was missing the st.write(fig) call for the Temperature graph. I've added it to my app and now we get similar reports.
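That is, right after building the temperature figure:

fig = go.Figure(data=dt_plt, layout=layout)
st.write(fig)  # this call was missing for the Temperature chart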

There are still some differences in the data I see, so perhaps you have some additional CSVs in your workspace that I don't have that are being loaded?

Here are my files and their lengths, as shown by len(df) inside the for loop in def load_data():

  • DatosEstacion2020-01-21.csv: 1442
  • DatosEstacion2020-01-20.csv: 1440
  • DatosEstacion2020-01-22.csv: 1440
  • DatosEstacion2020-01-23.csv: 1440
  • DatosEstacion2020-01-27.csv: 1440
  • DatosEstacion2020-01-26.csv: 1440
  • DatosEstacion2020-01-18.csv: 1440
  • DatosEstacion2020-01-30.csv: 1440
  • DatosEstacion2020-01-24.csv: 1444
  • DatosEstacion2020-01-25.csv: 1440
  • DatosEstacion2020-01-31.csv: 1439
  • DatosEstacion2020-01-19.csv: 1440
  • DatosEstacion2020-01-28.csv: 887
  • DatosEstacion2020-01-29.csv: 1349
  • DatosEstacion2020-01-17.csv: 974
  • DatosEstacion2020-01-16.csv: 491
  • DatosEstacion2020-01-13.csv: 1

Regarding questions 1 and 3, how are you inspecting memory usage in the browser? If you open the Chrome dev tools, there's a Memory tab that shows the current JS heap size for the browser tab.

Yes, I see some increase in memory usage, but not much given that the sample data is small. If you're sending a million data points to the browser, maybe it's simply using more memory than the browser can handle and crashing the tab.

The maximum memory per tab is either 2 GB or 4 GB, I believe.

If you go to the Console tab in the Chrome developer tools, you should see some output that says Protobuf: .... These entries represent the data sent for the elements in your report. Given the current code, the last one corresponds to the graph at the bottom. If you drill in a bit you can see the spec data for the chart as well as the size of the data, and that we're sending over 1.3 MB of data per chart just for this sample dataset.

Actually, I believe this output is only shown when Streamlit is installed in development mode, which means you may have to install it from source, not from PyPI.

It seems that it’s taking one to two seconds to send the 1.3MB protobuf over the websocket, which is causing that small delay in loading the graph that you’re seeing. It’s also taking a bit of time to render the graph after the data is received, so perhaps a combination of the two.

At a certain point, too many data points is simply too many data points. Is it possible for you to coalesce the data in some way? I imagine you don't need to plot the chart at the same granularity as the raw data…? See the sketch below.
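For example, here's a minimal sketch of what I mean, reusing the imports and the dt dataframe from your script (the 10-minute interval is just an arbitrary example, pick whatever granularity makes sense):

# Downsample to 10-minute means before building the chart, so far fewer
# points are sent to the browser.
plot_df = (
    dt.dropna(subset=['Tiempo Sistema'])   # skip rows whose timestamp was set to NaN
      .set_index('Tiempo Sistema')[['Outside Temperature', 'Barometer']]
      .resample('10T')
      .mean()
      .reset_index()
)

time_temp_plt = go.Scatter(x=plot_df['Tiempo Sistema'],
                           y=plot_df['Outside Temperature'],
                           mode='markers')
fig = go.Figure(data=[time_temp_plt],
                layout=go.Layout(xaxis_title="Date", yaxis_title="Temperature [C]"))
st.write(fig)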

Hi @Jonathan_Rhone.

Thanks for your replies, they have been very useful!

OK, just to be sure that we have the same information, display the same values and run the same code, I drop here the latest code I used. It uses the exact same data set I shared with you before.

With this code and this sample of data, the script runs, displays the information, and I can work with the options in the browser. I did a quick check of the memory consumption, and as long as I use this data set things work more or less fine, not too fast, not too slow.

The tool I use to check the memory of my laptop is a basic “htop” in the console. While using the data set you have, memory is OK. However, if I switch to a larger data set, this is what I get:

The memory and the swap are full, and it seems that Chrome is the one taking all of it (or at least a good part of it). Plus, the page in Chrome crashes. I tried to check the tools you mentioned, but I haven't found the same information in the Chrome dev tools yet. I will keep looking!

Here you see the code, so you can just run it:

import pandas as pd
import os
import numpy as np
import datetime as dt
import glob
import calendar 
import pdb
import streamlit as st
import matplotlib
import matplotlib.pyplot as plt
import plotly.graph_objs as go
import plotly.express as px


st.title('APP Oriente')

# READ THE DATA FROM FILE === NOT STORED IN SERVER ... YET
@st.cache
def load_data():
    files_only = glob.glob('*.csv')

    data = pd.DataFrame()
    for i in files_only: 
        df = pd.read_csv(i) 
        data = data.append(df)
    
    # Sort data by date
    data.sort_values(by='Tiempo Sistema', inplace=True, ascending=True)

    # Drop columns
    cols_to_drop=['Bar Trend', 'Next Record',
       'Inside Temperature', 'Inside Humidity',
       '10 Min Avg Wind Speed', 'Extra Temperatures', 'Soil Temperatures', 'Leaf Temperatures',
       'Extra Humidties', 'Storm Rain', 'Start Date of current Storm',
       'Month Rain', 'Year Rain', 'Day ET', 'Month ET', 'Year ET',
       'Soil Moistures', 'Leaf Wetnesses', 'Inside Alarms', 'Rain Alarms',
       'Outside Alarms', 'Extra Temp/Hum Alarms', 'Soil & Leaf Alarms',
       'Transmitter Battery Status', 'Console Battery Voltage',
       'Forecast Icons', 'Forecast Rule number', 'Time of Sunrise',
       'Time of Sunset', '<LF> = 0x0A', '<CR> = 0x0D', 'CRC']
    data.drop(cols_to_drop, axis=1, inplace = True)

    index = np.arange(0,len(data))
    data = data.set_index(index) 

    # Apply basic transformations
    data['Tiempo Sistema'] = pd.to_datetime(data['Tiempo Sistema']) - pd.to_timedelta(5,unit='h')
    data['Rain Rate'] = data['Rain Rate']*0.2/10. #in units of cm/hour
    data['Barometer'] = data['Barometer']/1000. + 760  # this calculation still needs to be corrected
    data['Outside Temperature'] = ( data['Outside Temperature']/10. - 32.) * (5.0/9.0)
    data.loc[(data['Outside Temperature'] > 50) | (data['Outside Temperature'] < -15)] = np.nan
    
    return data

# Load the data
df = load_data()

st.title('Última actualización')

'## Fecha[Y-M-D]/hora',df['Tiempo Sistema'].iloc[-1]
'## Temperatura',round(df['Outside Temperature'].iloc[-1],2),'°C'
'## Presión',df['Barometer'].iloc[-1],'hPa'

st.title('Gráficos')

st.header("Plotly Temperatura")
min_date_data = pd.to_datetime(df['Tiempo Sistema']).min()
max_date_data = pd.to_datetime(df['Tiempo Sistema']).max()

st.sidebar.markdown('# Fechas plots')
tmp1 = st.sidebar.date_input('Fecha inicial',min_date_data)
tmp2 = st.sidebar.date_input('Fecha final')

ok = ( (pd.to_datetime(df['Tiempo Sistema']) >= pd.to_datetime(tmp1)) & \
       (pd.to_datetime(df['Tiempo Sistema']) <  pd.to_datetime(tmp2)) )

st.write('Gráfico de temperatura entre',tmp1,' y ',tmp2)
df2 = df[ok]
time_temp_plt = go.Scatter(x=df2['Tiempo Sistema'], y=df2['Outside Temperature'], mode = 'markers')
df_plt = [time_temp_plt]
layout = go.Layout(xaxis_title="Date", yaxis_title="Temperature [C]")
fig = go.Figure(data=df_plt, layout=layout)
st.write(fig)

st.header("Plotly Presión")
st.write('Gráfico de presión entre',tmp1,' y ',tmp2)
time_temp_plt = go.Scatter(x=df2['Tiempo Sistema'], y=df2['Barometer'], mode = 'markers')
df_plt = [time_temp_plt]
layout = go.Layout(xaxis_title="Date", yaxis_title="Presión [hpa]")
fig = go.Figure(data=df_plt, layout=layout)
st.write(fig)

if st.checkbox('Mostrar datos'):
    '## Data',len(df2)
    df2

Thanks again!

“Here you see the code, so you can just run it:”

“The memory and the swap are full”

It looks like you still have about 2 GB of free memory, with 5.6 GB used out of 7.6 GB total.

“Plus, the page in Chrome crashes.”

Is there an error message? Sounds like you’re hitting the memory limits for a browser tab.

“I haven't found the same information in the Chrome dev tools yet”

Can you open the dev tools and click on the three dots in the upper right-hand corner? In the drop-down menu, hover over More Tools; perhaps your Memory tab is hidden in there. I also see a Performance Monitor option there, which also displays the JS heap size for me.

Other

Here’s some possibly related light reading

Hey @esiilvavilla, could you share your sample data again? Wanted to run some performance tests on a development branch against your app.

Hi @Jonathan_Rhone.

Sorry for the late reply; I only found time to work on this again today.

Here are two links:

  1. https://www.dropbox.com/sh/mkgd9u34k8nnd94/AADIh_ys4jYqCnz40X3qbndva?dl=0
  2. https://www.dropbox.com/sh/wwbv5bs4c3ohxhu/AAC-n2KZhNDONWnqVpsliV9ka?dl=0

In the first link you will find the same data set I shared with you before, with information for only one month. The second link increases the sample to two months.

The app looks a bit different now; I made a few changes. I compared it with the last screenshot you sent me showing the data, and it reads and presents the same results as what I have been getting here.

Regarding memory consumption, I leave some screenshots here. It seems to be working fine, although I see that the larger the sample, the slower the app.

This is using the second set of data (link number 2)

This is using the first set of data (link number 1)

I will keep looking for options to work with a large (huge) data set and still have a fast app.

Thanks a lot

One thing that helps is using go.Scattergl instead of go.Scatter. I can see a noticeable performance boost!

Hi @Jonathan_Rhone.

I will give it a try. Do you have an example of what you used? Also, how did it work with the sample data?

Ciao and thanks,
Esteban

I used the most recent code that you posted, and the sample data in Dropbox.

The only change I made was go.Scatter to go.Scattergl in your app.
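In case it helps, this is the kind of one-line swap I mean, shown on the temperature trace from your script (go.Scattergl takes the same arguments as go.Scatter but renders with WebGL, which copes much better with large numbers of points):

time_temp_plt = go.Scattergl(x=df2['Tiempo Sistema'],
                             y=df2['Outside Temperature'],
                             mode='markers')
fig = go.Figure(data=[time_temp_plt],
                layout=go.Layout(xaxis_title="Date", yaxis_title="Temperature [C]"))
st.write(fig)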