Pandas merge not working as expected in streamlit

Summary

When using pandas merge function within a callback function, the dataframe is not updated correctly. However, the pandas drop function works as expected

Note that although i have turned on st.cache. The same behavior is noted when removing the cache function as well.

Steps to reproduce

Code snippet:

import streamlit as st
import pandas as pd


@st.cache(allow_output_mutation=True)
def read_df():
    df = pd.DataFrame({
        'col1':[1,2],
        'col2':['A','B']
    })
    return df

df = read_df()

def do_something():
    global df
    df_new = pd.DataFrame({
        'col1':[1,2],
        'col3':["X","Y"]
    })
    df.drop(['col2'], axis = 1, inplace = True)
    df = df.merge(df_new, on="col1")

st.button("Do Something", on_click=do_something, args =())

download_csv = df.to_csv().encode('utf-8')
st.download_button('Download', data = download_csv, file_name = 'download_csv.csv', mime='text/csv')

Steps to reproduce behavior

  • click on “Do Something” button
  • click on “Download” button

Expected behavior:

I would expect the downloaded csv to be displayed

   col1 col3
0     1    X
1     2    Y

Actual behavior:

However, i get the following output instead

   col1 
0     1    
1     2   

Debug info

  • Streamlit version: 1.16.0
  • Python version: 3.8.15
  • Using Conda: Yes
  • OS version: Windows 11
  • Browser version: Edge v108.0.1462.54

Your code is always re-initializing df. It really doesn’t matter what you do in your callback function because as soon as the callback is finished, your page will reload and df will go back to df = read_df().

Write statements added to clarify what is happening.

import streamlit as st
import pandas as pd


@st.cache(allow_output_mutation=True)
def read_df():
    df = pd.DataFrame({
        'col1':[1,2],
        'col2':['A','B']
    })
    return df

df = read_df()
st.write('This is df at the start of the page, created from read_df()')
df

def do_something():
    global df
    df_new = pd.DataFrame({
        'col1':[1,2],
        'col3':["X","Y"]
    })
    df.drop(['col2'], axis = 1, inplace = True)
    df = df.merge(df_new, on="col1")
    st.write('This is df at the end of the callback')
    df

st.button("Do Something", on_click=do_something, args =())

download_csv = df.to_csv().encode('utf-8')
st.download_button('Download', data = download_csv, file_name = 'download_csv.csv', mime='text/csv')

If you want a callback function to change an object and have memory of that change, you need to use session state.

import streamlit as st
import pandas as pd


@st.cache(allow_output_mutation=True)
def read_df():
    df = pd.DataFrame({
        'col1':[1,2],
        'col2':['A','B']
    })
    return df

if 'df' not in st.session_state:
    st.session_state.df = read_df()
df=st.session_state.df

def do_something():
    global df
    df_new = pd.DataFrame({
        'col1':[1,2],
        'col3':["X","Y"]
    })
    df.drop(['col2'], axis = 1, inplace = True)
    st.session_state.df = df.merge(df_new, on="col1")

st.button("Do Something", on_click=do_something, args =())

st.write('This is df')
df

download_csv = df.to_csv().encode('utf-8')
st.download_button('Download', data = download_csv, file_name = 'download_csv.csv', mime='text/csv')

Hi @mathcatsand ,

Noted your explanation. May i know then why does streamlit have memory of the drop function but not the merge function?

Feels like its something obvious that i missed out. Thanks

I don’t think it does. Can you show more what you are getting to make you think it does?

If you run my initial code, the initial df should be as such

   col1 col2
0     1    A
1     2    B

After you click on Do Something, the generated df would be

   col1 
0     1    
1     2   

From the above result, i am inferring that streamlit manage to save the drop function’s output but did not save the merge function’s output

It’s a result of the allow_output_mutation flag. It doesn’t matter that you declare df a global variable in the callback, because when the page reloads it grabs df from the cached read_df function. You did the column drop in place. The merge was not done in place, but via assignment instead. So the merge didn’t mutate what was cached.

I see. So in a way,

  • drop in place is done before page reload
  • merge assignment is done after page reload.

And since it did not manage to save the merge, that’s why only col1 remains. Is my understanding correct?

By the way, thanks for patiently explaining this to me!

1 Like

No, they are are both done before the page reload.

The only reason it gives the illusion of remembering the drop, is because you mutated what was actually cached. Generally, I do not recommend mutating things that are cached since it can lead to confusion.

  1. On the first page load, read_df is executed and the result is cached (containing col1 and col2).
  2. Click Do Something
  3. You drop col2 in place, acting on the output of the read_df function.
  4. Since you allow mutation, the cached value is changed.
  5. You then perform a merge, but not in place, so the global variable df is updated but not the cached value.
  6. The page reloads.
  7. read_df is executed again, but grabs the result from cache. This is the mutated result with the dropped column.

Is there any way for me to perform the merge in place and save it to cache, given that it isn’t a functionality built into the pandas function?

Just wondering if we can do it without using the session state method that you suggested. I’ve tried using it but it seems slightly slower.

Not for merge, specifically. Merge has no inplace optional keyword and will always return a copy. You can use a column assignment to do it manually if you need to.

Here’s a manual way around merge to do the operation inplace. Either match the indices or apply a map to define an added column. It would be convenient if you kept an index on all your dataframes that you would be merging on, but in this example I set and reset the indices so ‘col1’ only temporarily serves as the index for the merge.

def do_something():
    global df
    df_new = pd.DataFrame({
        'col1':[2,1],
        'col3':["Y","X"]
    })    
    df.drop(['col2'], axis = 1, inplace = True)
    df_new.set_index('col1', inplace=True)
    df.set_index('col1', inplace=True)
    df['col3'] = df_new['col3']
    df.reset_index(inplace=True)

Using session state like I showed above is generally a better way to manage an accumulation of changes. By mutating the cache, the next user will come along and get that mutated result, just to be clear. Vs using session state allows everyone to start from the same initial point. Perhaps that’s what you’re going for, but just being explicit here.

3 Likes

Fantastic answer! I’ve thoroughly understood it now. Really thanks for your patience explaining all these to me!

2 Likes