Pandas merge not working as expected in streamlit

erjieyong · January 1, 2023, 5:29am

Summary

When using pandas merge function within a callback function, the dataframe is not updated correctly. However, the pandas drop function works as expected

Note that although i have turned on st.cache. The same behavior is noted when removing the cache function as well.

Steps to reproduce

Code snippet:

import streamlit as st
import pandas as pd


@st.cache(allow_output_mutation=True)
def read_df():
    df = pd.DataFrame({
        'col1':[1,2],
        'col2':['A','B']
    })
    return df

df = read_df()

def do_something():
    global df
    df_new = pd.DataFrame({
        'col1':[1,2],
        'col3':["X","Y"]
    })
    df.drop(['col2'], axis = 1, inplace = True)
    df = df.merge(df_new, on="col1")

st.button("Do Something", on_click=do_something, args =())

download_csv = df.to_csv().encode('utf-8')
st.download_button('Download', data = download_csv, file_name = 'download_csv.csv', mime='text/csv')

Steps to reproduce behavior

click on “Do Something” button
click on “Download” button

Expected behavior:

I would expect the downloaded csv to be displayed

   col1 col3
0     1    X
1     2    Y

Actual behavior:

However, i get the following output instead

   col1 
0     1    
1     2

Debug info

Streamlit version: 1.16.0
Python version: 3.8.15
Using Conda: Yes
OS version: Windows 11
Browser version: Edge v108.0.1462.54

mathcatsand · January 1, 2023, 7:27am

Your code is always re-initializing df. It really doesn’t matter what you do in your callback function because as soon as the callback is finished, your page will reload and df will go back to df = read_df().

Write statements added to clarify what is happening.

import streamlit as st
import pandas as pd


@st.cache(allow_output_mutation=True)
def read_df():
    df = pd.DataFrame({
        'col1':[1,2],
        'col2':['A','B']
    })
    return df

df = read_df()
st.write('This is df at the start of the page, created from read_df()')
df

def do_something():
    global df
    df_new = pd.DataFrame({
        'col1':[1,2],
        'col3':["X","Y"]
    })
    df.drop(['col2'], axis = 1, inplace = True)
    df = df.merge(df_new, on="col1")
    st.write('This is df at the end of the callback')
    df

st.button("Do Something", on_click=do_something, args =())

download_csv = df.to_csv().encode('utf-8')
st.download_button('Download', data = download_csv, file_name = 'download_csv.csv', mime='text/csv')

If you want a callback function to change an object and have memory of that change, you need to use session state.

import streamlit as st
import pandas as pd


@st.cache(allow_output_mutation=True)
def read_df():
    df = pd.DataFrame({
        'col1':[1,2],
        'col2':['A','B']
    })
    return df

if 'df' not in st.session_state:
    st.session_state.df = read_df()
df=st.session_state.df

def do_something():
    global df
    df_new = pd.DataFrame({
        'col1':[1,2],
        'col3':["X","Y"]
    })
    df.drop(['col2'], axis = 1, inplace = True)
    st.session_state.df = df.merge(df_new, on="col1")

st.button("Do Something", on_click=do_something, args =())

st.write('This is df')
df

download_csv = df.to_csv().encode('utf-8')
st.download_button('Download', data = download_csv, file_name = 'download_csv.csv', mime='text/csv')

erjieyong · January 1, 2023, 8:33am

Hi @mathcatsand ,

Noted your explanation. May i know then why does streamlit have memory of the drop function but not the merge function?

Feels like its something obvious that i missed out. Thanks

mathcatsand · January 1, 2023, 8:37am

I don’t think it does. Can you show more what you are getting to make you think it does?

erjieyong · January 1, 2023, 8:45am

If you run my initial code, the initial df should be as such

   col1 col2
0     1    A
1     2    B

After you click on Do Something, the generated df would be

   col1 
0     1    
1     2

From the above result, i am inferring that streamlit manage to save the drop function’s output but did not save the merge function’s output

mathcatsand · January 1, 2023, 8:55am

It’s a result of the allow_output_mutation flag. It doesn’t matter that you declare df a global variable in the callback, because when the page reloads it grabs df from the cached read_df function. You did the column drop in place. The merge was not done in place, but via assignment instead. So the merge didn’t mutate what was cached.

erjieyong · January 1, 2023, 9:12am

I see. So in a way,

drop in place is done before page reload
merge assignment is done after page reload.

And since it did not manage to save the merge, that’s why only col1 remains. Is my understanding correct?

By the way, thanks for patiently explaining this to me!

mathcatsand · January 1, 2023, 9:33am

No, they are are both done before the page reload.

The only reason it gives the illusion of remembering the drop, is because you mutated what was actually cached. Generally, I do not recommend mutating things that are cached since it can lead to confusion.

On the first page load, read_df is executed and the result is cached (containing col1 and col2).
Click Do Something
You drop col2 in place, acting on the output of the read_df function.
Since you allow mutation, the cached value is changed.
You then perform a merge, but not in place, so the global variable df is updated but not the cached value.
The page reloads.
read_df is executed again, but grabs the result from cache. This is the mutated result with the dropped column.

erjieyong · January 2, 2023, 2:12am

Is there any way for me to perform the merge in place and save it to cache, given that it isn’t a functionality built into the pandas function?

Just wondering if we can do it without using the session state method that you suggested. I’ve tried using it but it seems slightly slower.

mathcatsand · January 2, 2023, 2:40am

Not for merge, specifically. Merge has no inplace optional keyword and will always return a copy. You can use a column assignment to do it manually if you need to.

Here’s a manual way around merge to do the operation inplace. Either match the indices or apply a map to define an added column. It would be convenient if you kept an index on all your dataframes that you would be merging on, but in this example I set and reset the indices so ‘col1’ only temporarily serves as the index for the merge.

def do_something():
    global df
    df_new = pd.DataFrame({
        'col1':[2,1],
        'col3':["Y","X"]
    })    
    df.drop(['col2'], axis = 1, inplace = True)
    df_new.set_index('col1', inplace=True)
    df.set_index('col1', inplace=True)
    df['col3'] = df_new['col3']
    df.reset_index(inplace=True)

Using session state like I showed above is generally a better way to manage an accumulation of changes. By mutating the cache, the next user will come along and get that mutated result, just to be clear. Vs using session state allows everyone to start from the same initial point. Perhaps that’s what you’re going for, but just being explicit here.

erjieyong · January 2, 2023, 3:12am

Fantastic answer! I’ve thoroughly understood it now. Really thanks for your patience explaining all these to me!

system · January 2, 2024, 3:13am

This topic was automatically closed 365 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Using caching with API calls and messy DataFrames Using Streamlit cache , pandas	5	1251	November 19, 2021
St.cache and ouput mutation Using Streamlit	3	902	May 13, 2022
Merge two pandas dataframe Using Streamlit pandas	3	2843	January 12, 2022
Does streamlit re-load python file from beginning upon clicking checkbox? Using Streamlit	4	1638	August 30, 2023
Update dataframe from csv file when updated Community Cloud pandas , streamlit-cloud	5	1301	October 16, 2023

Pandas merge not working as expected in streamlit

Summary

Steps to reproduce

Debug info

Related topics

Hello there 👋🏻

Cookie settings

Strictly necessary cookies

Performance cookies

Functional cookies

Targeting cookies