Hi all,
I am new to Streamlit and want to use it to visualize a series of distributions over time. I've encoded the "time" variable in a slider, and the visualization is a seaborn FacetGrid. However, when I implemented this, the UI was very slow. I used st.cache on all of the data loading and processing functions, so the only work left on a slider change should be generating the FacetGrid itself. I've attached a gif showing the issue.
The dataset in question is around 36k records. Am I doing something fundamentally wrong? I had thought this kind of use case was exactly what Streamlit was for. I would greatly appreciate anyone's help on this!
Hi @ArvindR -
From your gif, it does look like the data is being loaded multiple times, which would certainly slow things down. Can you post the code you are running?
Best,
Randy
@randyzwitch here is the code; I swapped out the function bodies for comments to make it more readable. I used @st.cache on each function, though, so I'm not sure why it would reload the data every time.
@st.cache
def load_data():
    # Return data loaded from CSV files in a dict
    ...

@st.cache
def combine_yearly_data(loaded_data):
    # Combine the yearly CSV data into a single dataframe
    # and standardize the column names
    ...

@st.cache
def filter_out_bad_responses(data):
    # Filter out bad data
    ...

@st.cache
def filter_to_top_metros(data, num_metros=25):
    # Further filter data
    ...

# Prepare data for visualization
data = load_data()
combined_data = combine_yearly_data(data)
filtered_data = filter_out_bad_responses(combined_data)
top_metros_prepared_data = filter_to_top_metros(filtered_data)
year = st.slider("Year", 2007, 2017, step=2)
vac_stats = top_metros_prepared_data[top_metros_prepared_data['YEAR'] == str(year)] \
    .groupby('METRO')['VACMONTHS'] \
    .agg(["count", "median"])

bins = np.arange(0, 26, 1)
g = sns.FacetGrid(top_metros_prepared_data[top_metros_prepared_data['YEAR'] == str(year)],
                  col='METRO', col_wrap=5, col_order=vac_stats.index.to_list())
g = g.map(sns.distplot, 'VACMONTHS', bins=bins) \
     .set(xlim=(0, 25)) \
     .set_titles("{col_name}") \
     .set_axis_labels("Months vacant")

for axis in g.axes:
    sample_size = vac_stats['count'][axis.title.get_text()]
    median = vac_stats['median'][axis.title.get_text()]
    axis.set_title(f"{axis.title.get_text()}|Med:{int(median)}|Sample:{sample_size}")

g.fig.suptitle(f"{year} Vacancy Distributions", size=16)
g.fig.subplots_adjust(top=0.94)
st.pyplot()
Nothing immediately sticks out here, so this is probably one of those things where if I was going to solve this, I’d start benchmarking each step. For example, I’m not familiar with how fast a FacetGrid renders, so I don’t know if that’s an expensive function or not (a 25-way set of plots seems involved).
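One lightweight way to benchmark each step is a timing decorator around the suspect functions. A minimal sketch using only the standard library; the `timed` decorator, the `timings` dict, and `slow_step` are all hypothetical names for illustration, not part of the original code:

```python
import time
from functools import wraps

timings = {}

def timed(fn):
    # Record wall-clock time for each call so the slow steps stand out
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        timings[fn.__name__] = time.perf_counter() - start
        return result
    return wrapper

@timed
def slow_step():
    # Stand-in for an expensive step, e.g. rendering the FacetGrid
    time.sleep(0.05)

slow_step()
print(sorted(timings.items(), key=lambda kv: -kv[1]))
```

Decorating each of the cached functions (and a wrapper around the FacetGrid code) this way would show whether the plot rendering or a cache miss is eating the time.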
You could also consider pre-computation to speed things up further, since your group-by is constant (groupby('METRO')['VACMONTHS']) and so are your metrics (["count", "median"]).
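Since the group-by and metrics never change, the per-year stats table could be built once for all years and then looked up on each slider change. A rough sketch with a made-up miniature dataframe standing in for top_metros_prepared_data (in the app, the dict comprehension could live behind a single @st.cache function):

```python
import pandas as pd

# Hypothetical miniature stand-in for top_metros_prepared_data
df = pd.DataFrame({
    "YEAR": ["2007", "2007", "2007", "2007", "2009", "2009"],
    "METRO": ["A", "A", "B", "B", "A", "B"],
    "VACMONTHS": [2, 4, 6, 8, 1, 3],
})

# Compute the count/median table for every year up front
stats_by_year = {
    year: grp.groupby("METRO")["VACMONTHS"].agg(["count", "median"])
    for year, grp in df.groupby("YEAR")
}

# On a slider change, fetching the stats is now just a dict lookup
vac_stats = stats_by_year["2007"]
```

The same idea applies to the per-year filtered frame passed to the FacetGrid: slicing it once per year up front keeps the slider callback cheap.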
@randyzwitch thanks for the tips - I'll try those out. One last thing: I was curious why the UI said load_data and combine_yearly_data were running each time, even though they were cached. I assumed it was a UI bug, since print statements from those functions were not being executed.
Well, yes, that's the other question. I don't think this is a bug per se, but figuring out why it's saying that was my first suggestion. Without the full code and dataset to explore, it's hard for me to suggest what the issue might be.
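For what it's worth, st.cache decides whether to rerun by fingerprinting the function's inputs; if an argument hashes differently on each script rerun, the function executes again even though it's decorated. A rough, pure-Python sketch of that idea (this is not Streamlit's actual implementation; `digest`, `cached_call`, and `runs` are made up for illustration):

```python
import hashlib
import pickle

cache = {}
runs = []

def digest(obj):
    # Fingerprint an object by its pickled bytes
    return hashlib.md5(pickle.dumps(obj)).hexdigest()

def cached_call(fn, *args):
    # Rerun only when the (function, argument-fingerprints) key is new
    key = (fn.__name__, tuple(digest(a) for a in args))
    if key not in cache:
        runs.append(fn.__name__)  # analogous to the "Running load_data()." message
        cache[key] = fn(*args)
    return cache[key]

def total(values):
    return sum(values)

data = [1, 2, 3]
cached_call(total, data)
cached_call(total, data)  # same fingerprint: cache hit, no rerun
data.append(4)            # mutating an input changes its fingerprint...
cached_call(total, data)  # ...so the function runs again
```

So one thing to check is whether anything upstream of the cached functions (or a mutation of a cached return value) could be changing between reruns.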