Thanks for the reply, @randyzwitch. I’ll note that I need to update to the latest version of Streamlit, and I haven’t implemented the chaining yet to see if that helps. But I just ran things again and it basically ground to a halt, so I’m pretty sure I’m doing something wrong!
Here’s the code I use to load the initial set of dataframes:
```python
@st.cache
def load_parquet(years):
    # years comes from a st.slider
    files = []
    for year in range(years[0], years[1] + 1):
        # Each parquet file is O(15-20) MB
        filename = f"/path/to/parquet/files/{year}.parquet"
        # data_cols is O(20) or so columns
        df = pd.read_parquet(filename, columns=data_cols)
        files.append(df)
    df = pd.concat(files)
    df = df.reset_index()
    return df
```
My [terrible] spidey sense is telling me this might be a bad idea. My hope was to only load the data for the years needed, but maybe it just makes sense to throw it all into one big DF and load it regardless?
(I’ll note that I basically do this load right after the years slider, but before all the rest of the selection components. It feels like this should be okay, since it’s cached and the data shouldn’t be reloaded unless someone changes the year selection but maybe not?)
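Related: one variant I’ve been toying with is caching each year’s read separately, so that widening the slider only loads the years not already cached. This is just a sketch; `read_year` here is a stand-in for the real `pd.read_parquet` call, and in the app the per-year loader is what I’d wrap in `@st.cache`:

```python
import pandas as pd

# Stand-in for pd.read_parquet in this sketch: in the real app each year's
# frame would come from its own parquet file.
def read_year(year, columns):
    return pd.DataFrame({c: [year] for c in columns})

# In the real app this would call an @st.cache-wrapped per-year loader,
# so moving the slider from (2000, 2005) to (2000, 2006) only reads 2006.
def load_years(year_range, columns, reader=read_year):
    frames = [reader(y, columns) for y in range(year_range[0], year_range[1] + 1)]
    return pd.concat(frames, ignore_index=True)
```

No idea yet whether the extra cache entries are a net win over one big cached concat, though.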
I use the Streamlit UI components to generate masks. These vary widely (sliders, multiselects, checkboxes, etc.), but they’re basically relatively straightforward things like

```python
names = st.sidebar.multiselect('Select names', helper.ALL_NAMES, default=default_name, key='name_select')
masks['names'] = data['names'].isin(names)
```

(noting that I’m careful to keep the multiselect options to < 150 or so). I then apply all my masks with:
```python
from functools import reduce

import numpy as np

# Mask aggregator
def aggregate_masks(data, masks, operator='and'):
    if operator == 'and':
        return data[reduce(np.logical_and, masks.values())]
    elif operator == 'or':
        return data[reduce(np.logical_or, masks.values())]
```
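For concreteness, here’s a tiny standalone run of that aggregator (toy data and made-up column names; the real masks come from the widgets, and I redefine the function here so the snippet runs on its own):

```python
from functools import reduce

import numpy as np
import pandas as pd

def aggregate_masks(data, masks, operator='and'):
    if operator == 'and':
        return data[reduce(np.logical_and, masks.values())]
    elif operator == 'or':
        return data[reduce(np.logical_or, masks.values())]

data = pd.DataFrame({'name': ['ann', 'bob', 'cat'], 'year': [2001, 2002, 2003]})
masks = {
    'names': data['name'].isin(['ann', 'bob']),
    'years': data['year'] >= 2002,
}
# 'and' keeps rows passing every mask; 'or' keeps rows passing any mask.
```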
and then I use a hack I found on these forums to display a subset of the “final” DF:
```python
ret = aggregate_masks(data, masks)
page_size = 1000
page_number = st.number_input(
    label="Page Number",
    min_value=1,
    max_value=math.ceil(len(ret) / page_size),
    step=1,
)
current_start = (page_number - 1) * page_size
current_end = min(page_number * page_size, len(ret))
st.write(ret[current_start:current_end])
st.write(f"Displaying rows {current_start} to {current_end}")
```
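One wrinkle I noticed while writing this up: if the masks filter out everything, `max_value` drops below `min_value` on the number input. Pulling the slice arithmetic into a small helper (hypothetical, not what’s in my app yet) makes that empty case easier to guard:

```python
import math

def page_bounds(n_rows, page_number, page_size=1000):
    # Returns (start, end, n_pages) for slicing the filtered frame.
    # Guard: an empty result still counts as one (empty) page, so the
    # number_input's max_value never falls below its min_value of 1.
    n_pages = max(1, math.ceil(n_rows / page_size))
    start = (page_number - 1) * page_size
    end = min(page_number * page_size, n_rows)
    return start, end, n_pages
```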
That’s basically it as far as [semi-obvious?] problem spots go. As I mentioned, the rest is just UI-to-mask conversion (which I spent a fair amount of time on, and would love to keep vs. converting it all to, say, SQL queries or what have you).
So my hope is that I’m just doing something dumb with the initial loading of the data, and/or the shift to chaining will speed things up. But if anyone sees anything else that’s out of whack, I’d love to know about it.
Thanks for the help!