Thanks for the reply, @randyzwitch. I'll note that I need to update to the latest version of Streamlit, and I haven't implemented the chaining yet to see if that helps. But I just ran things again and it basically ground to a halt, so I'm pretty sure I'm doing something wrong!

Here's some code I use to load the initial set of dataframes:
```python
import pandas as pd
import streamlit as st

@st.cache
def load_parquet(years):
    # years comes from a st.slider
    files = list()
    for year in range(years[0], years[1] + 1):
        # Each parquet file is O(15-20) MB
        filename = f"/path/to/parquet/files/{year}.parquet"
        # data_cols is O(20) or so columns
        df = pd.read_parquet(filename, columns=data_cols)
        files.append(df)
    df = pd.concat(files)
    df = df.reset_index()
    return df
```
My [terrible] spidey sense is telling me this might be a bad idea. My hope was to load only the data for the years needed, but maybe it just makes sense to throw it all into one big DF and load it regardless?

(I'll note that I basically do this load right after the years slider, but before all the rest of the selection components. It feels like this should be okay, since it's cached and the data shouldn't be reloaded unless someone changes the year selection, but maybe not?)
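For comparison, here's a minimal sketch of the "one big DF" alternative: load everything once (the part you'd cache) and then treat the year slider as just another cheap boolean mask. The synthetic frames below are hypothetical stand-ins for the real per-year parquet files.

```python
import pandas as pd

# Hypothetical stand-ins for the per-year parquet files; in the real app these
# would come from pd.read_parquet(f"/path/to/parquet/files/{year}.parquet").
frames = {
    2019: pd.DataFrame({"year": 2019, "value": [1, 2]}),
    2020: pd.DataFrame({"year": 2020, "value": [3, 4]}),
    2021: pd.DataFrame({"year": 2021, "value": [5, 6]}),
}

# Load everything once (this is the part that would live in a cached function)...
all_years = pd.concat(frames.values(), ignore_index=True)

# ...then slicing by the slider's (lo, hi) tuple is just a boolean mask,
# so changing the year range never touches the cache key.
def slice_years(df, years):
    lo, hi = years
    return df[df["year"].between(lo, hi)]

subset = slice_years(all_years, (2020, 2021))
```

The trade-off is a larger one-time load and memory footprint in exchange for never re-running `pd.concat` when the year range changes.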
I use the Streamlit UI components to generate masks. These vary widely (sliders, multiselects, checkboxes, etc.), but they're basically relatively straightforward things like:
```python
names = st.sidebar.multiselect('Select names', helper.ALL_NAMES, default=default_name, key='name_select')
masks['names'] = data['names'].isin(names)
```

(noting to be careful to keep the multiselect options to < 150 or so). I then apply all my masks with:
```python
from functools import reduce

import numpy as np

# Mask aggregator: combine all boolean masks into one and index the frame
def aggregate_masks(data, masks, operator='and'):
    if operator == 'and':
        return data[reduce(np.logical_and, masks.values())]
    elif operator == 'or':
        return data[reduce(np.logical_or, masks.values())]
```
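For concreteness, here's the reduce-over-masks pattern on a toy frame (the column names and values are made up for illustration), showing how the `and`/`or` modes differ:

```python
from functools import reduce

import numpy as np
import pandas as pd

data = pd.DataFrame({
    "names": ["ann", "bob", "cat", "dan"],
    "score": [10, 20, 30, 40],
})

# Each UI widget contributes one boolean Series keyed by name.
masks = {
    "names": data["names"].isin(["ann", "cat", "dan"]),
    "score": data["score"] > 25,
}

# 'and' keeps rows passing every mask; 'or' keeps rows passing any mask.
and_result = data[reduce(np.logical_and, masks.values())]
or_result = data[reduce(np.logical_or, masks.values())]
```

Since each mask is built against the full `data` frame, they all share the same index and `reduce` just elementwise-combines them, so the order of the masks doesn't matter.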
and then I use a hack I found on these forums to display a subset of the "final" DF:
```python
import math

ret = aggregate_masks(data, masks)

page_size = 1000
page_number = st.number_input(
    label="Page Number",
    min_value=1,
    max_value=math.ceil(len(ret) / page_size),
    step=1,
)
current_start = (page_number - 1) * page_size
current_end = min(page_number * page_size, len(ret))

st.write(ret[current_start:current_end])
st.write(f"Displaying rows {current_start} to {current_end}")
```
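In case it's useful, that clamping arithmetic can be factored into a tiny pure-Python helper (hypothetical name, no Streamlit needed) that also guards against an out-of-range page number:

```python
import math

def page_bounds(total_rows, page_number, page_size=1000):
    # Clamp the page number to [1, last_page] so the slice never
    # reads past the end of the frame.
    last_page = max(1, math.ceil(total_rows / page_size))
    page_number = min(max(page_number, 1), last_page)
    start = (page_number - 1) * page_size
    end = min(page_number * page_size, total_rows)
    return start, end
```

Keeping it pure makes the paging math trivially testable outside the app; the widget just supplies `page_number`.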
That's basically it as far as [semi-obvious?] problem spots go. As I mentioned, the rest is just UI-to-mask conversion (that I spent a fair amount of time on, and would love to keep vs. converting it all to, say, SQL queries or what have you).

So my hope is that I'm just doing something dumb with the initial loading of the data, and/or that the shift to chaining will speed things up. But if anyone sees anything else that's out of whack, I'd love to know about it.
Thanks for the help!