Hi all,
I’m working on a Streamlit app that compares baseline weather files (TMYx) with future morphed climate scenarios (RCP 2050 / 2080) for building performance analysis.
https://eetra-future-weather-app.streamlit.app/
What the app does
- Parses EPW weather files (Italy, ~3,500+ files total)
- Builds:
  - Hourly temperature datasets
  - Daily statistics
  - Monthly summaries
- Compares:
  - TMYx baseline
  - TMYx variants (2004–2018, 2007–2021, etc.)
  - Morphed RCP scenarios (rcp26, rcp45, rcp85 for 2050/2080)
- Displays:
  - Regional and national maps
  - Percentile-based temperature deltas
  - Location-level comparisons
  - Interactive charts
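To make "percentile-based temperature deltas" concrete, here is a minimal sketch of one plausible reading: the scenario-minus-baseline difference at a few percentiles of hourly dry-bulb temperature. The function name `percentile_deltas`, the chosen percentiles, and the synthetic data are assumptions for illustration, not the app's actual code.

```python
import numpy as np
import pandas as pd


def percentile_deltas(baseline, scenario, percentiles=(5, 50, 95)):
    """Scenario-minus-baseline difference at each requested percentile."""
    base_q = np.percentile(baseline, percentiles)
    scen_q = np.percentile(scenario, percentiles)
    return pd.Series(scen_q - base_q, index=[f"p{p}" for p in percentiles])


# Synthetic year of hourly temperatures (made-up values, not EPW data).
hours = np.arange(8760)
baseline = 12.0 + 10.0 * np.sin(2 * np.pi * hours / 8760)
scenario = baseline + 2.0  # assumed uniform +2 °C shift for the example

deltas = percentile_deltas(baseline, scenario)
print(deltas)
```

With a uniform shift every percentile moves by the same amount; real morphed files would typically show different deltas at the cold and hot ends of the distribution.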
All heavy preprocessing (EPW parsing, aggregation, pairing baseline ↔ RCP) is now done offline via Python scripts.
The Streamlit app should ideally only read precomputed parquet files.
Current data structure
Per location, I now generate a single parquet file containing:
- Baseline TMYx
- All RCP/year scenarios for that baseline
Folder structure:
data/
  04__italy_tmy_fwg_parquet/
    AB/
    BC/
    ...
Each parquet contains:
- Hourly dry-bulb temperature
- Daily stats
- Monthly stats
- Scenario metadata
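For reference, a rough sketch of how daily and monthly statistics can be derived from an hourly series with pandas during offline preprocessing. The column names `timestamp` and `dry_bulb_c`, the min/mean/max choice, and the synthetic values are assumptions; the real preprocessing scripts may compute different statistics.

```python
import numpy as np
import pandas as pd

# Two synthetic days of hourly dry-bulb temperature (values are made up).
timestamps = pd.date_range("2021-01-01", periods=48, freq=pd.Timedelta(hours=1))
hourly = pd.DataFrame({
    "timestamp": timestamps,
    "dry_bulb_c": 5.0 + 3.0 * np.sin(2 * np.pi * np.arange(48) / 24),
})

series = hourly.set_index("timestamp")["dry_bulb_c"]

# One row per calendar day / per calendar month, labelled by period start.
daily = series.resample("D").agg(["min", "mean", "max"])
monthly = series.resample("MS").agg(["min", "mean", "max"])

print(daily)
print(monthly)
```

Storing these aggregates next to the hourly data means the app never has to resample at request time.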
Current performance challenges
Despite precomputing:
- First load still feels heavy
- Map rendering (many points) can lag
- Switching between scenarios sometimes triggers noticeable recalculation
- Cached functions sometimes invalidate more than expected
The app uses:
- `@st.cache_data`
- Parquet (pyarrow)
- Pandas
- Plotly
- Folium for maps
Questions
- Best practices for loading large parquet datasets in Streamlit?
  - Should I pre-split more aggressively (e.g. per region only)?
  - Is DuckDB a better backend than Pandas for this use case?
- Map performance:
  - Better approach than Folium for 150–200 markers?
  - Should I pre-aggregate geojson layers?
- Caching strategy:
  - Is it better to cache whole DataFrames or pre-serialized lightweight objects?
  - Should I move more logic into `st.session_state` instead of `cache_data`?
- General architectural advice:
  - Is there a better pattern for large scenario-based analytical apps?
  - Would Snowflake / MotherDuck / DuckDB significantly improve performance?