Streamlit app for weather data analysis

Hi all,
I’m working on a Streamlit app that compares baseline weather files (TMYx) with future morphed climate scenarios (RCP 2050 / 2080) for building performance analysis.

https://eetra-future-weather-app.streamlit.app/

What the app does

  • Parses EPW weather files (Italy, roughly 3,500 files total)

  • Builds:

    • Hourly temperature datasets

    • Daily statistics

    • Monthly summaries

  • Compares:

    • TMYx baseline

    • TMYx variants (2004–2018, 2007–2021, etc.)

    • Morphed RCP scenarios (rcp26, rcp45, rcp85 for 2050/2080)

  • Displays:

    • Regional and national maps

    • Percentile-based temperature deltas

    • Location-level comparisons

    • Interactive charts

All heavy preprocessing (EPW parsing, aggregation, pairing baseline ↔ RCP) is now done offline via Python scripts.
The Streamlit app should ideally only read precomputed parquet files.


Current data structure

Per location, I now generate a single parquet file containing:

  • Baseline TMYx

  • All RCP/year scenarios for that baseline

Folder structure:

data/
  04__italy_tmy_fwg_parquet/
      AB/
      BC/
      ...

Each parquet contains:

  • Hourly dry-bulb temperature

  • Daily stats

  • Monthly stats

  • Scenario metadata


Current performance challenges

Despite precomputing:

  • First load still feels heavy

  • Map rendering (many points) can lag

  • Switching between scenarios sometimes triggers noticeable recalculation

  • Cached functions sometimes invalidate more than expected

The app uses:

  • @st.cache_data

  • Parquet (pyarrow)

  • Pandas

  • Plotly

  • Folium for maps


Questions

  1. Best practices for loading large parquet datasets in Streamlit?

    • Should I pre-split more aggressively (e.g. per region only)?

    • Is DuckDB a better backend than Pandas for this use case?

  2. Map performance:

    • Better approach than Folium for 150–200 markers?

    • Should I pre-aggregate geojson layers?

  3. Caching strategy:

    • Is it better to cache whole DataFrames or pre-serialized lightweight objects?

    • Should I move more logic into st.session_state instead of cache_data?

  4. General architectural advice:

    • Is there a better pattern for large scenario-based analytical apps?

    • Would Snowflake / MotherDuck / DuckDB significantly improve performance?

Welcome to the community and thanks for the detailed question! :rocket: For large, scenario-based analytical apps like yours, the following best practices tend to help:

1. Loading Large Parquet Datasets:
Pre-splitting data by region or scenario can reduce memory usage and speed up load times, as you only read what’s needed. Using DuckDB to query Parquet files directly (without loading full DataFrames into memory) is often faster and more efficient than Pandas for large datasets. According to Streamlit Docs, you can cache query results with @st.cache_data and set a TTL to avoid stale data.

2. Map Performance:
Folium can lag with many markers. For 150–200 points, consider Plotly’s scatter_mapbox or scatter_geo, which render all points as a single trace and are noticeably more responsive in Streamlit. Pre-aggregating or simplifying geojson layers also helps.

3. Caching Strategy:
Cache only what you need: prefer lightweight, pre-serialized objects or query results over entire DataFrames where possible. Use @st.cache_data for data and @st.cache_resource for connections or models. Avoid over-caching, as it increases memory usage and can trigger unexpected invalidations.

4. Session State vs. Cache:
Use st.session_state for user-specific, session-persistent variables (like UI state or selections), not for large data. Use @st.cache_data for shared, immutable data.

5. General Architecture:
DuckDB (or MotherDuck for cloud) is well suited to querying Parquet files on demand and can outperform Pandas for large analytical workloads. Snowflake is powerful but likely overkill unless you need enterprise-scale features.

Would you like a step-by-step example of integrating DuckDB with Streamlit for this use case?
