Show: Rolling token compression for Streamlit chat apps – what would you change?

Hi folks,

I’ve been wrestling with token costs in AI chat apps and tried a simple pattern:

The approach:

  • Keep last N turns uncompressed (for recency)
  • Summarize older turns heuristically
  • Re-summarize the summary if it grows too large

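The "re-summarize if it grows" step needs a size check. Here's the rough token estimate I use (the 4-characters-per-token ratio is a crude English-text assumption; a real tokenizer like tiktoken would be more accurate):

```python
def estimate_tokens(text: str) -> int:
    # Crude heuristic: ~4 characters per token for English prose.
    return max(1, len(text) // 4)

def summary_too_large(summary: str, budget: int = 500) -> bool:
    # Trigger another summarization pass once the rolling summary
    # itself exceeds its token budget.
    return estimate_tokens(summary) > budget
```
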
What it achieves:

  • ~80-90% token reduction for long conversations
  • No context loss for recent turns
  • Works with any provider (tested: Claude, Mistral, DeepSeek)

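Back-of-envelope on the reduction figure (the per-turn token count and summary budget are made-up assumptions, just to show the arithmetic):

```python
turns = 50
tokens_per_turn = 100  # assumed average; varies a lot in practice
uncompressed = turns * tokens_per_turn  # 5000 tokens

summary_tokens = 500   # assumed rolling-summary budget
compressed = summary_tokens + 3 * tokens_per_turn  # 3 recent turns kept verbatim

reduction = 1 - compressed / uncompressed
print(f"{reduction:.0%}")  # → 84%
```
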
What I’m unsure about:

  • Is heuristic summarization robust enough for production?
  • Would you compress differently for code vs. natural language chats?
  • Any edge cases I haven’t considered?

Minimal snippet (the core logic):

def compress_chat(history, max_tokens=2000):
    # Nothing old enough to compress yet
    if len(history) <= 3:
        return history
    # Keep last 3 turns intact for recency
    recent = history[-3:]
    # Summarize older turns; re-summarize once if the summary itself
    # outgrows the budget (count_tokens = your tokenizer of choice)
    older_summary = summarize(history[:-3])
    if count_tokens(older_summary) > max_tokens:
        older_summary = summarize([older_summary])
    return [older_summary] + recent
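
If you want to poke at the shape of this locally, here's a self-contained toy version. The `summarize` stub just keeps each turn's first sentence — a stand-in for illustration, not the real heuristic:

```python
def summarize(turns):
    # Stand-in heuristic: keep the first sentence of each older turn.
    # The real version would use an LLM call or smarter extraction.
    firsts = [t.split(". ", 1)[0] for t in turns]
    return "[Summary of earlier turns] " + " / ".join(firsts)

def compress_chat(history, keep_recent=3):
    if len(history) <= keep_recent:
        return history  # nothing old enough to compress
    return [summarize(history[:-keep_recent])] + history[-keep_recent:]

history = [
    "User asked about Streamlit session state. I explained st.session_state.",
    "User asked about caching. I pointed them at st.cache_data.",
    "User pasted a traceback. I suggested pinning the Streamlit version.",
    "User asked about token costs.",
    "I proposed rolling compression.",
]
compressed = compress_chat(history)
# 5 turns in, 4 messages out: 1 summary + the last 3 turns verbatim
```
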

(If helpful, I’ve put a working Streamlit reference on GitHub – but I’d rather hear about your compression patterns than promote code: 20centAI/20centai, an educational AI chat client with provider abstraction, token compression & state management in ~600 lines of Python.)

Thanks for any thoughts!