Hi folks,
I’ve been wrestling with token costs in AI chat apps and tried a simple pattern:
The approach:
- Keep last N turns uncompressed (for recency)
- Summarize older turns heuristically
- Re-summarize the summary if it grows too large
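The third bullet is the part the minimal snippet below doesn't show, so here's a rough sketch of it. The ~4-characters-per-token estimate, the `roll_up` helper, and the truncation fallback are my assumptions, not from the post or the repo:

```python
def approx_tokens(text):
    # Rough heuristic: ~4 characters per token for English text
    return len(text) // 4

def roll_up(summary, new_summary, budget=500):
    """Fold a new summary into the running one; shrink if over budget."""
    merged = summary + " " + new_summary if summary else new_summary
    if approx_tokens(merged) > budget:
        # Stand-in for a real re-summarization pass: keep the newest
        # material and truncate the oldest
        keep_chars = budget * 4
        merged = "…" + merged[-keep_chars:]
    return merged
```

In practice you'd replace the truncation branch with another LLM or heuristic summarization call; the point is just that the running summary itself gets compressed once it passes the budget.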
What it achieves:
- ~80-90% token reduction for long conversations
- No context loss for recent turns
- Works with any provider (tested: Claude, Mistral, DeepSeek)
What I’m unsure about:
- Is heuristic summarization robust enough for production?
- Would you compress differently for code vs. natural language chats?
- Any edge cases I haven’t considered?
Minimal snippet (the core logic):
def compress_chat(history, max_tokens=2000):
    if len(history) <= 3:
        return history  # too short to compress
    # Keep last 3 turns intact
    recent = history[-3:]
    # Summarize older turns
    older_summary = summarize(history[:-3])
    return [older_summary] + recent
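For anyone who wants to run it directly, here's a self-contained version with a naive summarizer stubbed in. The first-sentence heuristic and the message shapes are my placeholders, not the repo's actual implementation:

```python
def summarize(turns):
    # Placeholder heuristic: keep the first sentence of each older turn
    parts = [t["content"].split(". ")[0] for t in turns]
    return {"role": "system", "content": "Earlier conversation: " + "; ".join(parts)}

def compress_chat(history, keep_recent=3):
    if len(history) <= keep_recent:
        return history  # too short to compress
    recent = history[-keep_recent:]
    older_summary = summarize(history[:-keep_recent])
    return [older_summary] + recent

# Example: 10 turns collapse to 1 summary message + 3 intact recent turns
history = [{"role": "user", "content": f"Turn {i}. Extra detail."} for i in range(10)]
compressed = compress_chat(history)
print(len(compressed))  # 4
```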
(If helpful, I've put a working Streamlit reference on GitHub at 20centAI/20centai — an educational AI chat client covering provider abstraction, token compression, and state management in ~600 lines of Python — but I'd rather hear about your compression patterns than promote code.)
Thanks for any thoughts!