Hi folks,
I’ve been wrestling with token costs in AI chat apps and tried a simple pattern:
The approach:
- Keep last N turns uncompressed (for recency)
- Summarize older turns heuristically
- Re-summarize the summary if it grows too large
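The third bullet is the part the minimal snippet below doesn't show, so here's a rough sketch of it. The ~4-characters-per-token estimate, the `roll_up` helper, and the truncation fallback are my assumptions, not from the post or the repo:

```python
def approx_tokens(text):
    # Rough heuristic: ~4 characters per token for English text
    return len(text) // 4

def roll_up(summary, new_summary, budget=500):
    """Fold a new summary into the running one; shrink if over budget."""
    merged = summary + " " + new_summary if summary else new_summary
    if approx_tokens(merged) > budget:
        # Stand-in for a real re-summarization pass: keep the newest
        # material and truncate the oldest
        keep_chars = budget * 4
        merged = "…" + merged[-keep_chars:]
    return merged
```

In practice you'd replace the truncation branch with another LLM or heuristic summarization call; the point is just that the running summary itself gets compressed once it passes the budget.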
What it achieves:
- ~80-90% token reduction for long conversations
- No context loss for recent turns
- Works with any provider (tested: Claude, Mistral, DeepSeek)
What I’m unsure about:
- Is heuristic summarization robust enough for production?
- Would you compress differently for code vs. natural language chats?
- Any edge cases I haven’t considered?
Minimal snippet (the core logic):
def compress_chat(history, max_tokens=2000):
    if len(history) <= 3:
        return history  # too short to compress
    # Keep last 3 turns intact
    recent = history[-3:]
    # Summarize older turns
    older_summary = summarize(history[:-3])
    return [older_summary] + recent
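For anyone who wants to run it directly, here's a self-contained version with a naive summarizer stubbed in. The first-sentence heuristic and the message shapes are my placeholders, not the repo's actual implementation:

```python
def summarize(turns):
    # Placeholder heuristic: keep the first sentence of each older turn
    parts = [t["content"].split(". ")[0] for t in turns]
    return {"role": "system", "content": "Earlier conversation: " + "; ".join(parts)}

def compress_chat(history, keep_recent=3):
    if len(history) <= keep_recent:
        return history  # too short to compress
    recent = history[-keep_recent:]
    older_summary = summarize(history[:-keep_recent])
    return [older_summary] + recent

# Example: 10 turns collapse to 1 summary message + 3 intact recent turns
history = [{"role": "user", "content": f"Turn {i}. Extra detail."} for i in range(10)]
compressed = compress_chat(history)
print(len(compressed))  # 4
```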
(If helpful, I've put a working Streamlit reference on GitHub at 20centAI/20centai — an educational AI chat client covering provider abstraction, token compression, and state management in ~600 lines of Python — but I'd rather hear about your compression patterns than promote code.)
Thanks for any thoughts!