Streamlit + local Mistral 7B v0.2 = streaming answers, is that possible?

  1. Are you running your app locally or is it deployed?

Running the app locally.

  2. Share the link to your app’s public GitHub repository (including a requirements file).

The app has a private repository.

  3. Share the full text of the error message (not a screenshot).

There are no errors per se.

  4. Share the Streamlit and Python versions.

Python version = 3.11.6

Streamlit version = 1.31.0

Let me start with a huge thanks to the community, and especially to @andfanilo for the wonderful insights and videos, which helped us learn a lot!

We are developing a pretty straightforward chatbot app for a client, based on RAG + Mistral 7B.

So far, we have managed to “inject” custom CSS and HTML, set up the chat history display, modify the look and feel of the frontend, connect it to Mistral, and get answers to questions.

As I cannot paste the full code here (I know that’s not super helpful), I can add the libraries that we are using:

import torch
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.llms.huggingface_pipeline import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, BitsAndBytesConfig, GenerationConfig
from langchain.chains.question_answering import load_qa_chain
from langchain.prompts import PromptTemplate
import streamlit as st

Right now the GUI displays each answer only once it is fully generated, which can take 15-20 seconds on our 3090 in the local setup. We would therefore like to stream the answer, so that it looks as if it is being typed while it is generated.

I have looked into chunking Mistral’s answers and then chunking the output, but unfortunately this made a (very chunky) mess. We do have a nice “loading the answer” animation, but I believe streaming the answers would look a lot better.

Would anyone know a good approach to this? I can see that it can be achieved with OpenAI’s API, but we would like to use models of our own choice instead of OpenAI.

Thank you again for any input, and looking forward to seeing where Streamlit goes in the future! :slight_smile:
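For a local Hugging Face model, one commonly used approach is the `TextIteratorStreamer` class from transformers: `model.generate()` runs in a worker thread and pushes decoded text into the streamer, while the calling thread iterates over it. The sketch below assumes `model` and `tokenizer` were loaded with `AutoModelForCausalLM` and `AutoTokenizer`; the name `stream_generate` and the `max_new_tokens` default are made up for illustration, and the sketch calls the model directly rather than going through LangChain's `HuggingFacePipeline`.

```python
from threading import Thread


def stream_generate(model, tokenizer, prompt, max_new_tokens=256):
    """Yield decoded text pieces while model.generate() is still running.

    Hypothetical helper: `model` and `tokenizer` are assumed to be an already
    loaded transformers causal LM and its tokenizer.
    """
    # Imported lazily so this module can be loaded without transformers installed
    from transformers import TextIteratorStreamer

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # skip_prompt=True keeps the prompt itself out of the streamed text
    streamer = TextIteratorStreamer(
        tokenizer, skip_prompt=True, skip_special_tokens=True
    )
    generation = Thread(
        target=model.generate,
        kwargs=dict(inputs, streamer=streamer, max_new_tokens=max_new_tokens),
    )
    generation.start()
    # Iterating the streamer blocks until the next piece of text is ready
    for piece in streamer:
        yield piece
    generation.join()
```

In Streamlit 1.31 the resulting generator can probably be handed straight to `st.write_stream(stream_generate(model, tokenizer, prompt))`. All Streamlit calls then stay in the script thread, and only the generation itself runs in the worker thread.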

Hello @recooler,

Here’s an approach using Streamlit’s st.empty for placeholders and Python’s asyncio for simulating streaming of text. This won’t directly interface with your specific setup but should provide a foundation you can adapt.

import streamlit as st
import asyncio

async def generate_text_simulated(prompt, delay=1):
    # Simulating text generation by yielding chunks over time
    for i in range(1, 6):  # Example: Generate 5 chunks of text
        yield f"{prompt} response chunk {i}\n"
        await asyncio.sleep(delay)  # Simulate processing delay

def main():
    st.title("Async Text Generation Demo")

    prompt = st.text_input("Enter your prompt:", "Hello")

    if st.button("Generate"):
        text_placeholder = st.empty()
        text_placeholder.text("Generating response...")

        # Run the async text generation in a separate thread
        import threading
        def run_async():
            loop = asyncio.new_event_loop()

            async def update_text():
                generated_text = ""
                async for chunk in generate_text_simulated(prompt, 0.5):
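                    generated_text += chunk
                    # Redraw the placeholder with the text received so far
                    text_placeholder.text(generated_text)

            # The post is cut off at this point; the lines below are a presumed ending
            loop.run_until_complete(update_text())
            loop.close()

        threading.Thread(target=run_async).start()

if __name__ == "__main__":
    main()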

Hope this helps!

Kind Regards,
Sahir Maharaj
Data Scientist | AI Engineer



Hi @sahirmaharaj

Run as a Streamlit script, this code leads in my case to the “no session context” error discussed in many other threads about Streamlit and threading. However, I can make it partially work by changing the start of the thread to this:

        t = threading.Thread(target=run_async)

LangChain, however, creates its threads behind the scenes, so the same change cannot easily be applied there. Any suggestion on how to use LangChain (LCEL) and Streamlit together for streaming would still be appreciated.
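One possible way around the session context problem, sketched here under assumptions, is to keep every Streamlit call in the script thread and let the worker thread do nothing but push text chunks onto a `queue.Queue`. The names `stream_from_worker` and `fake_llm` are hypothetical; `fake_llm` stands in for whatever callback-based producer (a LangChain callback handler, for instance) emits the tokens.

```python
import queue
import threading
from typing import Callable, Iterator

_DONE = object()  # sentinel marking the end of the stream


def stream_from_worker(
    produce: Callable[[Callable[[str], None]], None],
) -> Iterator[str]:
    """Run produce(emit) in a worker thread; yield chunks in the caller's thread."""
    chunks: queue.Queue = queue.Queue()

    def worker() -> None:
        try:
            produce(chunks.put)  # the producer calls emit(text) per chunk
        except Exception as exc:
            chunks.put(exc)  # hand errors over to the consuming thread
        finally:
            chunks.put(_DONE)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = chunks.get()  # blocks until the worker emits something
        if item is _DONE:
            break
        if isinstance(item, Exception):
            raise item
        yield item


def fake_llm(emit: Callable[[str], None]) -> None:
    # Hypothetical stand-in for a model that reports tokens through a callback
    for word in ["Streaming ", "from ", "a ", "worker ", "thread."]:
        emit(word)
```

Because the worker never touches Streamlit, it needs no script run context. In the app, the generator would be consumed from the script itself, for example with `st.write_stream(stream_from_worker(fake_llm))` or by appending each chunk to a string and redrawing an `st.empty()` placeholder.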