How to clean variables for chatbot

My intention is to create a chatbot that can receive text and audio as input. If the user provides audio input, the audio is first transcribed to text, and then the LLM generates the response. The app also uses TTS to convert the response into audio and plays it. I’m running the app on my own server.

Expected Behavior

I can switch the input between text and audio with no bugs.

Current Behavior

When I change from audio input to text input, it seems that the previous audio still exists, so the chatbot appends bubbles containing multiple previous messages.
The app does not raise any error, so I cannot provide an error message. I want the system to return responses in the correct order. Currently, if the first input is text, the system returns the result properly. However, if the second input is audio, the bubble from the previous text input is printed again. I apologize if my explanation is unclear; please ask if you need more information.

I tried setting audio_bytes to None after the response is generated, but it does not seem to work. Could somebody please help me? Thank you.
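
A note on why resetting the variable may have no effect: Streamlit reruns the whole script from the top on every interaction, so variables set to None at the end of one run are simply reassigned on the next run. Assuming st_audiorec() keeps returning the last recording on each rerun (an assumption, not confirmed here), one possible workaround is to remember a hash of the last processed recording and skip audio that was already handled. The sketch below uses a plain dict in place of st.session_state, and the names is_new_recording and last_audio_digest are hypothetical, not part of Streamlit or st_audiorec:

```python
import hashlib
from typing import MutableMapping, Optional


def is_new_recording(state: MutableMapping, audio_bytes: Optional[bytes]) -> bool:
    """Return True only the first time a given recording is seen."""
    if not audio_bytes:
        return False
    digest = hashlib.sha256(audio_bytes).hexdigest()
    if state.get("last_audio_digest") == digest:
        # Same recording as on the previous rerun: treat it as stale.
        return False
    # Remember this recording so later reruns can skip it.
    state["last_audio_digest"] = digest
    return True


# A plain dict stands in for st.session_state in this sketch.
state = {}
first = is_new_recording(state, b"fake-wav-bytes")   # first time this audio is seen
second = is_new_recording(state, b"fake-wav-bytes")  # same audio on the next rerun
```

In the app, the check `if audio_bytes:` could then become `if is_new_recording(st.session_state, audio_bytes):`, so that a text submission made after a recording would not transcribe the old audio again.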

My environment

Python 3.10.12

Here is my code:

import base64
import streamlit as st
import gc
import time
import io
from gtts import gTTS
import whisper
from st_audiorec import st_audiorec

st.title("My GPT")
st.subheader("Hi I'm your platform assistant!")

buff, col, col2, buff2 = st.columns([1, 1, 1, 3])
model_name = "wizard-vicuna"

st.text("You can record your voice to begin asking:")
submit_button = None
query = None
audio_bytes = None

audio_bytes = st_audiorec()

st.text("Otherwise, typing it is also a good idea")

with st.form(key="myform", clear_on_submit=True):
    query = st.text_input(
        label="query", key="input", value="", label_visibility="collapsed"
    )
    submit_button = st.form_submit_button("Submit")

if "messages" not in st.session_state:
    st.session_state["messages"] = []

# Custom CSS for the text areas
st.markdown(
    """
    <style>
    .stTextArea [data-baseweb=base-input] {
        background-image: linear-gradient(45deg, #fff176, #ffeb3b, #fff176);
        -webkit-text-fill-color: black;
    }

    .stTextArea [data-baseweb=base-input] [disabled=""]{
        background-image: linear-gradient(45deg, #b0e0e6, #c2d5e1, #b0e0e6);
        -webkit-text-fill-color: black;
    }
    </style>
    """,
    unsafe_allow_html=True,
)

mybot = st.empty()

if audio_bytes:
    with st.spinner("Convert speech to text..."):
        ### voice to text part ###
        with open("test.wav", "wb") as f:
            f.write(audio_bytes)
        model = whisper.load_model("tiny.en")
        query = model.transcribe("test.wav")["text"]
        print(f"STT Result: {query}")

if submit_button or query:

  temp = ""
  mybot.text_area("willow: ", temp, height=200, disabled=True)
  for _, hist in enumerate(reversed(st.session_state["messages"])):
    if _ % 2 == 0:
        st.text_area("user", hist, key=f"u{_}")
        st.text_area("willow", hist, key=f"w{_}", disabled=True)
  for res in get_llm_response(): # list of string
    for r in res:
        temp += r
        mybot.text_area("willow: ", temp, height=200, disabled=True)
  if audio_bytes:
    with io.BytesIO() as sound_file:
        tts = gTTS(temp, lang="en").write_to_fp(sound_file)
        sound = sound_file.getvalue()
    audio_base64 = base64.b64encode(sound).decode("utf-8")
    # gTTS produces MP3 data, so the MIME type is audio/mpeg
    audio_tag = (
        f'<audio autoplay="true" src="data:audio/mpeg;base64,{audio_base64}"></audio>'
    )
    st.markdown(audio_tag, unsafe_allow_html=True)
submit_button = None
query = None
audio_bytes = None
audio_tag = None

Hi @Muhammad_Fhadli

To help the community better understand the problem you’re encountering, can you elaborate on any error message you got that showed it did not work? If it worked, how would we know that it worked? Thanks in advance for the clarifications.

Hi, I’m sorry, and thank you for the reminder. I have added some paragraphs to explain my intention. Thank you.