My intention is to create a chatbot that accepts both text and audio as input. If the user provides audio, the audio is transcribed to text first, the LLM generates the response, and TTS then converts that response into audio and plays it. I'm running the app on my own server.
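To make the flow concrete, here is a minimal sketch of the pipeline I'm aiming for, outside of Streamlit (my_llm() is only a placeholder for the real model call, and test.wav is assumed to already exist):

import whisper
from gtts import gTTS

def my_llm(prompt):
    # placeholder for the real model call (not shown in this post)
    return "echo: " + prompt

model = whisper.load_model("tiny.en")
query = model.transcribe("test.wav")["text"]   # speech -> text (test.wav assumed to exist)
response = my_llm(query)                       # text -> LLM response
gTTS(response, lang="en").save("reply.mp3")    # response -> speech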
Expected Behavior
I can switch between text and audio input with no bugs.
Current Behavior
When I switch from audio input to text input, the previous audio still seems to exist, so the chatbot appends bubbles containing several previous messages.
The app does not raise any error, so I cannot provide an error message. I want the system to return responses in the correct order. Currently, if the first input is text, the system returns the result properly. However, if the second input uses audio, the bubble from the previous text input is reprinted. I apologize if my explanation is unclear; please ask if you need more information.
I tried setting audio_bytes to None after appending the response, but that does not seem to work. Any help would be appreciated. Thank you.
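Since the previous recording seems to survive the rerun, I suspect I need to remember which recording was already handled instead of just clearing the variable. Something like the sketch below is what I have in mind, but it is only an idea, not code from my app (last_audio is a name I made up):

import streamlit as st
from st_audiorec import st_audiorec

if "last_audio" not in st.session_state:
    st.session_state["last_audio"] = None

audio_bytes = st_audiorec()

# only treat the recorder output as new input when it differs from the
# recording that was already handled on a previous rerun
if audio_bytes and audio_bytes != st.session_state["last_audio"]:
    st.session_state["last_audio"] = audio_bytes
    st.write("new recording -> transcribe and answer here")
else:
    st.write("no new recording -> handle text input only")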
My environment:
python 3.10.12
streamlit==1.25.0
streamlit-audiorec==0.1.3
Here is my code:
import base64
import streamlit as st
import gc
import time
import io
from gtts import gTTS
import whisper
from st_audiorec import st_audiorec
st.title("My GPT")
st.subheader("Hi I'm your platform assistant!")
buff, col, col2, buff2 = st.columns([1, 1, 1, 3])
model_name = "wizard-vicuna"
st.text("You can record you voice to begin asking:")
submit_button = None
query = None
audio_bytes = None
audio_bytes = st_audiorec()
st.text("Otherwise, type it also is a good idea")
with st.form(key="myform", clear_on_submit=True):
    query = st.text_input(
        label="query", key="input", value="", label_visibility="collapsed"
    )
    submit_button = st.form_submit_button("Submit")
if "messages" not in st.session_state:
st.session_state["messages"] = []
st.markdown(
    """
    <style>
    .stTextArea [data-baseweb=base-input] {
        background-image: linear-gradient(45deg, #fff176, #ffeb3b, #fff176);
        -webkit-text-fill-color: black;
    }
    .stTextArea [data-baseweb=base-input] [disabled=""]{
        background-image: linear-gradient(45deg, #b0e0e6, #c2d5e1, #b0e0e6);
        -webkit-text-fill-color: black;
    }
    </style>
    """,
    unsafe_allow_html=True,
)
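# empty placeholder that is updated in place while the response streams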
mybot = st.empty()
if audio_bytes:
    with st.spinner("Convert speech to text..."):
        gc.collect()
        ### voice to text part ###
        with open("test.wav", "wb") as f:
            f.write(audio_bytes)
        model = whisper.load_model("tiny.en")
        query = model.transcribe("test.wav")["text"]
        print(f"STT Result: {query}")
if submit_button or query:
    st.session_state.messages.append(query)
    temp = ""
    mybot.text_area("willow: ", temp, height=200, disabled=True)
    for _, hist in enumerate(reversed(st.session_state["messages"])):
        if _ % 2 == 0:
            st.text_area("user", hist, key=f"u{_}")
        else:
            st.text_area("willow", hist, key=f"w{_}", disabled=True)
    for res in get_llm_response():  # list of string
        for r in res:
            temp += r
            mybot.text_area("willow: ", temp, height=200, disabled=True)
            time.sleep(0.1)
    st.session_state.messages.append(temp)
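    ### text to speech part ###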
    if audio_bytes:
        with io.BytesIO() as sound_file:
            tts = gTTS(temp, lang="en").write_to_fp(sound_file)
            sound = sound_file.getvalue()
        audio_base64 = base64.b64encode(sound).decode("utf-8")
        audio_tag = (
            f'<audio autoplay="true" src="data:audio/wav;base64,{audio_base64}">'
        )
        st.markdown(audio_tag, unsafe_allow_html=True)
    submit_button = None
    query = None
    audio_bytes = None
    audio_tag = None