New component: streamlit-mic-recorder, designed for easy speech-to-text implementation

Hi!

Just uploaded my first component to PyPI: streamlit-mic-recorder

It is made for voice recording and easy speech-to-text implementation in your app.

I tried to make its usage as reliable and easy as possible, and worked hard to make it look just like an st.button widget, whatever theme you choose for your app.

The mic_recorder function records from the user’s mic and outputs a dictionary containing the mono audio/wav bytes (along with the sample rate and sample width). These bytes play directly in st.audio, can be written as-is to a .wav file, or passed to any audio processing/STT tool you prefer.
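
A minimal usage sketch (see the repo for the full API; the key names here follow the description above):

import streamlit as st
from streamlit_mic_recorder import mic_recorder

audio = mic_recorder(start_prompt="Start recording", stop_prompt="Stop recording", key="recorder")

if audio:
    st.audio(audio["bytes"])  # the WAV bytes play directly in st.audio
    with open("recording.wav", "wb") as f:
        f.write(audio["bytes"])  # or write them as-is to a .wav file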

The speech_to_text function relies on the SpeechRecognition module to perform speech to text and return the transcribed text directly.
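
Typical usage looks something like this (the parameter names mirror mic_recorder; check the repo for the exact signature):

import streamlit as st
from streamlit_mic_recorder import speech_to_text

text = speech_to_text(
    start_prompt="Start recording",
    stop_prompt="Stop recording",
    language="en",  # language hint forwarded to the recognizer
    key="stt",
)
if text:
    st.write(text)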

You’ll find all the info you need in the repo.

Give it a try and share your feedback!

Cheers!

Baptiste

By the way, I noticed that st.button’s background and border colors are not exactly the colors passed in the theme config; they look like weighted averages of those colors.

I iterated until I got a close enough result, but it’s still not exactly right.

Could a Streamlit dev give some details on how these colors are computed? This could be helpful when we want to design components that integrate well with any app theme.

@Charly_Wargnier

Thanks in advance.

Cheers.

Baptiste

That sounds like an amazing component! Thank you for sharing it with us!

Regarding your request, @jrieke might be able to direct you to the right people.

Best,
Charly

Hello,
I am trying to launch an app with your component.
On the Streamlit side, I launch it with the option --server.port=8080 in order to access it from another computer (the hosting computer runs Ubuntu 22.04 without a graphical interface).

The aim is to capture the sound from my browser (tried with Chrome, Edge and Firefox) and process this stream with speechbrain/asr-wav2vec2-commonvoice-fr to get speech recognition and a text summary of what was said.
I want to use this model because it is quite small and my graphics card is “small” too (an Nvidia GeForce 1050 Ti with 4 GB of VRAM); it is impossible, for example, to use Nvidia Canary, which is too big.
I checked the settings on my browsers: Firefox should ask me whether I grant access to the mic, while Chrome and Edge are configured to allow it.

On the console side of the “server” I get a warning:

/home/ild/miniconda3/envs/sbstt/lib/python3.10/site-packages/streamlit/watcher/local_sources_watcher.py:193: UserWarning: Torchaudio's I/O functions now support per-call backend dispatch. Importing backend implementation directly is no longer guaranteed to work. Please use `backend` keyword with load/save/info function, instead of calling the underlying implementation directly. lambda m: [p for p in m.__path__._path],
On the web interface, my button stays stuck on “Start recording”, and clicking it has no effect.

Here is my code, adapted to French:

from streamlit_mic_recorder import mic_recorder # to record audio from the mic
import streamlit as st # to render the web page
import io # to manipulate the audio stream
from speechbrain.pretrained import EncoderASR # required by the speechbrain model
import os # to open/close files
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"

def BrainSTT(start_prompt="Enregistrer",stop_prompt="Arrêter",just_once=False,use_container_width=False,language=None,callback=None,args=(),kwargs={}):
    if '_last_speech_to_text_transcript_id' not in st.session_state:
        st.session_state._last_speech_to_text_transcript_id=0
    if '_last_speech_to_text_transcript' not in st.session_state:
        st.session_state._last_speech_to_text_transcript=None
    audio = mic_recorder(start_prompt=start_prompt,stop_prompt=stop_prompt,just_once=just_once,use_container_width=use_container_width)
    new_output=False
    if audio is None:
        output=None
    else:
        st.write("audio n'est pas none")
        st.write("id audio : " + audio[id])
        id=audio['id']
        st.write("id = audio[id]")
        new_output=(id>st.session_state._last_speech_to_text_transcript_id)
        st.write("new_output = id>st.session_state._last_speech_to_text_transcript_id")
        if new_output:
            st.write("new_output n'est pas null")
            output=None
            st.write("output = None")
            st.session_state._last_speech_to_text_transcript_id=id
            st.write("st.session_state._last_speech_to_text_transcript_id = id")
            audio_BIO = io.BytesIO(audio['bytes'])
            st.write("audio_BIO se remplit")
            audio_BIO.name='audio.mp3'
            st.write("audio_BIO.name = 'audio.mp3'")
            success=False
            st.write("success = False")
            err=0
            st.write("err =" + err)
            while not success and err<3: #Retry up to 3 times in case ...
                try:
                    st.write("début du try")
                    transcript = st.session_state.openai_client.audio.transcriptions.create(
                        model="speechbrain/asr-wav2vec2-commonvoice-fr",
                        file=audio_BIO,
                        language=language
                    )
                    st.write("transcript en cours")
                except Exception as e:
                    print(str(e)) # log the exception in the terminal
                    err+=1
                else:
                    st.write("transcription finie")
                    success=True
                    st.write ("success = " + success)
                    output=transcript.text
                    st.write("output = transcript.text")
                    st.session_state._last_speech_to_text_transcript=output
                    st.write("st.session_state._last_speech_to_text_transcript = output")
        elif not just_once:
            output=st.session_state._last_speech_to_text_transcript
            st.write("output = st.session_state._last_speech_to_text_transcript")
        else:
            output=None
            st.write("dernier output = None")
    if new_output and callback:
        st.write("new_output and callback:")
        callback(*args,**kwargs)
    return output
    

text=BrainSTT(language='fr')
if text:
    st.write(text)

What should I change to get it working?
Thank you :slight_smile:

Thanks for your efforts.

I’d appreciate it if you could guide me on replacing the text on the recording button (“Start recording”) with an image or emoticon, and the same for “Stop recording”.
Is that possible?

Thanks for your interest

Hello, it’s just the start_prompt and stop_prompt defaults in this line:

def WhisperSTT(openai_api_key=None,start_prompt="Start recording",stop_prompt="Stop recording",just_once=False,use_container_width=False,language=None,callback=None,args=(),kwargs={},key=None):
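
Equivalently, since these are just keyword arguments, you can pass any labels (emoji included) at call time instead of editing the defaults:

audio = mic_recorder(start_prompt="🎤", stop_prompt="⏹️", key="recorder")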

I did it and I can confirm that the label on the button changes :wink:

Nice! I’m testing this for a future release of a Streamlit-powered app I wrote recently (pyrobbot) and it seems to work like a charm. Kudos! :slight_smile:

Hello @slain,

I don’t know; you’re using the component in a setting I’m not familiar with.
I would first recommend checking whether you manage to record and play the recorded audio using the component within your frontend/backend setting. Once you can record and play the audio, the component is working fine and the problem is elsewhere; from there, add layers of complexity step by step to locate the issue.

A few things catch my attention though:

  • You import EncoderASR but don’t seem to use it.
  • You declare a cuda/cpu device but don’t seem to use it either.
  • You use the OpenAI client SDK, which is basically an interface to the OpenAI REST API, to make inference calls to the speechbrain model. Make sure the model you’re calling is actually served from OpenAI endpoints. I’ve seen people use the OpenAI client to talk to HuggingFace models, but this involved changing the OpenAI client’s base_url parameter, as in the sketch below.
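
For illustration only, redirecting the client looks roughly like this (the URL is a placeholder, not a real serving endpoint):

from openai import OpenAI

# Placeholder URL: substitute whatever OpenAI-compatible server actually hosts the model.
client = OpenAI(base_url="https://your-inference-server.example/v1", api_key="...")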

But this makes little sense to me because, as I see it, the whole point of using torch is to run the inference locally on your GPU/CPU, so calling an OpenAI server in this setting seems odd.

After a quick look at the documentation of your speechbrain model (here), they suggest configuring and using the encoder like so:

from speechbrain.inference.ASR import EncoderASR

asr_model = EncoderASR.from_hparams(source="speechbrain/asr-wav2vec2-commonvoice-fr", savedir="pretrained_models/asr-wav2vec2-commonvoice-fr")
asr_model.transcribe_file('speechbrain/asr-wav2vec2-commonvoice-fr/example-fr.wav')

In that case you wouldn’t need the OpenAI client at all.
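
If you go that route, a rough sketch wiring the recorder into the local model could look like this (untested; it assumes the recorder’s WAV bytes can be written straight to a temp file):

import tempfile

import streamlit as st
from streamlit_mic_recorder import mic_recorder
from speechbrain.inference.ASR import EncoderASR

@st.cache_resource  # load the model once per server process, not on every rerun
def load_model():
    return EncoderASR.from_hparams(
        source="speechbrain/asr-wav2vec2-commonvoice-fr",
        savedir="pretrained_models/asr-wav2vec2-commonvoice-fr",
    )

audio = mic_recorder(start_prompt="Enregistrer", stop_prompt="Arrêter", key="rec")
if audio:
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
        f.write(audio["bytes"])  # mic_recorder returns a complete WAV file as bytes
        wav_path = f.name
    st.write(load_model().transcribe_file(wav_path))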

Hope this helps.

Cheers!

Hello @paulovcmedeiros!
Thanks for the nice comment. Happy that you like it!
All the best for your new app.
Cheers!

Hello and thank you for your answer.
I think your tips will help me improve the Python part of my code.
Then, I will have to mix HTML/JS to get an audio recording from the user’s microphone, PHP/Ajax to save it on the server, and Python to turn this audio into a text file.

Speechbrain is there to do all of this, French to French, on a 4 GB Nvidia card (that’s why I tried to use CUDA).

My problem is not a Streamlit problem, so I will stop cluttering this thread and go back to work :wink:

Nice!

Hello @Joti_Gokaraju!

Thanks for the comment. Did you manage to make it work for your app? I tried your app and everything seems to work fine on mobile for me (Firefox for Android).

Cheers!

Baptiste

Dear Baptiste,

Thanks for this beautiful component, it is very useful. Our app is stable on Linux, Windows, and, on mobile, on Android only. However, the component is unstable, or most of the time invisible, on iOS/macOS, on both iMac and iPhone. It is not browser-related; we tested it with different browsers. We couldn’t find the reason; maybe you have an easy solution in mind. All we can add for now is that iOS doesn’t support the WAV file type.

Thanks in advance and thank you again for your support to streamlit community.

Yep! Works great now!

Did you try it on iOS? There is a problem there.

Thank you very much! The speech recognition capability of the model is fantastic! But I am not able to change the style of the record button (I am using a Windows system). The application I am currently working on is for people who are partially visually impaired, so I intend to use a really big button that changes color on hover, to make it easy for them to use. Is there a way to change the font size, shape and background color of the button? I tried using stylable_container, but it only styles the external container and does not affect the button itself. Can someone please tell me how this can be done? Thanks again! Good day!

@Crystal_FireSword You need to go to the frontend part of the code in the streamlit_mic_recorder package: there is a folder called static; open the CSS file there and edit the rule for the button, e.g. adding a font size or padding:

.myButton {
  margin: 4px;
  background-color: var(--background-color);
  border: 1px solid var(--secondary-background-color);
  border-radius: 8px;
  color: var(--text-color);
  cursor: pointer;
  font-family: var(--font);
  font-size: 45px;
  padding: 5px 25px;
}
/*# sourceMappingURL=main.27af0a20.css.map */

Note: you need to keep this whole modified package and render it from there; this is the only option, since the component is rendered inside an iframe, so page-level CSS (which is all stylable_container can inject) cannot reach the button. If you reinstall the package, the frontend will revert to normal, so keep the modified copy as a folder in your environment and import it from there.

@hemekci I am also facing the same issue: on iMac it works in Chrome but not in Safari, and on iOS it does not work in Safari or any other browser. Did you find a solution for this? @B4PT0R your help is also much appreciated.

@slain
I came across the same warning as you posted. Could you share what you were able to find?

Thank you

@B4PT0R

Thank you for making this. It has helped me quite a bit. I am having issues with the button when accessing the app from a mobile browser. Any tips on what I can look at to get it to work?