[Solved] Check file extension after uploading - transcribe an audio recording

Hello,

I’m working on a small interface to let my users upload an audio file and then transcribe it using a model such as speechbrain (this may change if the results with speechbrain are not satisfactory).

At this point (I’m not a dev and a real beginner in Python; I just copied and pasted pieces of code found on the web, and my actual experience with the language amounts to about one hour), I would like to make sure my users can’t upload a non-audio file.

I created my uploader button like this:

audio_source=st.file_uploader(label="Choisir votre fichier", type=['wav','m4a','mp3','wma'])

Then, I understand that I have to check that the user actually uploaded a file:

if audio_source is not None:

But after this, will the type parameter prevent my users from uploading an unusable file (video, .ini file, image), or do I need to check the last four characters of the uploaded file’s name to make sure it is .mp3, .m4a, .wav or .wma?
And if it is better to check, which Python function can I use?

Thank you :slight_smile:

Edit: there was a typo in the upload_button, I forgot a closing ]

I tried going on before getting answers; here is the full code of the script:

import streamlit as st
from speechbrain.inference.ASR import WhisperASR
import torch

# device = "cuda:0" if torch.cuda.is_available() else "cpu" # to double-check, seems not to be needed

st.title("Veuillez téléverser un fichier audio pour lancer la transcription")

col1 = st.columns(1)[0] # display layout: a bar on the left for the buttons, a column on the rest to display the transcription (st.columns returns a list, so take the first element)

audio_source=st.sidebar.file_uploader(label="Choisir votre fichier", type=['wav','m4a','mp3','wma']) # upload button

if audio_source is not None: # check that a file has been uploaded
    asr_model = WhisperASR.from_hparams(source="speechbrain/asr-whisper-medium-commonvoice-fr", savedir="pretrained_models/asr-wav2vec2-commonvoice-fr", run_opts={"device":"cuda"})
    predicted_text = asr_model.transcribe_file(audio_source)
    col1.write("Texte transcrit")
    col1.write(predicted_text)
    st.sidebar.download_button(label="Télécharger la transcription", data=predicted_text, file_name='transcript.txt',mime='text/plain')

My console throws an error after I upload a small .m4a file (< 1 MB):

Traceback (most recent call last):
  File "/home/ild/miniconda3/lib/python3.11/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 535, in _run_script
    exec(code, module.__dict__)
  File "/home/ild/transcript.py", line 16, in <module>
    predicted_text = asr_model.transcribe_file(audio_source)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ild/miniconda3/lib/python3.11/site-packages/speechbrain/inference/ASR.py", line 406, in transcribe_file
    waveform = self.load_audio(path)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ild/miniconda3/lib/python3.11/site-packages/speechbrain/inference/interfaces.py", line 281, in load_audio
    path = fetch(fl, source=source, savedir=savedir)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ild/miniconda3/lib/python3.11/site-packages/speechbrain/utils/fetching.py", line 121, in fetch
    destination = savedir / save_filename
                  ~~~~~~~~^~~~~~~~~~~~~~~
TypeError: unsupported operand type(s) for /: 'PosixPath' and 'UploadedFile'

I understand that the error is raised when executing line 16, but I don’t understand what the error means, or what the mistake or omission is.

Edit: I tried adding import transformers at the top of the file; it just moves the error to line 17, so the error does not come from a missing transformers import.

Thank you :slight_smile:

Here’s a good way to get the extension:

import streamlit as st
from pathlib import Path

source = st.file_uploader(label="Upload", type=["txt", "csv"])

if source is not None:
    st.write(source.name)
    st.write("File extension:")
    st.write(Path(source.name).suffix)
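As a side note, if the suffix ever needs validating by hand (for example outside Streamlit, without the uploader’s type filter), a minimal sketch using pathlib; the helper name and allowed set below are illustrative:

```python
from pathlib import Path

# the extensions accepted by the uploader in this thread
ALLOWED_SUFFIXES = {".wav", ".m4a", ".mp3", ".wma"}

def is_allowed_audio(filename: str) -> bool:
    # Path.suffix keeps the leading dot ("record.MP3" -> ".MP3"),
    # so lower-case it before comparing
    return Path(filename).suffix.lower() in ALLOWED_SUFFIXES

print(is_allowed_audio("meeting.mp3"))  # True
print(is_allowed_audio("meeting.MP3"))  # True (case-insensitive)
print(is_allowed_audio("notes.ini"))    # False
```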

With that issue, it’s most likely because the library expects an actual file path. An easy way to get an actual file path from an uploaded file is to use NamedTemporaryFile:

import streamlit as st
from tempfile import NamedTemporaryFile
from pathlib import Path

source = st.file_uploader(label="Upload", type=["txt", "csv"])

if source is not None:
    st.write(source.name)
    st.write("File extension:")
    suffix = Path(source.name).suffix
    st.write(suffix)

    with NamedTemporaryFile(suffix=suffix) as temp_file:
        temp_file.write(source.getvalue())
        temp_file.seek(0)
        st.write(temp_file.name)

        st.write("File contents:")
        st.write(temp_file.read())

Thank you, this will allow me to check whether the file extension is right.
I also checked by trying to upload an Excel file with my uploader: it already shows a message stating that the file type is not allowed.
This tip will help me when I use Python without Streamlit :wink:

I’m a little bit lost; the error message made me think the problem was at the moment I call asr_model.transcribe_file(audio_source)?

So should I add something like:

if audio_source is not None:
   audio_path = Path(audio_source.name)
   asr_model.transcribe_file(audio_path)

Or do I need to save the audio file to my server and then pass the path explicitly?
Something like this?

if audio_source is not None:
    suffix = Path(audio_source.name).suffix
    if suffix == ".wma":
        # here I have to write audio_source to "/var/tmp/record.wma"
        audio_file = open("/var/tmp/record.wma", "w")
        # and then I could call asr_model.transcribe_file:
        asr_model.transcribe_file("/var/tmp/record.wma")
# and so on with mp3, m4a, wav...
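As a side note on the open idea above: writing the uploaded bytes to disk needs binary mode, since getvalue() returns bytes. A minimal sketch, with placeholder bytes and a temp-dir path standing in for the upload and /var/tmp:

```python
import os
import tempfile

# stand-in for audio_source.getvalue(), which returns the uploaded bytes
fake_audio_bytes = b"\x00\x01\x02\x03"

# a writable path (illustrative; the thread uses /var/tmp/record.wma)
target = os.path.join(tempfile.gettempdir(), "record.wma")

# "wb" (write binary) is required: mode "w" expects str, not bytes
with open(target, "wb") as f:
    f.write(fake_audio_bytes)

# read back to confirm the bytes landed on disk
with open(target, "rb") as f:
    assert f.read() == fake_audio_bytes

os.remove(target)
```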

Is there a concern about disk space with this method? Currently my “server” has ~230 GB; I could reach 900 GB, but not much more… And the recordings will be meeting recordings, for meetings that can last 1 or 2 hours…

Thank you

Edit: I found on w3schools that I can create a new file with open and mode "w"

So here is my modified code. I tried to handle the 4 allowed cases, but the error message I get makes me think I need another step between uploading the file and passing it to the ASR:

from pathlib import Path
import streamlit as st
from speechbrain.inference.ASR import WhisperASR
import torch

st.title("Veuillez téléverser un fichier audio pour lancer la transcription")

col1 = st.columns(1)[0] # display layout: a bar on the left for the buttons, a column on the rest to display the transcription (st.columns returns a list, so take the first element)

audio_source=st.sidebar.file_uploader(label="Choisir votre fichier", type=["wav","m4a","mp3","wma"]) # upload button
asr_model = WhisperASR.from_hparams(source="speechbrain/asr-whisper-medium-commonvoice-fr", savedir="pretrained_models/asr-wav2vec2-commonvoice-fr", run_opts={"device":"cuda"})

if audio_source is not None: # check that a file has been uploaded
    suffix = Path(audio_source).suffix
    match suffix:
        case "mp3":
            audio_file = open("/var/tmp/record.mp3")
            predicted_text = asr_model.transcribe_file("/var/tmp/record.mp3")
        case "m4a":
            audio_file = open("/var/tmp/record.m4a")
            predicted_text = asr_model.transcribe_file("/var/tmp/record.m4a")
        case "wav":
            audio_file = open("/var/tmp/record.wav")
            predicted_text = asr_model.transcribe_file("/var/tmp/record.wav")
        case "wma":
            audio_file = open("/var/tmp/record.wma")
            predicted_text = asr_model.transcribe_file("/var/tmp/record.wma")

    col1.write("Texte transcrit")
    col1.write(predicted_text)
    st.sidebar.download_button(label="Télécharger la transcription", data=predicted_text, file_name="transcript.txt",mime="text/plain")

And the new error message:

```
  File "/home/ild/miniconda3/lib/python3.11/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 535, in _run_script
    exec(code, module.__dict__)
  File "/home/ild/transcript.py", line 14, in <module>
    suffix = Path(audio_source).suffix
             ^^^^^^^^^^^^^^^^^^
  File "/home/ild/miniconda3/lib/python3.11/pathlib.py", line 871, in __new__
    self = cls._from_parts(args)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ild/miniconda3/lib/python3.11/pathlib.py", line 509, in _from_parts
    drv, root, parts = self._parse_args(args)
                       ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ild/miniconda3/lib/python3.11/pathlib.py", line 493, in _parse_args
    a = os.fspath(a)
        ^^^^^^^^^^^^
```

I guess that the best place for this step is right before initializing suffix (because at that point we know that audio_source has been uploaded, and we only have to save the file once, not once per format).
I just don’t know/understand WHAT I have to do?

Thanks

In your code with open, you’re not actually saving the contents of the file to /var/tmp/record.whatever, you’re just trying to open that file. I would recommend using the NamedTemporaryFile approach as I showed, because:

  1. You don’t have to worry about the exact path of the file; you can trust that Python will pick a temporary location that you can write to and read from safely.
  2. Once you’re done with the file, it is cleaned up, so you shouldn’t have to worry about disk space.
import streamlit as st
from pathlib import Path
from tempfile import NamedTemporaryFile

# asr_model is assumed to be initialized earlier with WhisperASR.from_hparams
source = st.file_uploader(label="Upload", type=["mp3", "wav"])

if source is not None:
    suffix = Path(source.name).suffix

    with NamedTemporaryFile(suffix=suffix) as temp_file:
        temp_file.write(source.getvalue())
        temp_file.seek(0)
        predicted_text = asr_model.transcribe_file(temp_file.name)
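On the disk-space question: the temporary file is removed automatically when the with block exits, which a short sketch can verify:

```python
import os
from tempfile import NamedTemporaryFile

with NamedTemporaryFile(suffix=".wav") as temp_file:
    temp_file.write(b"RIFF....")      # stand-in for audio_source.getvalue()
    temp_file.seek(0)
    temp_path = temp_file.name
    print(os.path.exists(temp_path))  # True: the file exists inside the block

# leaving the with-block deletes the file, so no disk space accumulates
print(os.path.exists(temp_path))      # False
```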

OK, I think I got it, thank you for your patience! :slight_smile:

My first error was here:

suffix = Path(audio_source).suffix

The correct code is:

suffix = Path(audio_source.name).suffix

This one is OK; I was trying to get a suffix from my data instead of getting it from the filename.

My second error is in the open command: I was just creating the file, not filling it with my audio content.

I think I’m on the right track, but now it’s CUDA throwing me an error :frowning:

OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 3.94 GiB of which 2.69 MiB is free. Including non-PyTorch memory, this process has 3.93 GiB memory in use. Of the allocated memory 3.86 GiB is allocated by PyTorch, and 12.81 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Here is the “new” version of the code:

from pathlib import Path
from tempfile import NamedTemporaryFile
import streamlit as st
from speechbrain.inference.ASR import WhisperASR
import torch

st.title("Veuillez téléverser un fichier audio pour lancer la transcription")

col1 = st.columns(1) # display layout: a bar on the left for the buttons, a column on the rest to display the transcription

audio_source=st.sidebar.file_uploader(label="Choisir votre fichier", type=["wav","m4a","mp3","wma"]) # upload button
asr_model = WhisperASR.from_hparams(source="speechbrain/asr-whisper-medium-commonvoice-fr", savedir="pretrained_models/asr-whisper-medium-commonvoice-fr", run_opts={"device":"cuda"})

if audio_source is not None: # check that a file has been uploaded
    st.write("Transcription en cours ...")
    predicted_text = "None"
    suffix = Path(audio_source.name).suffix

    with NamedTemporaryFile(suffix=suffix) as temp_file:
        temp_file.write(audio_source.getvalue())
        temp_file.seek(0)
        predicted_text = asr_model.transcribe_file(temp_file.name)
    st.write("Fichier transcrit :")
    st.write(predicted_text)
    st.sidebar.download_button(label="Télécharger la transcription", data=predicted_text, file_name="transcript.txt",mime="text/plain")

I am surprised, because the model’s size seems to be around 2.9 to 3.06 GB; shouldn’t it run on a 4 GB graphics card?
Or should I try to run the script on the CPU by removing , run_opts={"device":"cuda"} ?

OK, so I tried without running the ASR on the graphics card, and it works.
But I wonder why I can’t use CUDA for a 3 GB model with 4 GB of graphics memory?

Does anyone have an idea of what the problem is at this level?
I don’t think the problem comes from a graphical interface: it’s an Ubuntu server with only a command-line interface, so it can’t be using more than a few megabytes?
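A possible contributor, not confirmed in this thread: Streamlit reruns the whole script on every interaction, so WhisperASR.from_hparams may run more than once, and each load claims GPU memory. Streamlit’s st.cache_resource decorator exists for exactly this; the memoization idea can be sketched without Streamlit using functools.lru_cache (the names below are illustrative):

```python
import functools

calls = {"n": 0}

# stand-in for an expensive model load such as WhisperASR.from_hparams();
# in a Streamlit app, the st.cache_resource decorator plays this memoizing
# role so the model is loaded once per process, not on every script rerun
@functools.lru_cache(maxsize=1)
def load_model():
    calls["n"] += 1
    return object()  # placeholder for the real model

m1 = load_model()
m2 = load_model()  # cached: the load body does not run again

print(calls["n"])  # 1
print(m1 is m2)    # True
```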

Hello,
I think I will mark the topic as solved, because I have no more errors.
Now I just have to feed it some audio recordings to check whether the speech recognition works well.

Here is my final code, in case it can help someone :slight_smile:

from pathlib import Path
from tempfile import NamedTemporaryFile
import streamlit as st
from speechbrain.inference.ASR import WhisperASR
import torch

st.title("Veuillez téléverser un fichier audio pour lancer la transcription")

col1 = st.columns(1) # display layout: a bar on the left for the buttons, a column on the rest to display the transcription

audio_source=st.sidebar.file_uploader(label="Choisir votre fichier", type=["wav","m4a","mp3","wma"]) # upload button
asr_model = WhisperASR.from_hparams(source="speechbrain/asr-whisper-medium-commonvoice-fr", savedir="pretrained_models/asr-whisper-medium-commonvoice-fr")

if audio_source is not None: # check that a file has been uploaded
    st.write("Transcription en cours ...")
    predicted_text = "None"
    suffix = Path(audio_source.name).suffix

    with NamedTemporaryFile(suffix=suffix) as temp_file:
        temp_file.write(audio_source.getvalue())
        temp_file.seek(0)
        predicted_text = asr_model.transcribe_file(temp_file.name)
    st.write("Fichier transcrit :")
    st.write(predicted_text)
    st.sidebar.download_button(label="Télécharger la transcription", data=predicted_text, file_name="transcript.txt",mime="text/plain")

The script seems to be running on the GPU like that, because nvidia-smi shows that 4028 MiB are used on the GTX 1050 Ti, and there is a Python process using 4024 MiB.

If somebody wants to use this code, in French or in another language with another model, no problem for me; I wouldn’t have been able to write the script without @blackary’s help, and I think it wouldn’t be fair to keep the solution to myself.


This topic was automatically closed 2 days after the last reply. New replies are no longer allowed.