Open Source Song Transcript and Video Caption Generator with Whisper

smaranjitghose · December 27, 2022, 3:27am

About:

I am building a product that helps clients automate internal meeting summaries, song transcription for budding singers, and real-time analysis of product video reviews, all at an affordable cost.

I wanted to share an MVP I built in the early stages as a demonstration. I hope it is useful for the community to build upon and adapt to similar use cases in their own areas of interest.

Features:

Grab any video from YouTube and generate captions (which can be saved as an SRT or VTT file), shown side by side with the video and the separated audio
Generate accurate lyrics for a song
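
For anyone who wants to build on the caption export, here is a minimal sketch of how timed segments could be written out as SRT or VTT. It assumes segments shaped like the ones Whisper's `transcribe` returns (dicts with `start` and `end` in seconds plus `text`). The helper names `format_timestamp`, `segments_to_srt` and `segments_to_vtt` are illustrative and not taken from the repo.

```python
def format_timestamp(seconds: float, decimal_marker: str = ",") -> str:
    """Convert seconds to HH:MM:SS,mmm (SRT) or HH:MM:SS.mmm (VTT)."""
    milliseconds = round(seconds * 1000)
    hours, milliseconds = divmod(milliseconds, 3_600_000)
    minutes, milliseconds = divmod(milliseconds, 60_000)
    secs, milliseconds = divmod(milliseconds, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{decimal_marker}{milliseconds:03d}"


def segments_to_srt(segments) -> str:
    """Render segments as SRT: numbered cues separated by blank lines."""
    blocks = []
    for index, segment in enumerate(segments, start=1):
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{segment['text'].strip()}\n")
    return "\n".join(blocks)


def segments_to_vtt(segments) -> str:
    """Render segments as WebVTT: a WEBVTT header, then unnumbered cues."""
    blocks = ["WEBVTT\n"]
    for segment in segments:
        start = format_timestamp(segment["start"], decimal_marker=".")
        end = format_timestamp(segment["end"], decimal_marker=".")
        blocks.append(f"{start} --> {end}\n{segment['text'].strip()}\n")
    return "\n".join(blocks)


# Hypothetical segments for illustration; real ones would come from the model.
example_segments = [
    {"start": 0.0, "end": 2.5, "text": " Hello there"},
    {"start": 2.5, "end": 4.0, "text": " Bye"},
]
srt_text = segments_to_srt(example_segments)
vtt_text = segments_to_vtt(example_segments)
```

The only differences between the two formats in this sketch are the header, the cue numbering, and the decimal marker in the timestamps, so one timestamp helper can serve both.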

To be updated:

Speaker Diarization
Word Level Captioning + Burning to Video

Demo Snapshot

GitHub Repo:

Click Here

I look forward to suggestions from the community. Any pointers to Streamlit-specific resources for hosting GPU-heavy applications like this on platforms such as Vultr would be a great help! (I have already taken care of Dockerization and tested with Heroku.)

TomJohn · December 27, 2022, 3:57pm

Interesting one. Probably more future-proof than downloading captions directly from YouTube using youtube_transcript_api. It would be a nice feature to see how the generated transcript differs from the one downloaded directly.

As I've also been playing around with YouTube video summarization, it would be nice to share experiences.
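
To make the comparison idea concrete, here is a rough sketch using Python's standard `difflib`. It assumes both transcripts are already available as plain strings; the function name `compare_transcripts` and the sample sentences are made up for illustration.

```python
import difflib


def compare_transcripts(generated: str, reference: str):
    """Return a word-level similarity ratio and the list of differing spans."""
    ref_words = reference.lower().split()
    gen_words = generated.lower().split()
    # autojunk=False avoids the popularity heuristic on long transcripts.
    matcher = difflib.SequenceMatcher(None, ref_words, gen_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # (kind of change, words in the reference, words in the generated text)
            changes.append((tag, " ".join(ref_words[i1:i2]), " ".join(gen_words[j1:j2])))
    return matcher.ratio(), changes


# Hypothetical transcripts: the downloaded one keeps a filler word.
downloaded = "um so today we will talk about whisper"
generated = "so today we will talk about whisper"
similarity, changes = compare_transcripts(generated, downloaded)
```

The ratio gives a quick overall score, while the change list shows exactly which words were dropped, added or replaced.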

Are you planning to use pre-trained summarization models from the Hugging Face Transformers library, or the OpenAI API?
How do you approach splitting text into chunks that models can handle?
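
For context on the chunking question, this is roughly the kind of approach I mean: a minimal sketch that packs whole sentences into chunks under a word budget. The word count is only a crude stand-in for a model's token limit, and `split_into_chunks` and `max_words` are names chosen for this example.

```python
import re
from typing import List


def split_into_chunks(text: str, max_words: int = 400) -> List[str]:
    """Group sentences into chunks of at most max_words words each.

    A single sentence longer than max_words is kept whole as its own chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current: List[str] = []
    count = 0
    for sentence in sentences:
        if not sentence:
            continue
        n_words = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and count + n_words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "One two three. Four five six. Seven eight nine."
sample_chunks = split_into_chunks(sample, max_words=6)
```

Each chunk could then be summarized separately and the partial summaries combined, though a real setup would count tokens with the model's own tokenizer rather than words.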

smaranjitghose · December 28, 2022, 9:46pm

Dear @TomJohn,

Thanks for checking out the project.

- The Whisper model generates better transcripts than YouTube's auto-caption feature, apparently because OpenAI trained it (especially the large-v2 and medium models) on data covering different accents, along with multilingual audio of a similar magnitude.
- However, it completely ignores filler words like “Ah, umm…”. As for transcripts/subtitles uploaded by YouTube users themselves, the results are quite similar for English, French, German and Japanese.
- For Indian languages like Tamil or Hindi, the generated transcripts are not as satisfactory.
- As far as songs are concerned, especially rap and rock songs, it works incredibly well in most languages.
- Yes, my actual product involves video summarization + sentiment analysis, so I am using pre-trained models from Hugging Face for those. I will build individual APIs for them and integrate them with Streamlit.
- Whisper can easily handle video/audio up to 1.5 hours long, so generating subtitles/transcripts is not an issue. As for summarizing the generated transcript, the Google T5-based models on Hugging Face can handle even entire books, so I have not yet needed to split the text.

TomJohn · December 29, 2022, 8:57am

Thank you for the detailed answer!

