Automated Multilingual Video Translation Framework Using NLLB
The Automated Multilingual Video Translation Framework simplifies video translation by automating audio extraction, transcription, translation, and text-to-speech (TTS) synthesis, producing synchronized multilingual outputs. Built with FFmpeg, OpenAI's Whisper, Meta's NLLB, and the gTTS library, it supports Indian languages such as Hindi, Tamil, and Kannada while preserving technical accuracy, keeping content accessible to global audiences.
Abhinav Saluja
1/24/2025 · 1 min read
1. Inspiration
With the rapid growth of digital video platforms, content creators face a significant challenge in reaching global audiences due to language barriers. Traditional video translation methods are labor-intensive, costly, and error-prone, particularly when technical jargon must be preserved and audio kept in sync with video. This project bridges that gap through automation and AI-driven tooling.
2. What It Does
The framework automates video translation by:
Extracting audio from videos.
Transcribing speech using OpenAI's Whisper model.
Translating text into target languages while preserving critical entities.
Converting translated text to speech using TTS (Text-to-Speech) technology.
Synchronizing the audio with the original video to produce multilingual outputs.
3. How We Built It
Audio Extraction: Used FFmpeg to extract the audio track from each video file (see the extraction/merge sketch below).
Transcription: Leveraged OpenAI's Whisper for high-accuracy audio-to-text conversion with segment-level timestamps (see the transcription sketch below).
Translation: Incorporated Meta's NLLB (No Language Left Behind) model to translate the transcript while protecting named entities and technical terms (see the translation sketch below).
Text-to-Speech: Used the gTTS (Google Text-to-Speech) library to generate multilingual audio (see the TTS sketch below).
Synchronization: Employed FFmpeg to merge the translated audio back into the video, ensuring alignment and preserving video quality (the extraction/merge sketch below includes this step).
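The post doesn't include code, so the following sketches are minimal illustrations of each step rather than the project's exact implementation. First, the two FFmpeg steps: extracting a 16 kHz mono WAV (the format Whisper expects) and re-muxing the translated audio into the video. File names and flag choices are assumptions.

```python
import subprocess

def extract_audio(video_path: str, audio_path: str = "audio.wav") -> str:
    # -vn drops the video stream; 16 kHz mono PCM suits Whisper.
    subprocess.run([
        "ffmpeg", "-y", "-i", video_path,
        "-vn", "-acodec", "pcm_s16le", "-ar", "16000", "-ac", "1",
        audio_path,
    ], check=True)
    return audio_path

def merge_audio(video_path: str, audio_path: str, out_path: str = "translated.mp4") -> str:
    # Keep the original video stream untouched (-c:v copy) and swap in
    # the translated audio; -shortest trims any trailing length mismatch.
    subprocess.run([
        "ffmpeg", "-y", "-i", video_path, "-i", audio_path,
        "-map", "0:v", "-map", "1:a", "-c:v", "copy", "-shortest",
        out_path,
    ], check=True)
    return out_path
```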
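Next, transcription with timestamps: a minimal sketch using the openai-whisper package, where the model size ("small") and file name are assumptions.

```python
import whisper  # pip install openai-whisper

model = whisper.load_model("small")
result = model.transcribe("audio.wav")

# Each segment carries start/end times, which later drive synchronization.
for seg in result["segments"]:
    print(f"[{seg['start']:7.2f} -> {seg['end']:7.2f}] {seg['text'].strip()}")
```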
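For translation, a sketch using the Hugging Face NLLB checkpoint. The distilled 600M model and the English-to-Hindi codes (eng_Latn, hin_Deva) are assumptions, and the placeholder-masking helper is an illustrative take on entity protection, not the project's exact method.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(MODEL, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL)

def translate(text: str, tgt_lang: str = "hin_Deva") -> str:
    # Force the decoder to start in the target language.
    inputs = tokenizer(text, return_tensors="pt")
    output = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_length=512,
    )
    return tokenizer.batch_decode(output, skip_special_tokens=True)[0]

def translate_protected(text: str, entities: list[str], tgt_lang: str = "hin_Deva") -> str:
    # Mask entities with numbered placeholders, translate, then restore.
    # Illustrative only: a production system must verify the placeholders
    # survive tokenization and decoding intact.
    for i, ent in enumerate(entities):
        text = text.replace(ent, f"ENT{i}")
    translated = translate(text, tgt_lang)
    for i, ent in enumerate(entities):
        translated = translated.replace(f"ENT{i}", ent)
    return translated

print(translate_protected("Whisper converts speech to text.", ["Whisper"]))
```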
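Finally, speech synthesis with the gTTS library; the sample string and output file name are placeholders.

```python
from gtts import gTTS  # pip install gTTS

# Output of the NLLB step would go here; "hi" is Hindi ("ta" Tamil, "kn" Kannada).
translated_text = "नमस्ते दुनिया"
gTTS(text=translated_text, lang="hi").save("translated_hi.mp3")
```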
4. Challenges We Ran Into
Handling silent or noisy segments in videos (one detection approach is sketched after this list).
Ensuring precise synchronization between translated audio and video.
Maintaining the fidelity of technical terms and named entities during translation.
Processing large video files efficiently.
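The post doesn't say how silent segments were detected; one common approach, sketched here purely as an assumption, is FFmpeg's silencedetect filter, whose stderr log can be parsed for silence boundaries.

```python
import re
import subprocess

# silencedetect logs "silence_start: <t>" and "silence_end: <t>" on stderr;
# the -30 dB threshold and 0.5 s minimum duration are tunable assumptions.
proc = subprocess.run(
    ["ffmpeg", "-i", "audio.wav",
     "-af", "silencedetect=noise=-30dB:d=0.5", "-f", "null", "-"],
    capture_output=True, text=True,
)
starts = [float(t) for t in re.findall(r"silence_start: ([\d.]+)", proc.stderr)]
ends = [float(t) for t in re.findall(r"silence_end: ([\d.]+)", proc.stderr)]
print(list(zip(starts, ends)))
```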
5. Accomplishments That We're Proud Of
Successfully automated end-to-end video translation with minimal human intervention.
Achieved a ROUGE score of 0.609, indicating strong translation quality (see the ROUGE sketch after this list).
Enhanced accessibility for non-native speakers across multiple Indian languages (Hindi, Tamil, Kannada, etc.).
Preserved technical accuracy and synchronization, making the framework robust for educational and technical content.
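For reference, ROUGE can be computed with the rouge-score package; the ROUGE-L variant and the example strings below are assumptions, since the post doesn't specify which variant produced the 0.609.

```python
from rouge_score import rouge_scorer  # pip install rouge-score

# F-measure of the longest-common-subsequence overlap (ROUGE-L).
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
score = scorer.score(
    "extract the audio track from the video",  # reference text
    "extract the audio from the video file",   # system output
)
print(round(score["rougeL"].fmeasure, 3))
```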
6. What We Learned
The importance of integrating entity recognition to maintain translation accuracy.
Effective use of AI models like Whisper and NLLB in addressing real-world challenges.
Synchronization complexities when dealing with diverse audio-video inputs.
7. What's Next
Expanding support for additional languages and dialects.
Enhancing TTS capabilities for more natural-sounding audio outputs.
Optimizing processing time for handling large-scale video libraries.
Incorporating user feedback mechanisms to further refine translations.
8. List of Technologies/Tech Stacks Used
Programming Languages: Python
Tools and Libraries: FFmpeg (audio/video processing), yt-dlp (video downloads)
AI Models: OpenAI’s Whisper, Meta’s NLLB
Text-to-Speech: gTTS, a Python library backed by Google Translate's TTS service
Others: Regular expressions for timestamp parsing (sketched below), Named Entity Recognition (NER) for protecting entities during translation.
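A small sketch of the regex timestamp parsing mentioned above; the SRT-style HH:MM:SS,mmm format is an assumption.

```python
import re

TIMESTAMP = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")

def to_seconds(stamp: str) -> float:
    # Convert "HH:MM:SS,mmm" into seconds for alignment arithmetic.
    h, m, s, ms = map(int, TIMESTAMP.match(stamp).groups())
    return h * 3600 + m * 60 + s + ms / 1000

print(to_seconds("00:01:23,456"))  # 83.456
```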
9. Try it out
https://github.com/Sreeharsha-Sadhu/Python-Video-Translation-Framework
10. By:
Sreeharsha Sadhu (sreeharsha.sadhu@gmail.com)
Hemanth Ramasubu (hemanthram078@gmail.com)
Abhinav Saluja (abhinavsaluja2004@gmail.com)
Sricharan Silaparasetty (ssricharan26@gmail.com)
Suhas Kesavan (bl.en.u4cse22060@bl.students.amrita.edu)

