Automated Multilingual Video Translation Framework Using NLLB

The Automated Multilingual Video Translation Framework simplifies video translation by automating audio extraction, transcription, translation, and text-to-speech (TTS) synthesis, producing synchronized multilingual outputs. Built with FFmpeg, OpenAI's Whisper, Meta's NLLB, and Google's gTTS, it supports Indian languages such as Hindi, Tamil, and Kannada while preserving technical accuracy and keeping content accessible to global audiences.


Abhinav Saluja

1/24/2025 · 1 min read

1. Inspiration

With the rapid growth of digital video platforms, content creators face a significant challenge in reaching global audiences due to language barriers. Traditional video translation methods are labor-intensive, costly, and error-prone, particularly for preserving technical jargon and synchronization. This project aims to bridge this gap through automation and AI-driven solutions.

2. What It Does

The framework automates video translation by:

  • Extracting audio from videos.

  • Transcribing speech using OpenAI's Whisper model.

  • Translating text into target languages while preserving critical entities.

  • Converting translated text to speech using TTS (Text-to-Speech) technology.

  • Synchronizing the audio with the original video to produce multilingual outputs (a high-level sketch of this pipeline follows the list).
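
At a pseudocode level, these stages chain together as shown below. Every helper name in this sketch is a placeholder; minimal working versions of each stage are sketched after the "How We Built It" list.

```python
# Pseudocode-level sketch of the end-to-end flow. All helpers named here
# are placeholders that mirror the per-stage sketches in the next section.

def translate_video(video_path: str, target_lang: str) -> str:
    audio = extract_audio(video_path)                             # FFmpeg
    segments = transcribe_with_timestamps(audio)                  # Whisper
    translated = [translate(s["text"], target_lang) for s in segments]  # NLLB
    dubbed = synthesize_speech(" ".join(translated), target_lang)       # gTTS
    return merge_audio_video(video_path, dubbed)                  # FFmpeg mux
```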

3. How We Built It

  • Audio Extraction: Used FFmpeg to extract audio from video files (minimal sketches of each stage follow this list).

  • Transcription: Leveraged OpenAI’s Whisper for high-accuracy audio-to-text conversion with timestamps.

  • Translation: Incorporated Meta’s NLLB (No Language Left Behind) model to translate text while protecting named entities and technical terms.

  • Text-to-Speech: Used Google’s gTTS API for generating multilingual audio.

  • Synchronization: Employed FFmpeg to merge translated audio with video, ensuring alignment and preserving video quality.
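
A minimal sketch of the extraction stage, assuming ffmpeg is installed and on PATH; the flags request 16 kHz mono 16-bit PCM, the format Whisper works with directly. File names are illustrative.

```python
# Audio extraction sketch: calls the ffmpeg CLI via subprocess.
import subprocess

def extract_audio(video_path: str, audio_path: str = "audio.wav") -> str:
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path,
         "-vn",                    # drop the video stream
         "-acodec", "pcm_s16le",   # 16-bit PCM
         "-ar", "16000",           # 16 kHz, what Whisper resamples to anyway
         "-ac", "1",               # mono
         audio_path],
        check=True,
    )
    return audio_path
```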
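
Transcription uses the open-source openai-whisper package; the "base" model size here is an assumption, not a confirmed project setting.

```python
# Transcription sketch using openai-whisper (pip install openai-whisper).
import whisper

def transcribe_with_timestamps(audio_path: str) -> list[dict]:
    model = whisper.load_model("base")  # model size is an assumption
    result = model.transcribe(audio_path)
    # Whisper returns segments carrying start/end times in seconds.
    return [
        {"start": seg["start"], "end": seg["end"], "text": seg["text"].strip()}
        for seg in result["segments"]
    ]
```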
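
Translation can be run through Hugging Face transformers with the public facebook/nllb-200-distilled-600M checkpoint. The placeholder-based masking shown here is one common way to protect named entities and technical terms; the project's exact scheme may differ.

```python
# NLLB translation sketch with simple placeholder-based entity protection.
# Checkpoint choice and masking scheme are assumptions.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(MODEL, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL)

def translate(text: str, tgt_lang: str = "hin_Deva",
              protected_terms: tuple[str, ...] = ()) -> str:
    # Mask terms that must survive translation verbatim.
    masks = {}
    for i, term in enumerate(protected_terms):
        token = f"ENT{i}"
        masks[token] = term
        text = text.replace(term, token)
    inputs = tokenizer(text, return_tensors="pt")
    output = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_length=256,
    )
    translated = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
    # Restore the protected terms.
    for token, term in masks.items():
        translated = translated.replace(token, term)
    return translated
```

NLLB identifies target languages with FLORES-200 codes, e.g. hin_Deva for Hindi, tam_Taml for Tamil, and kan_Knda for Kannada.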
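
For the last two stages, gTTS generates the dubbed track and ffmpeg swaps it in while copying the video stream untouched. The gTTS language codes ("hi", "ta", "kn") and file names are assumptions.

```python
# TTS + mux sketch: synthesize speech with gTTS, then remux with ffmpeg.
import subprocess
from gtts import gTTS

def synthesize_speech(text: str, lang: str = "hi",
                      out_path: str = "dubbed.mp3") -> str:
    gTTS(text=text, lang=lang).save(out_path)  # lang: "hi", "ta", "kn", ...
    return out_path

def merge_audio_video(video_path: str, audio_path: str,
                      out_path: str = "translated.mp4") -> str:
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-i", audio_path,
         "-map", "0:v:0",      # video from the original file
         "-map", "1:a:0",      # audio from the dubbed track
         "-c:v", "copy",       # don't re-encode the video
         "-shortest",          # stop at the shorter stream
         out_path],
        check=True,
    )
    return out_path
```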

4. Challenges We Ran Into

  • Handling silent or noisy segments in videos.

  • Ensuring precise synchronization between translated audio and video.

  • Maintaining the fidelity of technical terms and named entities during translation.

  • Processing large video files efficiently.

5. Accomplishments That We're Proud Of

  • Successfully automated end-to-end video translation with minimal human intervention.

  • Achieved a ROUGE score of 0.609, demonstrating high translation quality (an illustrative scoring snippet follows this list).

  • Enhanced accessibility for non-native speakers across multiple Indian languages (Hindi, Tamil, Kannada, etc.).

  • Preserved technical accuracy and synchronization, making the framework robust for educational and technical content.
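
For reference, ROUGE overlap scores of this kind can be computed with the rouge-score package; the two sentences below are invented examples, not outputs or data from this project.

```python
# Illustrative ROUGE computation (pip install rouge-score).
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(
    "the model translates technical lectures into hindi",  # reference
    "the model translates technical lectures to hindi",    # candidate
)
print(round(scores["rouge1"].fmeasure, 3))  # unigram-overlap F-measure
```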

6. What We Learned

  • The importance of integrating entity recognition to maintain translation accuracy.

  • Effective use of AI models like Whisper and NLLB in addressing real-world challenges.

  • Synchronization complexities when dealing with diverse audio-video inputs.

7. What's Next

  • Expanding support for additional languages and dialects.

  • Enhancing TTS capabilities for more natural-sounding audio outputs.

  • Optimizing processing time for handling large-scale video libraries.

  • Incorporating user feedback mechanisms to further refine translations.

8. List of Technologies/Tech Stacks Used

  • Programming Languages: Python

  • Tools and Libraries: FFmpeg, yt-dlp

  • AI Models: OpenAI’s Whisper, Meta’s NLLB

  • APIs: Google Text-to-Speech (gTTS)

  • Others: Regular Expressions (Regex) for timestamp parsing (see the snippet below), Named Entity Recognition (NER).
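
As an example of the regex-based timestamp parsing, here is a small parser for SubRip-style "HH:MM:SS,mmm --> HH:MM:SS,mmm" stamps; the exact format the project parses is an assumption.

```python
# Timestamp parsing sketch; assumes SubRip-style "HH:MM:SS,mmm" stamps.
import re
from typing import Optional

TS = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> "
                r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")

def parse_timestamps(line: str) -> Optional[tuple[float, float]]:
    m = TS.search(line)
    if not m:
        return None
    h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, m.groups())
    start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
    end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
    return start, end

print(parse_timestamps("00:01:02,500 --> 00:01:05,250"))  # (62.5, 65.25)
```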

9. Try It Out

https://github.com/Sreeharsha-Sadhu/Python-Video-Translation-Framework

10. By:

  1. Sreeharsha Sadhu (sreeharsha.sadhu@gmail.com)

  2. Hemanth Ramasubu (hemanthram078@gmail.com)

  3. Abhinav Saluja (abhinavsaluja2004@gmail.com)

  4. Sricharan Silaparasetty (ssricharan26@gmail.com)

  5. Suhas Kesavan (bl.en.u4cse22060@bl.students.amrita.edu)