Whisper, Translate, Speak, Sync: Video Translation for Multilingual Video Conferencing Using Generative AI

Rafiei Oskooei A., Caglar E., ŞAHİN İ., Kayabay A., AKTAŞ M. S.

Workshops of the International Conference on Computational Science and Its Applications, ICCSA 2025, İstanbul, Türkiye, 30 June - 03 July 2025, vol.15890 LNCS, pp.217-234, (Full Text Paper)

Publication Type: Conference Paper / Full Text Paper
Volume Number: 15890 LNCS
DOI Number: 10.1007/978-3-031-97606-3_15
City of Publication: İstanbul
Country of Publication: Türkiye
Page Numbers: pp.217-234
Keywords: Computer Vision, Deep Learning, Generative AI, Human-AI Interaction, Video Conferencing, Video Translation
Affiliated with Yıldız Teknik Üniversitesi: Yes

Abstract

This paper addresses the growing need for seamless communication in multilingual video conferencing by presenting a novel, computationally efficient methodology for real-time video translation. While advancements in neural networks have enabled accurate speech translation and voice cloning, integrating these with lip synchronization for realistic talking head generation remains a challenge, particularly for real-time applications. This paper introduces a comprehensive video translation pipeline leveraging open-source deep learning models. We further propose a scalable system architecture incorporating a “Token Ring” mechanism to manage speaker turns and minimize computational load, addressing key challenges related to latency, scalability, and personalization in multilingual settings. A segmented batched processing protocol with inverse throughput thresholding and overlapping buffering is implemented to achieve near real-time performance. A simplified, universal prototype is developed to demonstrate the feasibility and efficacy of our approach, providing a foundation for building next-generation multilingual video conferencing systems. This work offers a practical framework for developers and businesses aiming to create inclusive and effective communication platforms.
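
The abstract does not say how the “Token Ring” mechanism is implemented. As a rough illustration only, the Python sketch below shows one possible reading: a single token circulates among participants in a fixed order, and only the current holder's stream would be sent through the translation pipeline, so the cost does not grow with the number of silent participants. The class name `TokenRing` and the methods `pass_token` and `should_translate` are hypothetical and are not taken from the paper.

```python
from collections import deque


class TokenRing:
    """Hypothetical speaker-turn manager (an assumption, not the paper's design).

    Exactly one participant holds the token at any time; only that
    participant's audio/video would be translated and lip-synced.
    """

    def __init__(self, participants):
        if not participants:
            raise ValueError("at least one participant is required")
        # The left end of the deque is the current token holder.
        self._ring = deque(participants)

    @property
    def holder(self):
        """Return the participant who currently holds the token."""
        return self._ring[0]

    def pass_token(self):
        """Hand the token to the next participant in circular order."""
        self._ring.rotate(-1)  # [a, b, c] -> [b, c, a]
        return self.holder

    def should_translate(self, participant):
        """True only for the token holder, so other streams are skipped."""
        return participant == self.holder


ring = TokenRing(["alice", "bob", "carol"])
for _ in range(4):
    speaker = ring.holder
    print(f"translating stream of {speaker}")
    ring.pass_token()
```

In a real system the token would more plausibly move on a speaker's request or on voice activity than in strict rotation; the fixed order here only keeps the sketch short.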