Workshops of the International Conference on Computational Science and Its Applications, ICCSA 2025, İstanbul, Türkiye, 30 June - 3 July 2025, vol. 15890 LNCS, pp. 217-234 (Full Text Paper)
This paper addresses the growing need for seamless communication in multilingual video conferencing by presenting a novel, computationally efficient methodology for real-time video translation. While advances in neural networks have enabled accurate speech translation and voice cloning, integrating these with lip synchronization for realistic talking-head generation remains a challenge, particularly in real-time applications. We introduce a comprehensive video translation pipeline leveraging open-source deep learning models, and further propose a scalable system architecture incorporating a “Token Ring” mechanism to manage speaker turns and minimize computational load, addressing key challenges related to latency, scalability, and personalization in multilingual settings. A segmented batched processing protocol with inverse throughput thresholding and overlapping buffering is implemented to achieve near real-time performance. A simplified, universal prototype demonstrates the feasibility and efficacy of our approach, providing a foundation for building next-generation multilingual video conferencing systems. This work offers a practical framework for developers and businesses aiming to create inclusive and effective communication platforms.
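The two scheduling ideas named in the abstract can be illustrated with a minimal sketch. The class and function below are hypothetical illustrations, not the paper's implementation: a token ring that gates the heavy translation pipeline so only the current speaker's stream is processed, and a segmenter that cuts an audio stream into fixed-length chunks with overlap so consecutive batches share context at their boundaries.

```python
from collections import deque

class TokenRing:
    """Hypothetical sketch of a "Token Ring" turn manager: only the
    participant currently holding the token has their audio routed
    through the translation pipeline, so at most one speaker consumes
    the heavy compute at a time."""

    def __init__(self, participants):
        self.ring = deque(participants)  # circular speaking order

    @property
    def holder(self):
        # The participant at the front of the deque holds the token.
        return self.ring[0]

    def pass_token(self):
        # Rotate the ring: the current holder yields to the next participant.
        self.ring.rotate(-1)
        return self.holder

    def should_translate(self, participant):
        # Gate the pipeline: run translation only for the token holder.
        return participant == self.holder


def segment_with_overlap(samples, segment_len, overlap):
    """Split a sample stream into segments of `segment_len` that overlap
    by `overlap` samples, so each batch carries boundary context from
    its predecessor (an assumed form of the overlapping buffering)."""
    hop = segment_len - overlap
    return [samples[i:i + segment_len]
            for i in range(0, len(samples) - overlap, hop)]
```

As a usage example, a ring of three participants starts with the first as token holder; `pass_token()` advances the turn, and `segment_with_overlap(range(10), 4, 2)` yields chunks `[0..3], [2..5], [4..7], [6..9]`, each sharing two samples with its neighbor.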