Can One Model Fit All? An Exploration of Wav2Lip’s Lip-Syncing Generalizability Across Culturally Distinct Languages


Rafiei Oskooei A., Yahsi E., Sungur M., Özçoban M. Ş.

24th International Conference on Computational Science and Its Applications (ICCSA 2024), Hanoi, Vietnam, 1-4 July 2024, vol. 14819 LNCS, pp. 149-164

  • Publication Type: Conference Paper / Full Text
  • Volume: 14819 LNCS
  • DOI: 10.1007/978-3-031-65282-0_10
  • City: Hanoi
  • Country: Vietnam
  • Page Numbers: 149-164
  • Keywords: Computer Vision, Deep Learning, Face-to-Face Translation, Generative AI, Lip Sync, Talking-face Generation
  • Yıldız Technical University Affiliated: Yes

Abstract

This study explores the potential of Wav2Lip, a state-of-the-art lip-sync model, in multilingual environments. We assess its performance in generating lip-synchronized videos for Turkish, Persian, and Arabic. The evaluation results suggest promising language independence for Wav2Lip, which achieves accuracy comparable to its performance on English. The study identifies a gap in research on lip-sync models for diverse languages and emphasizes the need for broader exploration. Additionally, we introduce a comprehensive Face-to-Face Translation workflow, outlining the fundamental components of a seamless cross-lingual communication system. This work highlights the importance of lip-sync models, and of Wav2Lip in particular, within such a system. By acknowledging current limitations and advocating for advances in real-time models and high-resolution datasets, this study lays the groundwork for Face-to-Face Translation systems that enable barrier-free communication.
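
The Face-to-Face Translation workflow referenced in the abstract chains speech recognition, machine translation, speech synthesis, and lip synchronization. The sketch below is a minimal illustration of that pipeline, not the authors' implementation: `transcribe`, `translate`, and `synthesize_speech` are hypothetical stubs standing in for arbitrary ASR, MT, and TTS backends, and the final step assumes the command-line interface of the public Wav2Lip reference repository.

```python
import subprocess
from pathlib import Path

def transcribe(video: Path, language: str) -> str:
    """Placeholder ASR stage; swap in a real speech recognizer."""
    raise NotImplementedError("plug in an ASR backend here")

def translate(text: str, source: str, target: str) -> str:
    """Placeholder MT stage; swap in a real translation model."""
    raise NotImplementedError("plug in an MT backend here")

def synthesize_speech(text: str, language: str) -> Path:
    """Placeholder TTS stage; swap in a real speech synthesizer."""
    raise NotImplementedError("plug in a TTS backend here")

def face_to_face_translate(video_in: Path, src_lang: str, tgt_lang: str,
                           video_out: Path) -> None:
    """Chain ASR -> MT -> TTS -> lip sync into one dubbing pipeline."""
    # 1. Recognize the source-language speech in the input video.
    transcript = transcribe(video_in, language=src_lang)

    # 2. Translate the transcript into the target language.
    translated = translate(transcript, source=src_lang, target=tgt_lang)

    # 3. Synthesize target-language audio for the translated text.
    dubbed_audio = synthesize_speech(translated, language=tgt_lang)

    # 4. Re-render the speaker's lip movements to match the dubbed audio,
    #    invoking the Wav2Lip reference CLI (flags assumed from the
    #    public repository's inference script).
    subprocess.run(
        ["python", "inference.py",
         "--checkpoint_path", "checkpoints/wav2lip_gan.pth",
         "--face", str(video_in),
         "--audio", str(dubbed_audio),
         "--outfile", str(video_out)],
        check=True,
    )
```

Each stage is deliberately decoupled so that, for example, the ASR or TTS backend can be replaced per language without touching the lip-sync step, which is the modularity the proposed workflow relies on.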