24th International Conference on Computational Science and Its Applications, ICCSA 2024, Ha-Noi, Vietnam, 1-4 July 2024, vol. 14819 LNCS, pp. 149-164
This study explores the potential of Wav2Lip, a state-of-the-art lip-sync model, in multilingual environments. We assess its performance in generating lip-synchronized videos for Turkish, Persian, and Arabic. The evaluation results suggest promising language independence for Wav2Lip, with accuracy comparable to English. The study identifies a gap in research on lip-sync models for languages other than English and emphasizes the need for broader exploration. Additionally, we introduce a comprehensive Face-to-Face Translation workflow, outlining the fundamental components of a seamless cross-lingual communication system. This work highlights the importance of lip-sync models and the potential of Wav2Lip within such a system. By acknowledging current limitations and advocating for advancements in real-time models and high-resolution datasets, this study lays the groundwork for the development of Face-to-Face Translation systems, fostering a future of barrier-free communication.