Multimodal LLM Guidance for Aligning Text-to-Image Generation


Yegin M. N., AMASYALI M. F.

32nd International Conference on Neural Information Processing, ICONIP 2025, Okinawa, Japan, 20 - 24 November 2025, vol.2753 CCIS, pp.388-402 (Full Text Paper)

  • Publication Type: Conference Paper / Full Text Paper
  • Volume: 2753 CCIS
  • DOI: 10.1007/978-981-95-4088-4_27
  • City: Okinawa
  • Country: Japan
  • Pages: pp.388-402
  • Keywords: Diffusion models, Large Language Models, Text-to-image
  • Yıldız Technical University Affiliated: Yes

Abstract

Text-to-image diffusion models have achieved great success in both research and applications. Despite these advantages, they often struggle to generate correct object relationships. Fortunately, large language models have proven their ability to understand text prompts and visual inputs; however, their full potential in text-to-image generation has yet to be explored. In this study, we introduce a new training-free pipeline that leverages several capabilities of Multimodal LLMs through carefully designed steps and achieves enhanced semantic compatibility in text-to-image generation. We show that our method significantly outperforms existing studies and even competes with cutting-edge giants, as illustrated in Fig. 1. We also introduce a new benchmark dataset containing multiple object relationships from real-life scenes.