Robust Change Captioning in Remote Sensing: SECOND-CC Dataset and MModalCC Framework

Karaca, Ali; Ozelbas, Muhammed; Berber, Saadettin; Karimli, Orkhan; Yıldırım, Tülay; Amasyalı, Mehmet

doi:10.1109/jstars.2025.3600613

Karaca A. C., Ozelbas E., Berber S., Karimli O., Yıldırım T., Amasyalı M. F.

IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 18, pp. 21494-21513, 2025 (SCI-Expanded, Scopus)

Publication Type: Article / Full Article
Volume: 18
Publication Date: 2025
DOI: 10.1109/jstars.2025.3600613
Journal Name: IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Aerospace Database, Aquatic Science & Fisheries Abstracts (ASFA), Compendex, Geobase, INSPEC, Directory of Open Access Journals, Civil Engineering Abstracts
Pages: pp. 21494-21513
Keywords: Change captioning, multimodal change captioning (MModalCC), remote sensing (RS) images
Affiliated with Yıldız Technical University: Yes

Abstract

Existing remote sensing image change captioning (RSICC) methods often fail under challenging conditions such as illumination differences, viewpoint changes, and blur effects, leading to inaccuracies, especially in no-change regions. Moreover, images acquired at different spatial resolutions and with registration errors tend to affect the captions. To address these issues, we introduce SECOND-CC, a novel RSICC dataset featuring high-resolution RGB image pairs, semantic segmentation maps, and diverse real-world scenarios. SECOND-CC contains 6041 pairs of bitemporal remote sensing images and 30,205 sentences describing the differences between the images. In addition, we propose MModalCC, a multimodal framework that integrates semantic and visual data using advanced attention mechanisms, including cross-modal cross attention and multimodal gated cross attention. We also adapt MModalCC to handle noisy semantic inputs by integrating a semantic change detector, improving its robustness for real-world applications. Comprehensive experiments show that MModalCC outperforms state-of-the-art RSICC methods, including RSICCformer, Chg2Cap, and PSNet, with improvements of +4.6% in BLEU-4 score and +9.6% in CIDEr score on the SECOND-CC dataset. Detailed ablation studies and attention visualizations further demonstrate its effectiveness and its ability to address the challenges of RSICC. MModalCC was further validated on the LEVIR-MCI benchmark, where it achieved an average S*_m score of 83.51, significantly outperforming previous state-of-the-art methods.
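
The abstract names gated cross attention as one of the fusion mechanisms but does not describe its internals. The following is a minimal, generic sketch of the idea, not the authors' implementation: visual tokens attend over semantic tokens with scaled dot-product attention, and a sigmoid gate controls how much of the attended semantic context is added back. The function names, the scalar gate, and the omission of learned projections are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: every query attends over all keys."""
    dim = len(keys[0])
    attended = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        attended.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return attended


def gated_cross_attention(visual, semantic, gate_logit=0.0):
    """Hypothetical fusion step: visual + sigmoid(gate_logit) * attended.

    In a trained model the gate would be learned; here it is a plain
    scalar so the effect of opening and closing it is easy to see.
    """
    attended = cross_attention(visual, semantic, semantic)
    gate = 1.0 / (1.0 + math.exp(-gate_logit))
    return [
        [v + gate * a for v, a in zip(v_tok, a_tok)]
        for v_tok, a_tok in zip(visual, attended)
    ]
```

With a strongly negative `gate_logit` the output stays close to the visual tokens, which is one plausible way a model could down-weight unreliable semantic maps; whether MModalCC does this is not stated in the abstract.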