IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2025 (SCI-Expanded)
Existing remote sensing image change captioning (RSICC) methods often fail under illumination differences, viewpoint changes, and blur, producing inaccurate captions, especially for no-change regions. Moreover, image pairs acquired at different spatial resolutions or affected by registration errors tend to degrade the captions. To address these issues, we introduce SECOND-CC, a novel RSICC dataset featuring high-resolution RGB image pairs, semantic segmentation maps, and diverse real-world scenarios. SECOND-CC contains 6,041 pairs of bitemporal remote sensing images and 30,205 sentences describing the differences between the images. We also propose MModalCC, a multimodal framework that integrates semantic and visual data using advanced attention mechanisms, including Cross-Modal Cross Attention and Multimodal Gated Cross Attention. We further adapt MModalCC to handle noisy semantic inputs by integrating a Semantic Change Detector, improving its robustness for real-world applications. Detailed ablation studies and attention visualizations demonstrate its effectiveness and its ability to address the challenges of RSICC. Comprehensive experiments show that MModalCC outperforms state-of-the-art RSICC methods, including RSICCformer, Chg2Cap, and PSNet, improving the BLEU4 score by 4.6% and the CIDEr score by 9.6% on the SECOND-CC dataset. MModalCC was further validated on the LEVIR-MCI benchmark, where it achieved an average S*m score of 83.51, significantly outperforming previous state-of-the-art methods. To facilitate future research, we will make our dataset and codebase publicly available at https://github.com/ChangeCapsInRS/SecondCC.
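As an illustration of the general idea behind gated cross attention between a visual and a semantic stream, a minimal sketch follows. It is not the MModalCC implementation: the function names, the single attention head, the absence of learned projections, and the scalar sigmoid gate are all assumptions chosen to keep the example small. Visual tokens act as queries, semantic tokens act as keys and values, and the gate scales how much of the attended semantic signal is added back to the visual features.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention with queries from one modality
    and keys/values from the other (single head, no projections)."""
    dim = len(keys[0])
    attended = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        attended.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return attended


def gated_cross_attention(visual, semantic, gate_logit):
    """Hypothetical gated fusion: visual + sigmoid(gate_logit) * attended.

    gate_logit stands in for a learned parameter (an assumption here).
    A strongly negative value nearly ignores the semantic stream, which
    is one way a model could discount noisy semantic inputs.
    """
    attended = cross_attention(visual, semantic, semantic)
    gate = 1.0 / (1.0 + math.exp(-gate_logit))
    return [
        [v + gate * a for v, a in zip(v_row, a_row)]
        for v_row, a_row in zip(visual, attended)
    ]


# Two visual tokens, one semantic token, feature dimension 2.
visual_tokens = [[1.0, 0.0], [0.0, 1.0]]
semantic_tokens = [[2.0, 4.0]]
fused = gated_cross_attention(visual_tokens, semantic_tokens, gate_logit=0.0)
```

With a single semantic token the attention weights are trivially 1, so the gate alone decides how strongly the semantic features shift each visual token. In a trained model the gate and the query, key, and value projections would be learned rather than fixed.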