BTCChat: Advancing Remote Sensing Bi-temporal Change Captioning with Multimodal Large Language Model
- URL: http://arxiv.org/abs/2509.05895v1
- Date: Sun, 07 Sep 2025 02:16:18 GMT
- Title: BTCChat: Advancing Remote Sensing Bi-temporal Change Captioning with Multimodal Large Language Model
- Authors: Yujie Li, Wenjia Xu, Yuanben Zhang, Zhiwei Wei, Mugen Peng
- Abstract summary: Bi-temporal satellite imagery supports critical applications such as urban development monitoring and disaster assessment. Previous methods process image pairs through direct concatenation, inadequately modeling temporal correlations and spatial semantic changes. We propose BTCChat, a multi-temporal MLLM with advanced bi-temporal change understanding capability.
- Score: 24.844748050706468
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Bi-temporal satellite imagery supports critical applications such as urban development monitoring and disaster assessment. Although powerful multimodal large language models (MLLMs) have been applied in bi-temporal change analysis, previous methods process image pairs through direct concatenation, inadequately modeling temporal correlations and spatial semantic changes. This deficiency hampers visual-semantic alignment in change understanding, thereby constraining the overall effectiveness of current approaches. To address this gap, we propose BTCChat, a multi-temporal MLLM with advanced bi-temporal change understanding capability. BTCChat supports bi-temporal change captioning and retains single-image interpretation capability. To better capture temporal features and spatial semantic changes in image pairs, we design a Change Extraction module. Moreover, to enhance the model's attention to spatial details, we introduce a Prompt Augmentation mechanism, which incorporates contextual clues into the prompt to improve model performance. Experimental results demonstrate that BTCChat achieves state-of-the-art performance on change captioning and visual question answering tasks.
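The abstract names a Change Extraction module but does not spell out its internals here, so the following is only a minimal PyTorch sketch of one plausible realization, assuming ViT-style visual tokens per timestamp and a cross-attention-plus-feature-difference fusion; the class name ChangeExtraction, the dimensions, and the fusion MLP are illustrative assumptions rather than the published implementation.

```python
# Minimal sketch of a bi-temporal change-extraction block (illustrative only;
# not the authors' published implementation). Assumes each satellite image has
# already been encoded into visual tokens of shape (batch, num_tokens, dim).
import torch
import torch.nn as nn


class ChangeExtraction(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Cross-attention lets tokens from one timestamp attend to the other,
        # modeling temporal correlations between the image pair.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # A small MLP fuses the attended features with the explicit feature
        # difference, which emphasizes spatial semantic changes.
        self.fuse = nn.Sequential(
            nn.LayerNorm(3 * dim),
            nn.Linear(3 * dim, dim),
            nn.GELU(),
            nn.Linear(dim, dim),
        )

    def forward(self, feat_t1: torch.Tensor, feat_t2: torch.Tensor) -> torch.Tensor:
        # feat_t1, feat_t2: (B, N, D) visual tokens for the two timestamps.
        attended, _ = self.cross_attn(query=feat_t2, key=feat_t1, value=feat_t1)
        diff = feat_t2 - feat_t1  # coarse change cue
        change_tokens = self.fuse(torch.cat([attended, diff, feat_t2], dim=-1))
        return change_tokens  # (B, N, D) change-aware tokens for the LLM


if __name__ == "__main__":
    t1 = torch.randn(2, 196, 768)
    t2 = torch.randn(2, 196, 768)
    print(ChangeExtraction()(t1, t2).shape)  # torch.Size([2, 196, 768])
```

In a design like this, the resulting change-aware tokens would be passed to the language model together with the text prompt, which could itself carry extra contextual clues in the spirit of the Prompt Augmentation mechanism described above.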
Related papers
- Towards Understanding Multimodal Fine-Tuning: Spatial Features [25.349396112139214]
Vision-Language Models (VLMs) achieve strong performance on a wide range of tasks by pairing a vision encoder with a pre-trained language model.
We present the first mechanistic analysis of VLM adaptation using stage-wise model diffing.
arXiv Detail & Related papers (2026-02-06T18:48:18Z)
- GLAD: Generative Language-Assisted Visual Tracking for Low-Semantic Templates [48.65964582402597]
Vision-language tracking has gained increasing attention in many scenarios.
Current vision-language trackers usually employ Transformer architectures for interactive integration of template, search, and text features.
We introduce a pioneering Generative Language-AssisteD tracking model, GLAD, which utilizes diffusion models for the generative multi-modal fusion of text description and template image.
arXiv Detail & Related papers (2026-01-31T07:24:56Z)
- DeltaVLM: Interactive Remote Sensing Image Change Analysis via Instruction-guided Difference Perception [0.846600473226587]
We introduce remote sensing image change analysis (RSICA) as a new paradigm that combines the strengths of change detection and visual question answering.
We propose DeltaVLM, an end-to-end architecture tailored for interactive RSICA.
DeltaVLM features three innovations: (1) a fine-tuned bi-temporal vision encoder to capture temporal differences; (2) a visual difference perception module with a cross-semantic relation measuring mechanism to interpret changes; and (3) an instruction-guided Q-former to effectively extract query-relevant difference information.
arXiv Detail & Related papers (2025-07-30T03:14:27Z)
- TAMMs: Temporal-Aware Multimodal Model for Satellite Image Change Understanding and Forecasting [8.914172086217185]
We study the capabilities of multimodal large language models (MLLMs) on a novel task that jointly targets temporal change understanding and future scene generation.
We propose TAMMs, a Temporal-Aware Multimodal Model for satellite image understanding and forecasting.
arXiv Detail & Related papers (2025-06-23T17:26:16Z)
- Multimodal LLM-Guided Semantic Correction in Text-to-Image Diffusion [52.315729095824906]
MLLM Semantic-Corrected Ping-Pong-Ahead Diffusion (PPAD) is a novel framework that introduces a Multimodal Large Language Model (MLLM) as a semantic observer during inference.
It performs real-time analysis on intermediate generations, identifies latent semantic inconsistencies, and translates feedback into controllable signals that actively guide the remaining denoising steps.
Extensive experiments demonstrate PPAD's significant improvements.
arXiv Detail & Related papers (2025-05-26T14:42:35Z)
- ViLa-MIL: Dual-scale Vision-Language Multiple Instance Learning for Whole Slide Image Classification [52.405499816861635]
Multiple instance learning (MIL)-based frameworks have become the mainstream approach for processing whole slide images (WSIs).
We propose a dual-scale vision-language multiple instance learning (ViLa-MIL) framework for whole slide image classification.
arXiv Detail & Related papers (2025-02-12T13:28:46Z)
- Semantic-Guided Multimodal Sentiment Decoding with Adversarial Temporal-Invariant Learning [22.54577327204281]
Multimodal sentiment analysis aims to learn representations from different modalities to identify human emotions.
Existing works often neglect the frame-level redundancy inherent in continuous time series, resulting in incomplete modality representations with noise.
We propose temporal-invariant learning for the first time, which constrains the distributional variations over time steps to effectively capture long-term temporal dynamics.
arXiv Detail & Related papers (2024-08-30T03:28:40Z)
- Semantic-CC: Boosting Remote Sensing Image Change Captioning via Foundational Knowledge and Semantic Guidance [19.663899648983417]
We introduce a novel change captioning (CC) method based on foundational knowledge and semantic guidance.
We validate the proposed method on the LEVIR-CC and LEVIR-CD datasets.
arXiv Detail & Related papers (2024-07-19T05:07:41Z)
- Transformer for Multitemporal Hyperspectral Image Unmixing [17.365895881435563]
We propose the Multitemporal Hyperspectral Image Unmixing Transformer (MUFormer), an end-to-end unsupervised deep learning model.
We introduce two key modules: the Global Awareness Module (GAM) and the Change Enhancement Module (CEM).
The synergy between these modules allows for capturing semantic information regarding endmember and abundance changes.
arXiv Detail & Related papers (2024-07-15T04:02:01Z)
- Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing [53.89917396428747]
Self-supervised learning in vision-language processing exploits semantic alignment between imaging and text modalities.
We explicitly account for prior images and reports when available during both training and fine-tuning.
Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model.
arXiv Detail & Related papers (2023-01-11T16:35:33Z)
- Slow-Fast Visual Tempo Learning for Video-based Action Recognition [78.3820439082979]
Action visual tempo characterizes the dynamics and the temporal scale of an action.
Previous methods capture the visual tempo either by sampling raw videos with multiple rates, or by hierarchically sampling backbone features.
We propose a Temporal Correlation Module (TCM) to effectively extract action visual tempo from low-level, single-layer backbone features.
arXiv Detail & Related papers (2022-02-24T14:20:04Z)
- Enhanced Modality Transition for Image Captioning [51.72997126838352]
We build a Modality Transition Module (MTM) to transfer visual features into semantic representations before forwarding them to the language model.
During the training phase, the modality transition network is optimised by the proposed modality loss.
Experiments have been conducted on the MS-COCO dataset, demonstrating the effectiveness of the proposed framework.
arXiv Detail & Related papers (2021-02-23T07:20:12Z)
- TIME: Text and Image Mutual-Translation Adversarial Networks [55.1298552773457]
We propose Text and Image Mutual-Translation Adversarial Networks (TIME).
TIME learns a T2I generator G and an image captioning discriminator D under the Generative Adversarial Network framework.
In experiments, TIME achieves state-of-the-art (SOTA) performance on the CUB and MS-COCO datasets.
arXiv Detail & Related papers (2020-05-27T06:40:12Z)