Change3D: Revisiting Change Detection and Captioning from A Video Modeling Perspective
- URL: http://arxiv.org/abs/2503.18803v1
- Date: Mon, 24 Mar 2025 15:48:07 GMT
- Title: Change3D: Revisiting Change Detection and Captioning from A Video Modeling Perspective
- Authors: Duowang Zhu, Xiaohu Huang, Haiyan Huang, Hao Zhou, Zhenfeng Shao
- Abstract summary: We present Change3D, a framework that reconceptualizes the change detection and captioning tasks through video modeling. By integrating learnable perception frames between the bi-temporal images, a video encoder enables the perception frames to interact with the images directly and perceive their differences.
- Score: 14.128228451821712
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present Change3D, a framework that reconceptualizes the change detection and captioning tasks through video modeling. Recent methods have achieved remarkable success by regarding each pair of bi-temporal images as separate frames. They employ a shared-weight image encoder to extract spatial features and then use a change extractor to capture differences between the two images. However, image feature encoding, being a task-agnostic process, cannot attend to changed regions effectively. Furthermore, different change extractors designed for various change detection and captioning tasks make it difficult to have a unified framework. To tackle these challenges, Change3D regards the bi-temporal images as comprising two frames akin to a tiny video. By integrating learnable perception frames between the bi-temporal images, a video encoder enables the perception frames to interact with the images directly and perceive their differences. Therefore, we can get rid of the intricate change extractors, providing a unified framework for different change detection and captioning tasks. We verify Change3D on multiple tasks, encompassing change detection (including binary change detection, semantic change detection, and building damage assessment) and change captioning, across eight standard benchmarks. Without bells and whistles, this simple yet effective framework can achieve superior performance with an ultra-light video model comprising only ~6%-13% of the parameters and ~8%-34% of the FLOPs compared to state-of-the-art methods. We hope that Change3D could be an alternative to 2D-based models and facilitate future research.
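The core mechanism is easy to illustrate in code. Below is a minimal PyTorch sketch of the perception-frame idea: the two temporal images and a set of learnable frames are stacked into a tiny video, passed through a 3D encoder, and change features are read off the perception-frame positions. The class name, the toy two-layer 3D-convolution encoder, and the fixed 256x256 frame size are illustrative assumptions for this sketch, not the authors' implementation; the paper only specifies an ultra-light video model.

```python
import torch
import torch.nn as nn


class Change3DSketch(nn.Module):
    """Toy illustration: treat [t1, perception frames, t2] as a tiny video."""

    def __init__(self, channels=3, feat_dim=64, num_perception_frames=1, size=256):
        super().__init__()
        self.n_pf = num_perception_frames
        # Learnable perception frames, shared across the batch and interleaved
        # between the two temporal images (fixed spatial size for simplicity).
        self.perception_frames = nn.Parameter(
            0.02 * torch.randn(num_perception_frames, channels, size, size)
        )
        # Stand-in for the ultra-light video encoder: two plain 3D convs here.
        self.video_encoder = nn.Sequential(
            nn.Conv3d(channels, feat_dim, kernel_size=3, padding=1),
            nn.BatchNorm3d(feat_dim),
            nn.ReLU(inplace=True),
            nn.Conv3d(feat_dim, feat_dim, kernel_size=3, padding=1),
        )

    def forward(self, img_t1, img_t2):
        # img_t1, img_t2: (B, C, H, W) bi-temporal images, H = W = size.
        b = img_t1.size(0)
        pf = self.perception_frames.unsqueeze(0).expand(b, -1, -1, -1, -1)
        # Stack along a new time axis T: (B, C, T, H, W) with T = 2 + n_pf.
        video = torch.stack([img_t1, *pf.unbind(dim=1), img_t2], dim=2)
        feats = self.video_encoder(video)  # (B, feat_dim, T, H, W)
        # The perception-frame slots now carry the perceived differences;
        # a task head (change mask, caption decoder, ...) would decode these.
        return feats[:, :, 1:1 + self.n_pf]


model = Change3DSketch()
t1 = torch.randn(2, 3, 256, 256)
t2 = torch.randn(2, 3, 256, 256)
change_feats = model(t1, t2)  # torch.Size([2, 64, 1, 256, 256])
```

Because the perception frames interact with both images inside the shared video encoder, the same readout serves every downstream task; this is what lets the framework drop task-specific change extractors.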
Related papers
- Gaussian Difference: Find Any Change Instance in 3D Scenes [6.726861447317518]
Instance-level change detection in 3D scenes presents significant challenges. This paper introduces a novel approach for detecting changes in real-world scenarios. Our approach offers improved detection accuracy, robustness to lighting changes, and efficient processing times.
arXiv Detail & Related papers (2025-02-24T08:09:49Z)
- Multi-View Pose-Agnostic Change Localization with Zero Labels [4.997375878454274]
We propose a label-free, pose-agnostic change detection method that integrates information from multiple viewpoints. With as few as 5 images of the post-change scene, our approach can learn an additional change channel in a 3DGS. Our change-aware 3D scene representation additionally enables the generation of accurate change masks for unseen viewpoints.
arXiv Detail & Related papers (2024-12-05T06:28:54Z)
- Enhancing Perception of Key Changes in Remote Sensing Image Change Captioning [49.24306593078429]
We propose KCFI, a novel framework for remote sensing image change captioning guided by key change features and instruction tuning.
KCFI includes a ViT encoder for extracting bi-temporal remote sensing image features, a key feature perceiver for identifying critical change areas, and a pixel-level change detection decoder.
To validate the effectiveness of our approach, we compare it against several state-of-the-art change captioning methods on the LEVIR-CC dataset.
arXiv Detail & Related papers (2024-09-19T09:33:33Z)
- Zero-Shot Scene Change Detection [14.095215136905553]
Our method takes advantage of the change detection effect of the tracking model by inputting reference and query images instead of consecutive frames. We extend our approach to video, leveraging rich temporal information to enhance the performance of scene change detection.
arXiv Detail & Related papers (2024-06-17T05:03:44Z)
- Changes-Aware Transformer: Learning Generalized Changes Representation [56.917000244470174]
We propose a novel Changes-Aware Transformer (CAT) for refining difference features.
The generalized representation of various changes is learned straightforwardly in the difference feature space.
After refinement, the changed pixels in the difference feature space are closer to each other, which facilitates change detection.
arXiv Detail & Related papers (2023-09-24T12:21:57Z)
- The Change You Want to See (Now in 3D) [65.61789642291636]
The goal of this paper is to detect what has changed, if anything, between two "in the wild" images of the same 3D scene.
We contribute a change detection model that is trained entirely on synthetic data and is class-agnostic.
We release a new evaluation dataset consisting of real-world image pairs with human-annotated differences.
arXiv Detail & Related papers (2023-08-21T01:59:45Z)
- Neighborhood Contrastive Transformer for Change Captioning [80.10836469177185]
We propose a neighborhood contrastive transformer to improve the model's perceiving ability for various changes under different scenes.
The proposed method achieves the state-of-the-art performance on three public datasets with different change scenarios.
arXiv Detail & Related papers (2023-03-06T14:39:54Z)
- Self-Pair: Synthesizing Changes from Single Source for Object Change Detection in Remote Sensing Imagery [6.586756080460231]
We train a change detector using two spatially unrelated images with corresponding semantic labels such as building.
We show that manipulating the source image as an after-image is crucial to the performance of change detection.
Our method outperforms existing methods based on single-temporal supervision.
arXiv Detail & Related papers (2022-12-20T13:26:42Z)
- The Change You Want to See [91.3755431537592]
Given two images of the same scene, being able to automatically detect the changes in them has practical applications in a variety of domains.
We tackle the change detection problem with the goal of detecting "object-level" changes in an image pair despite differences in their viewpoint and illumination.
arXiv Detail & Related papers (2022-09-28T18:10:09Z)
- Exploring Optical-Flow-Guided Motion and Detection-Based Appearance for Temporal Sentence Grounding [61.57847727651068]
Temporal sentence grounding aims to semantically localize a target segment in an untrimmed video according to a given sentence query.
Most previous works focus on learning frame-level features of each whole frame in the entire video, and directly match them with the textual information.
We propose a novel Motion- and Appearance-guided 3D Semantic Reasoning Network (MA3SRN), which incorporates optical-flow-guided motion-aware, detection-based appearance-aware, and 3D-aware object-level features.
arXiv Detail & Related papers (2022-03-06T13:57:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.