ChangeViT: Unleashing Plain Vision Transformers for Change Detection
- URL: http://arxiv.org/abs/2406.12847v1
- Date: Tue, 18 Jun 2024 17:59:08 GMT
- Title: ChangeViT: Unleashing Plain Vision Transformers for Change Detection
- Authors: Duowang Zhu, Xiaohu Huang, Haiyan Huang, Zhenfeng Shao, Qimin Cheng
- Abstract summary: ChangeViT is a framework that adopts a plain ViT backbone to enhance performance on large-scale changes.
The framework achieves state-of-the-art performance on three popular high-resolution datasets and one low-resolution dataset.
- Score: 3.582733645632794
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Change detection in remote sensing images is essential for tracking environmental changes on the Earth's surface. Despite the success of vision transformers (ViTs) as backbones in numerous computer vision applications, they remain underutilized in change detection, where convolutional neural networks (CNNs) continue to dominate due to their powerful feature extraction capabilities. In this paper, our study uncovers ViTs' unique advantage in discerning large-scale changes, a capability where CNNs fall short. Capitalizing on this insight, we introduce ChangeViT, a framework that adopts a plain ViT backbone to enhance performance on large-scale changes. This framework is supplemented by a detail-capture module that generates detailed spatial features and a feature injector that efficiently integrates fine-grained spatial information into high-level semantic learning. The feature integration ensures that ChangeViT excels in both detecting large-scale changes and capturing fine-grained details, providing comprehensive change detection across diverse scales. Without bells and whistles, ChangeViT achieves state-of-the-art performance on three popular high-resolution datasets (i.e., LEVIR-CD, WHU-CD, and CLCD) and one low-resolution dataset (i.e., OSCD), which underscores the unleashed potential of plain ViTs for change detection. Furthermore, thorough quantitative and qualitative analyses validate the efficacy of the introduced modules, solidifying the effectiveness of our approach. The source code is available at https://github.com/zhuduowang/ChangeViT.
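As a rough illustration of the pipeline the abstract describes (a plain ViT backbone, a convolutional detail-capture module, and a feature injector that fuses fine-grained detail into the ViT features before a change-map head), the following PyTorch sketch wires these pieces together. All module names, dimensions, and the fusion/decoding choices are assumptions made for illustration only; the authors' actual implementation is in the linked repository.
```python
# A minimal PyTorch sketch of the described pipeline: plain ViT backbone for
# large-scale change semantics, a small convolutional detail-capture module for
# fine-grained spatial features, and a cross-attention "feature injector" that
# fuses the two before a simple change-map head. Names, sizes, and the fusion
# scheme are illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DetailCapture(nn.Module):
    """Lightweight CNN that preserves fine-grained spatial detail."""
    def __init__(self, in_ch=3, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, dim, 3, stride=2, padding=1),
        )

    def forward(self, x):                       # (B, 3, H, W) -> (B, dim, H/8, W/8)
        return self.net(x)


class FeatureInjector(nn.Module):
    """Cross-attention: ViT tokens (queries) attend over detail tokens (keys/values)."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vit_tokens, detail_map):
        detail_tokens = detail_map.flatten(2).transpose(1, 2)   # (B, h*w, C)
        injected, _ = self.attn(vit_tokens, detail_tokens, detail_tokens)
        return self.norm(vit_tokens + injected)


class ChangeViTSketch(nn.Module):
    def __init__(self, dim=256, patch=16, img_size=256, depth=6, heads=8):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, (img_size // patch) ** 2, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.vit = nn.TransformerEncoder(layer, depth)          # "plain" ViT backbone
        self.detail = DetailCapture(dim=dim)
        self.inject = FeatureInjector(dim=dim, heads=heads)
        self.head = nn.Sequential(nn.Conv2d(2 * dim, dim, 3, padding=1),
                                  nn.ReLU(inplace=True),
                                  nn.Conv2d(dim, 1, 1))         # binary change logits

    def encode(self, x):
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2) + self.pos
        tokens = self.inject(self.vit(tokens), self.detail(x))
        b, n, c = tokens.shape
        s = int(n ** 0.5)
        return tokens.transpose(1, 2).reshape(b, c, s, s)

    def forward(self, img_t1, img_t2):
        f1, f2 = self.encode(img_t1), self.encode(img_t2)
        logits = self.head(torch.cat([f1, (f1 - f2).abs()], dim=1))
        return F.interpolate(logits, size=img_t1.shape[-2:], mode="bilinear",
                             align_corners=False)


if __name__ == "__main__":
    model = ChangeViTSketch()
    t1, t2 = torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 256)
    print(model(t1, t2).shape)                  # torch.Size([1, 1, 256, 256])
```
The key design point mirrored from the abstract is the direction of fusion: the ViT tokens carry coarse semantics and query the high-resolution detail features, so fine-grained spatial information is injected into the semantic stream rather than the other way around.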
Related papers
- Enhancing Perception of Key Changes in Remote Sensing Image Change Captioning [49.24306593078429]
We propose a novel framework for remote sensing image change captioning guided by key change features and instruction tuning (KCFI).
KCFI includes a ViTs encoder for extracting bi-temporal remote sensing image features, a key feature perceiver for identifying critical change areas, and a pixel-level change detection decoder.
To validate the effectiveness of our approach, we compare it against several state-of-the-art change captioning methods on the LEVIR-CC dataset.
arXiv Detail & Related papers (2024-09-19T09:33:33Z) - Exploring Self-Supervised Vision Transformers for Deepfake Detection: A Comparative Analysis [38.074487843137064]
This paper investigates the effectiveness of self-supervised pre-trained vision transformers (ViTs) compared to supervised pre-trained ViTs and conventional neural networks (ConvNets) for detecting facial deepfake images and videos.
It examines their potential for improved generalization and explainability, especially with limited training data.
By leveraging SSL ViTs with modest data and partial fine-tuning, we find they adapt comparably well to deepfake detection and offer explainability via the attention mechanism.
arXiv Detail & Related papers (2024-05-01T07:16:49Z) - ChangeBind: A Hybrid Change Encoder for Remote Sensing Change Detection [16.62779899494721]
Change detection (CD) is a fundamental task in remote sensing (RS) that aims to detect semantic changes in the same geographical region at different timestamps.
We propose an effective Siamese-based framework to encode the semantic changes occurring in the bi-temporal RS images.
arXiv Detail & Related papers (2024-04-26T17:47:14Z) - Explicit Change Relation Learning for Change Detection in VHR Remote Sensing Images [12.228675703851733]
We propose a network architecture for the explicit mining of change relation features.
The change features used for change detection should be divided into pre-change image features, post-change image features, and change relation features.
Our network outperforms existing advanced change detection networks in terms of F1, IoU, and OA (see the metric sketch after this list).
arXiv Detail & Related papers (2023-11-14T08:47:38Z) - DAT++: Spatially Dynamic Vision Transformer with Deformable Attention [87.41016963608067]
We present the Deformable Attention Transformer (DAT++), an efficient and effective vision backbone for visual recognition.
DAT++ achieves state-of-the-art results on various visual recognition benchmarks, with 85.9% ImageNet accuracy, 54.5 and 47.0 MS-COCO instance segmentation mAP, and 51.5 ADE20K semantic segmentation mIoU.
arXiv Detail & Related papers (2023-09-04T08:26:47Z) - EMS-Net: Efficient Multi-Temporal Self-Attention For Hyperspectral Change Detection [32.23764287942984]
We propose an Efficient Multi-temporal Self-attention Network (EMS-Net) for hyperspectral change detection.
EMS-Net cuts the redundancy of similar, change-free feature maps, computing efficient multi-temporal change information for a precise binary change map.
Experiments on two hyperspectral change detection datasets demonstrate the outstanding performance and validity of the proposed method.
arXiv Detail & Related papers (2023-03-24T02:11:22Z) - An Extendable, Efficient and Effective Transformer-based Object Detector [95.06044204961009]
We integrate Vision and Detection Transformers (ViDT) to construct an effective and efficient object detector.
ViDT introduces a reconfigured attention module to extend the recent Swin Transformer to be a standalone object detector.
We extend it to ViDT+ to support joint-task learning for object detection and instance segmentation.
arXiv Detail & Related papers (2022-04-17T09:27:45Z) - ViDT: An Efficient and Effective Fully Transformer-based Object Detector [97.71746903042968]
Detection transformers are the first fully end-to-end learning systems for object detection.
Vision transformers are the first fully transformer-based architecture for image classification.
In this paper, we integrate Vision and Detection Transformers (ViDT) to build an effective and efficient object detector.
arXiv Detail & Related papers (2021-10-08T06:32:05Z) - Intriguing Properties of Vision Transformers [114.28522466830374]
Vision transformers (ViT) have demonstrated impressive performance across various machine vision problems.
We systematically study this question via an extensive set of experiments and comparisons with a high-performing convolutional neural network (CNN).
We show that the effective features of ViTs are due to the flexible and dynamic receptive fields made possible by the self-attention mechanism.
arXiv Detail & Related papers (2021-05-21T17:59:18Z) - Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z) - DASNet: Dual attentive fully convolutional siamese networks for change detection of high resolution satellite images [17.839181739760676]
The research objective is to identify the change information of interest and filter out irrelevant change information as interference factors.
Recently, the rise of deep learning has provided new tools for change detection, which have yielded impressive results.
We propose a new method, namely, dual attentive fully convolutional Siamese networks (DASNet) for change detection in high-resolution images.
arXiv Detail & Related papers (2020-03-07T16:57:10Z)
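Several of the papers above evaluate binary change maps with F1, IoU, and OA (as noted in the entry on explicit change relation learning). The sketch below is a generic, self-contained way to compute these metrics with NumPy; it is an illustration, not evaluation code from any of the listed papers.
```python
# Standard binary change-detection metrics: F1, IoU of the "changed" class, and
# overall accuracy (OA), computed from a predicted mask and a ground-truth mask.
# Generic illustration only; not tied to any specific paper's evaluation code.
import numpy as np


def change_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """pred, gt: binary arrays of the same shape (1 = changed, 0 = unchanged)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()

    eps = 1e-8                                   # avoid division by zero
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    iou = tp / (tp + fp + fn + eps)              # IoU of the changed class
    oa = (tp + tn) / pred.size                   # overall accuracy
    return {"F1": float(f1), "IoU": float(iou), "OA": float(oa)}


if __name__ == "__main__":
    pred = np.array([[1, 1, 0], [0, 1, 0]])
    gt = np.array([[1, 0, 0], [0, 1, 1]])
    print(change_metrics(pred, gt))              # F1≈0.667, IoU=0.5, OA≈0.667
```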