RingMo-Aerial: An Aerial Remote Sensing Foundation Model With A Affine Transformation Contrastive Learning
- URL: http://arxiv.org/abs/2409.13366v1
- Date: Fri, 20 Sep 2024 10:03:14 GMT
- Title: RingMo-Aerial: An Aerial Remote Sensing Foundation Model With A Affine Transformation Contrastive Learning
- Authors: Wenhui Diao, Haichen Yu, Kaiyue Kang, Tong Ling, Di Liu, Yingchao Feng, Hanbo Bi, Libo Ren, Xuexue Li, Yongqiang Mao, Xian Sun,
- Abstract summary: This paper proposes the RingMo-Aerial model, aiming to fill the gap in foundation model research in the field of ARS vision.
The model's detection capability for small targets is enhanced and optimized for the tilted viewing angles characteristic of ARS.
Experimental results demonstrate that RingMo-Aerial achieves SOTA performance on multiple downstream tasks.
- Score: 12.442430013205131
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Aerial Remote Sensing (ARS) vision tasks pose significant challenges due to the unique characteristics of their viewing angles. Existing research has primarily focused on algorithms for specific tasks, which have limited applicability in a broad range of ARS vision applications. This paper proposes the RingMo-Aerial model, aiming to fill the gap in foundation model research in the field of ARS vision. By introducing the Frequency-Enhanced Multi-Head Self-Attention (FE-MSA) mechanism and an affine transformation-based contrastive learning pre-training method, the model's detection capability for small targets is enhanced and optimized for the tilted viewing angles characteristic of ARS. Furthermore, the ARS-Adapter, an efficient parameter fine-tuning method, is proposed to improve the model's adaptability and effectiveness in various ARS vision tasks. Experimental results demonstrate that RingMo-Aerial achieves SOTA performance on multiple downstream tasks. This indicates the practicality and effectiveness of RingMo-Aerial in enhancing the performance of ARS vision tasks.
Related papers
- Unleashing the Power of Task-Specific Directions in Parameter Efficient Fine-tuning [65.31677646659895]
This paper focuses on the concept of task-specific directions (TSDs)-critical for transitioning large models from pretrained states to task-specific enhancements in PEFT.
We introduce a novel approach, LoRA-Dash, which aims to maximize the impact of TSDs during the fine-tuning process, thereby enhancing model performance on targeted tasks.
arXiv Detail & Related papers (2024-09-02T08:10:51Z) - Integrated Dynamic Phenological Feature for Remote Sensing Image Land Cover Change Detection [5.109855690325439]
We introduce the InPhea model, which integrates phenological features into a remote sensing image CD framework.
A constrainer with four constraint modules and a multi-stage contrastive learning approach is employed to aid in the model's understanding of phenological characteristics.
Experiments on the HRSCD, SECD, and PSCD-Wuhan datasets reveal that InPhea outperforms other models.
arXiv Detail & Related papers (2024-08-08T01:07:28Z) - Parameter-Efficient Active Learning for Foundational models [7.799711162530711]
Foundational vision transformer models have shown impressive few shot performance on many vision tasks.
This research presents a novel investigation into the application of parameter efficient fine-tuning methods within an active learning (AL) framework.
arXiv Detail & Related papers (2024-06-13T16:30:32Z) - SOAR: Advancements in Small Body Object Detection for Aerial Imagery Using State Space Models and Programmable Gradients [0.8873228457453465]
Small object detection in aerial imagery presents significant challenges in computer vision.
Traditional methods using transformer-based models often face limitations stemming from the lack of specialized databases.
This paper introduces two innovative approaches that significantly enhance detection and segmentation capabilities for small aerial objects.
arXiv Detail & Related papers (2024-05-02T19:47:08Z) - FullLoRA-AT: Efficiently Boosting the Robustness of Pretrained Vision
Transformers [61.48709409150777]
Vision Transformer (ViT) model has gradually become mainstream in various computer vision tasks.
Existing large models tend to prioritize performance during training, potentially neglecting the robustness.
We develop a novel LNLoRA module, incorporating a learnable layer normalization before the conventional LoRA module.
We propose the FullLoRA-AT framework by integrating the learnable LNLoRA modules into all key components of ViT-based models.
arXiv Detail & Related papers (2024-01-03T14:08:39Z) - Innovative Horizons in Aerial Imagery: LSKNet Meets DiffusionDet for
Advanced Object Detection [55.2480439325792]
We present an in-depth evaluation of an object detection model that integrates the LSKNet backbone with the DiffusionDet head.
The proposed model achieves a mean average precision (MAP) of approximately 45.7%, which is a significant improvement.
This advancement underscores the effectiveness of the proposed modifications and sets a new benchmark in aerial image analysis.
arXiv Detail & Related papers (2023-11-21T19:49:13Z) - Learning Objective-Specific Active Learning Strategies with Attentive
Neural Processes [72.75421975804132]
Learning Active Learning (LAL) suggests to learn the active learning strategy itself, allowing it to adapt to the given setting.
We propose a novel LAL method for classification that exploits symmetry and independence properties of the active learning problem.
Our approach is based on learning from a myopic oracle, which gives our model the ability to adapt to non-standard objectives.
arXiv Detail & Related papers (2023-09-11T14:16:37Z) - Adapting Segment Anything Model for Change Detection in HR Remote
Sensing Images [18.371087310792287]
This work aims to utilize the strong visual recognition capabilities of Vision Foundation Models (VFMs) to improve the change detection of high-resolution Remote Sensing Images (RSIs)
We employ the visual encoder of FastSAM, an efficient variant of the SAM, to extract visual representations in RS scenes.
To utilize the semantic representations that are inherent to SAM features, we introduce a task-agnostic semantic learning branch to model the semantic latent in bi-temporal RSIs.
The resulting method, SAMCD, obtains superior accuracy compared to the SOTA methods and exhibits a sample-efficient learning ability that is comparable to semi-
arXiv Detail & Related papers (2023-09-04T08:23:31Z) - Understanding Self-attention Mechanism via Dynamical System Perspective [58.024376086269015]
Self-attention mechanism (SAM) is widely used in various fields of artificial intelligence.
We show that intrinsic stiffness phenomenon (SP) in the high-precision solution of ordinary differential equations (ODEs) also widely exists in high-performance neural networks (NN)
We show that the SAM is also a stiffness-aware step size adaptor that can enhance the model's representational ability to measure intrinsic SP.
arXiv Detail & Related papers (2023-08-19T08:17:41Z) - Learning Better with Less: Effective Augmentation for Sample-Efficient
Visual Reinforcement Learning [57.83232242068982]
Data augmentation (DA) is a crucial technique for enhancing the sample efficiency of visual reinforcement learning (RL) algorithms.
It remains unclear which attributes of DA account for its effectiveness in achieving sample-efficient visual RL.
This work conducts comprehensive experiments to assess the impact of DA's attributes on its efficacy.
arXiv Detail & Related papers (2023-05-25T15:46:20Z) - Object Detection in Aerial Images: What Improves the Accuracy? [9.857292888257144]
deep learning-based object detection approaches have been actively explored for the problem of object detection in aerial images.
In this work, we investigate the impact of Faster R-CNN for aerial object detection and explore numerous strategies to improve its performance for aerial images.
arXiv Detail & Related papers (2022-01-21T16:22:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.