Advancing Plain Vision Transformer Towards Remote Sensing Foundation Model
- URL: http://arxiv.org/abs/2208.03987v2
- Date: Wed, 10 Aug 2022 09:31:40 GMT
- Title: Advancing Plain Vision Transformer Towards Remote Sensing Foundation Model
- Authors: Di Wang, Qiming Zhang, Yufei Xu, Jing Zhang, Bo Du, Dacheng Tao and Liangpei Zhang
- Abstract summary: We resort to plain vision transformers with about 100 million parameters and make the first attempt to propose large vision models customized for remote sensing tasks.
Specifically, to handle the large image size and objects of various orientations in RS images, we propose a new rotated varied-size window attention.
Experiments on detection tasks demonstrate the superiority of our model over all state-of-the-art models, achieving 81.16% mAP on the DOTA-V1.0 dataset.
- Score: 97.9548609175831
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale vision foundation models have made significant progress on visual tasks with natural images, where vision transformers are the primary choice thanks to their good scalability and representation ability. However, the use of large models in the remote sensing (RS) community remains under-explored; existing models are still small-scale, which limits performance. In this paper, we resort to plain vision transformers with about 100 million parameters, make the first attempt to propose large vision models customized for RS tasks, and explore how such large models perform. Specifically, to handle the large image size and the variously oriented objects in RS images, we propose a new rotated varied-size window attention to substitute the original full attention in transformers, which can significantly reduce the computational cost and memory footprint while learning better object representations by extracting rich context from the generated diverse windows. Experiments on detection tasks demonstrate the superiority of our model over all state-of-the-art models, achieving 81.16% mAP on the DOTA-V1.0 dataset. The results of our models on downstream classification and segmentation tasks also demonstrate competitive performance compared with existing advanced methods. Further experiments show the advantages of our models in terms of computational complexity and few-shot learning.
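To make the rotated varied-size window attention idea more concrete, below is a minimal PyTorch sketch of the general mechanism the abstract describes: each fixed window predicts an offset, scale, and rotation; keys and values are resampled from the transformed window; and standard multi-head attention then runs inside each window. This is an illustrative sketch only, not the authors' released implementation; the class name, the parameter-prediction head, and the single-affine-transform-per-window simplification are assumptions made for clarity.

```python
# Conceptual sketch of rotated varied-size window attention (NOT the paper's
# official code). Assumes square windows and H, W divisible by window_size.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RotatedVariedSizeWindowAttention(nn.Module):
    """Queries stay in fixed windows; keys/values are resampled from learned
    rotated / rescaled / shifted versions of those windows."""

    def __init__(self, dim, window_size=8, num_heads=4):
        super().__init__()
        assert dim % num_heads == 0
        self.dim, self.ws, self.heads = dim, window_size, num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Conv2d(dim, dim * 3, kernel_size=1)
        # Predict per-window offset (2), scale (2) and rotation angle (1).
        self.win_params = nn.Sequential(
            nn.AvgPool2d(window_size),
            nn.Conv2d(dim, 5, kernel_size=1),
        )
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x):                                  # x: (B, C, H, W)
        B, C, H, W = x.shape
        ws, nh, nw = self.ws, H // self.ws, W // self.ws
        q, k, v = self.qkv(x).chunk(3, dim=1)

        # One (dx, dy, sx, sy, theta) per default window.
        p = self.win_params(x)                              # (B, 5, nh, nw)
        p = p.permute(0, 2, 3, 1).reshape(B * nh * nw, 5)
        dx, dy, sx, sy, theta = p.unbind(dim=-1)
        dx, dy = torch.tanh(dx), torch.tanh(dy)             # bounded offsets
        sx, sy = 1 + torch.tanh(sx), 1 + torch.tanh(sy)     # positive scales
        cos, sin = torch.cos(theta), torch.sin(theta)
        aff = torch.stack([                                  # (B*nh*nw, 2, 3)
            torch.stack([cos * sx, -sin * sy, dx], -1),
            torch.stack([sin * sx,  cos * sy, dy], -1),
        ], dim=1)

        def to_windows(t):                                   # -> (B*nh*nw, C, ws, ws)
            t = t.reshape(B, C, nh, ws, nw, ws)
            return t.permute(0, 2, 4, 1, 3, 5).reshape(B * nh * nw, C, ws, ws)

        qw = to_windows(q)
        # Sample keys/values from the rotated, varied-size windows.
        grid = F.affine_grid(aff, (B * nh * nw, C, ws, ws), align_corners=False)
        kw = F.grid_sample(to_windows(k), grid, align_corners=False)
        vw = F.grid_sample(to_windows(v), grid, align_corners=False)

        def heads(t):                                        # -> (N, heads, ws*ws, C//heads)
            return t.reshape(t.shape[0], self.heads, C // self.heads, ws * ws).transpose(-1, -2)

        # Standard multi-head attention inside each window.
        attn = (heads(qw) @ heads(kw).transpose(-1, -2)) * self.scale
        out = attn.softmax(dim=-1) @ heads(vw)
        out = out.transpose(-1, -2).reshape(B * nh * nw, C, ws, ws)
        out = out.reshape(B, nh, nw, C, ws, ws).permute(0, 3, 1, 4, 2, 5).reshape(B, C, H, W)
        return self.proj(out)


if __name__ == "__main__":
    attn = RotatedVariedSizeWindowAttention(dim=96, window_size=8, num_heads=4)
    y = attn(torch.randn(2, 96, 64, 64))                    # -> (2, 96, 64, 64)
    print(y.shape)
```

Under this sketch, attention is computed only among the tokens of each window rather than over the full image, which is where the reduction in computational cost and memory footprint claimed in the abstract comes from; the learned per-window transform is what lets the windows adapt their size and orientation to the objects they cover.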
Related papers
- Efficient Visual State Space Model for Image Deblurring [83.57239834238035]
Convolutional neural networks (CNNs) and Vision Transformers (ViTs) have achieved excellent performance in image restoration.
We propose a simple yet effective visual state space model (EVSSM) for image deblurring.
arXiv Detail & Related papers (2024-05-23T09:13:36Z)
- MTP: Advancing Remote Sensing Foundation Model via Multi-Task Pretraining [73.81862342673894]
Foundation models have reshaped the landscape of Remote Sensing (RS) by enhancing various image interpretation tasks.
Transferring the pretrained models to downstream tasks may encounter task discrepancy, since pretraining is formulated as an image classification or object discrimination task.
We conduct multi-task supervised pretraining on the SAMRS dataset, encompassing semantic segmentation, instance segmentation, and rotated object detection.
Our models are finetuned on various RS downstream tasks, such as scene classification, horizontal and rotated object detection, semantic segmentation, and change detection.
arXiv Detail & Related papers (2024-03-20T09:17:22Z)
- VmambaIR: Visual State Space Model for Image Restoration [36.11385876754612]
We propose VmambaIR, which introduces State Space Models (SSMs) with linear complexity into comprehensive image restoration tasks.
VmambaIR achieves state-of-the-art (SOTA) performance with much fewer computational resources and parameters.
arXiv Detail & Related papers (2024-03-18T02:38:55Z)
- Heterogeneous Generative Knowledge Distillation with Masked Image Modeling [33.95780732124864]
Masked image modeling (MIM) methods achieve great success in various visual tasks but remain largely unexplored in knowledge distillation for heterogeneous deep models.
We develop the first Heterogeneous Generative Knowledge Distillation (H-GKD) based on MIM, which can efficiently transfer knowledge from large Transformer models to small CNN-based models in a generative self-supervised fashion.
Our method is a simple yet effective learning paradigm to learn the visual representation and distribution of data from heterogeneous teacher models.
arXiv Detail & Related papers (2023-09-18T08:30:55Z)
- ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond [76.35955924137986]
We propose ViTAE, a Vision Transformer Advanced by Exploring intrinsic inductive bias (IB) from convolutions.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain state-of-the-art classification performance, i.e., 88.5% Top-1 accuracy on the ImageNet validation set and the best 91.2% Top-1 accuracy on the ImageNet Real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z)
- Scaling Vision Transformers [82.08465256393514]
We study how Vision Transformers scale and characterize the relationships between error rate, data, and compute.
We train a ViT model with two billion parameters, which attains a new state-of-the-art on ImageNet of 90.45% top-1 accuracy.
The model also performs well on few-shot learning, for example, attaining 84.86% top-1 accuracy on ImageNet with only 10 examples per class.
arXiv Detail & Related papers (2021-06-08T17:47:39Z)
- FoveaTer: Foveated Transformer for Image Classification [8.207403859762044]
We propose the Foveated Transformer (FoveaTer) model, which uses pooling regions and saccadic movements to perform object classification tasks.
We construct an ensemble of our proposed model and an unfoveated model, achieving accuracy only 1.36% below the unfoveated model with 22% computational savings.
arXiv Detail & Related papers (2021-05-29T01:54:33Z)
- Visformer: The Vision-friendly Transformer [105.52122194322592]
We propose a new architecture named Visformer, which is abbreviated from 'Vision-friendly Transformer'.
With the same computational complexity, Visformer outperforms both the Transformer-based and convolution-based models in terms of ImageNet classification accuracy.
arXiv Detail & Related papers (2021-04-26T13:13:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.