AerialFormer: Multi-resolution Transformer for Aerial Image Segmentation
- URL: http://arxiv.org/abs/2306.06842v2
- Date: Sun, 1 Oct 2023 17:04:35 GMT
- Title: AerialFormer: Multi-resolution Transformer for Aerial Image Segmentation
- Authors: Kashu Yamazaki, Taisei Hanyu, Minh Tran, Adrian de Luis, Roy McCann,
Haitao Liao, Chase Rainwater, Meredith Adkins, Jackson Cothren, Ngan Le
- Abstract summary: We propose AerialFormer, which unifies Transformers on the contracting path with lightweight Multi-Dilated Convolutional Neural Networks (MD-CNNs) on the expanding path.
AerialFormer is designed as a hierarchical structure in which the Transformer encoder outputs multi-scale features and the MD-CNN decoder aggregates information across scales.
We have benchmarked AerialFormer on three common datasets including iSAID, LoveDA, and Potsdam.
- Score: 7.415370401064414
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Aerial image segmentation is semantic segmentation from a top-down
perspective, and it poses several challenging characteristics: a strong
imbalance in the foreground-background distribution, complex backgrounds,
intra-class heterogeneity, inter-class homogeneity, and tiny objects. To handle
these problems, we inherit the advantages of Transformers and propose
AerialFormer, which unifies Transformers on the contracting path with
lightweight Multi-Dilated Convolutional Neural Networks (MD-CNNs) on the
expanding path. AerialFormer is designed as a hierarchical structure in which
the Transformer encoder outputs multi-scale features and the MD-CNN decoder
aggregates information across scales. It thus takes both local and global
context into account to render powerful representations and high-resolution
segmentation. We have benchmarked AerialFormer on three common datasets:
iSAID, LoveDA, and Potsdam. Comprehensive experiments and extensive ablation
studies show that the proposed AerialFormer outperforms previous
state-of-the-art methods by a remarkable margin. Our source code will be made
publicly available upon acceptance.
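The decoder described above aggregates context through parallel convolutions at multiple dilation rates. The paper's exact kernel sizes, dilation rates, and fusion rule are not given in this abstract, so the following is a minimal 1-D pure-Python sketch of the multi-dilation idea; the function names and the rates (1, 2, 4) are illustrative assumptions, not the paper's configuration:

```python
def dilated_conv1d(x, kernel, dilation):
    """Valid 1-D convolution of sequence x with the given dilation rate:
    kernel taps are spaced `dilation` positions apart."""
    k = len(kernel)
    span = (k - 1) * dilation  # receptive field minus one
    return [
        sum(kernel[j] * x[i + j * dilation] for j in range(k))
        for i in range(len(x) - span)
    ]

def multi_dilated(x, kernel, rates=(1, 2, 4)):
    """Run parallel dilated convolutions and sum their centre-aligned
    outputs, mimicking a multi-dilated aggregation block."""
    outs = [dilated_conv1d(x, kernel, r) for r in rates]
    n = min(len(o) for o in outs)
    # centre-crop each branch to the shortest output, then sum across branches
    cropped = [o[(len(o) - n) // 2:(len(o) - n) // 2 + n] for o in outs]
    return [sum(vals) for vals in zip(*cropped)]
```

Larger dilation rates widen the receptive field without adding parameters, which is why such blocks are a cheap way for a lightweight CNN decoder to gather context at several scales at once.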
Related papers
- Pyramid Hierarchical Transformer for Hyperspectral Image Classification [1.9427851979929982]
We propose a pyramid-based hierarchical transformer (PyFormer)
This innovative approach organizes input data hierarchically into segments, each representing distinct abstraction levels.
Results underscore the superiority of the proposed method over traditional approaches.
arXiv Detail & Related papers (2024-04-23T11:41:19Z)
- Multi-view Aggregation Network for Dichotomous Image Segmentation [76.75904424539543]
Dichotomous Image Segmentation (DIS) has recently emerged as a task targeting high-precision object segmentation from high-resolution natural images.
Existing methods rely on tedious multiple encoder-decoder streams and stages to gradually complete the global localization and local refinement.
Motivated by this, we model DIS as a multi-view object perception problem and propose a parsimonious multi-view aggregation network (MVANet).
Experiments on the popular DIS-5K dataset show that our MVANet significantly outperforms state-of-the-art methods in both accuracy and speed.
arXiv Detail & Related papers (2024-04-11T03:00:00Z)
- Learning transformer-based heterogeneously salient graph representation for multimodal remote sensing image classification [42.15709954199397]
A transformer-based heterogeneously salient graph representation (THSGR) approach is proposed in this paper.
First, a multimodal heterogeneous graph encoder is presented to encode distinctively non-Euclidean structural features from heterogeneous data.
A self-attention-free multi-convolutional modulator is designed for effective and efficient long-term dependency modeling.
arXiv Detail & Related papers (2023-11-17T04:06:20Z)
- Deep Diversity-Enhanced Feature Representation of Hyperspectral Images [87.47202258194719]
We rectify 3D convolution by modifying its topology to enhance the rank upper-bound.
We also propose a novel diversity-aware regularization (DA-Reg) term that acts on the feature maps to maximize independence among elements.
To demonstrate the superiority of the proposed Re$3$-ConvSet and DA-Reg, we apply them to various HS image processing and analysis tasks.
arXiv Detail & Related papers (2023-01-15T16:19:18Z)
- ConvFormer: Combining CNN and Transformer for Medical Image Segmentation [17.88894109620463]
We propose a hierarchical CNN and Transformer hybrid architecture, called ConvFormer, for medical image segmentation.
Our ConvFormer, trained from scratch, outperforms various CNN- or Transformer-based architectures, achieving state-of-the-art performance.
arXiv Detail & Related papers (2022-11-15T23:11:22Z)
- Multi-scale and Cross-scale Contrastive Learning for Semantic Segmentation [5.281694565226513]
We apply contrastive learning to enhance the discriminative power of the multi-scale features extracted by semantic segmentation networks.
By first mapping the encoder's multi-scale representations to a common feature space, we instantiate a novel form of supervised local-global constraint.
arXiv Detail & Related papers (2022-03-25T01:24:24Z)
- Rich CNN-Transformer Feature Aggregation Networks for Super-Resolution [50.10987776141901]
Recent vision transformers along with self-attention have achieved promising results on various computer vision tasks.
We introduce an effective hybrid architecture for super-resolution (SR) tasks, which leverages local features from CNNs and long-range dependencies captured by transformers.
Our proposed method achieves state-of-the-art SR results on numerous benchmark datasets.
arXiv Detail & Related papers (2022-03-15T06:52:25Z)
- CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that integrates the advantages of leveraging detailed spatial information from CNN and the global context provided by transformer for enhanced representation learning.
The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery.
The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
arXiv Detail & Related papers (2021-12-31T04:37:11Z)
- HAT: Hierarchical Aggregation Transformers for Person Re-identification [87.02828084991062]
We take advantage of both CNNs and Transformers for image-based person Re-ID with high performance.
This work is the first to combine the strengths of CNNs and Transformers for image-based person Re-ID.
arXiv Detail & Related papers (2021-07-13T09:34:54Z)
- Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
The Less attention vIsion Transformer (LIT) builds upon the fact that convolutions, fully-connected layers, and self-attention have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z)
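Several of the hybrid architectures listed above hinge on the observation stated most directly in the LIT summary: on a sequence of image patches, a fully-connected layer applied per token and a convolution with spatial kernel size 1 compute exactly the same thing. A minimal pure-Python check (the weight values are illustrative, not from any of the papers):

```python
def fc_per_token(tokens, weight):
    """Apply one fully-connected layer independently to each token.
    tokens: list of d-dim vectors; weight: d_out rows of length d."""
    return [
        [sum(w * t for w, t in zip(row, tok)) for row in weight]
        for tok in tokens
    ]

def conv1x1(tokens, weight):
    """A convolution with spatial kernel size 1 over the token sequence:
    it mixes channels only, so it reduces to the same matrix multiply
    applied at every position."""
    k = 1  # spatial kernel size
    return [
        [sum(row[c] * tokens[i][c] for c in range(len(row))) for row in weight]
        for i in range(len(tokens) - (k - 1))
    ]
```

Because the two operations coincide, the design question in these papers is not conv-versus-FC but where genuine spatial mixing (larger kernels or self-attention) is actually worth its cost.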
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.