All Grains, One Scheme (AGOS): Learning Multi-grain Instance
Representation for Aerial Scene Classification
- URL: http://arxiv.org/abs/2205.03371v1
- Date: Fri, 6 May 2022 17:10:44 GMT
- Title: All Grains, One Scheme (AGOS): Learning Multi-grain Instance
Representation for Aerial Scene Classification
- Authors: Qi Bi, Beichen Zhou, Kun Qin, Qinghao Ye, Gui-Song Xia
- Abstract summary: We propose a novel all grains, one scheme (AGOS) framework to tackle these challenges.
It consists of a multi-grain perception module (MGP), a multi-branch multi-instance representation module (MBMIR) and a self-aligned semantic fusion (SSF) module.
Our AGOS is flexible and can be easily adapted to existing CNNs in a plug-and-play manner.
- Score: 31.412401135677744
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Aerial scene classification remains challenging because: 1) the size of the key
objects that determine the scene scheme varies greatly; 2) the image is often
flooded with objects irrelevant to the scene scheme. Hence, how to
effectively perceive regions of interest (RoIs) across a variety of sizes and
build a more discriminative representation from such a complicated object
distribution is vital to understanding an aerial scene. In this paper, we propose
a novel all grains, one scheme (AGOS) framework to tackle these challenges. To
the best of our knowledge, it is the first work to extend classic multiple
instance learning into a multi-grain formulation. Specifically, it consists of a
multi-grain perception module (MGP), a multi-branch multi-instance
representation module (MBMIR) and a self-aligned semantic fusion (SSF) module.
Firstly, our MGP preserves the differential dilated convolutional features from
the backbone, which magnifies the discriminative information across multiple grains.
Then, our MBMIR highlights the key instances in the multi-grain representation
under the MIL formulation. Finally, our SSF allows the framework to learn the
same scene scheme from the multi-grain instance representations and fuses them, so
that the entire framework is optimized as a whole. Notably, our AGOS is
flexible and can easily be adapted to existing CNNs in a plug-and-play manner.
Extensive experiments on the UCM, AID and NWPU benchmarks demonstrate that AGOS
achieves performance comparable to the state-of-the-art methods.
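To make the pipeline concrete, below is a minimal PyTorch sketch of the three modules as described in the abstract. The module names follow the paper, but the dilation rates, the attention-based MIL pooling, and the simple averaging standing in for SSF are our assumptions, not the authors' exact design.

```python
# A minimal PyTorch sketch of the AGOS idea as described in the abstract.
# Module names follow the paper (MGP, MBMIR-style branches, SSF); layer
# sizes, the attention MIL pooling, and the averaging fusion are assumptions.
import torch
import torch.nn as nn

class MGP(nn.Module):
    """Multi-grain perception: dilated convs at several rates; keep the
    differences between adjacent grains to magnify grain-specific cues."""
    def __init__(self, channels, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d)
            for d in dilations
        )

    def forward(self, x):
        feats = [b(x) for b in self.branches]
        # Differential features: subtract the previous (finer) grain.
        return [feats[0]] + [feats[i] - feats[i - 1] for i in range(1, len(feats))]

class MILBranch(nn.Module):
    """One multi-instance branch: treat spatial positions as instances and
    pool them with attention weights (a common MIL pooling choice)."""
    def __init__(self, channels, num_classes):
        super().__init__()
        self.score = nn.Conv2d(channels, num_classes, 1)   # instance scores
        self.attn = nn.Conv2d(channels, 1, 1)              # instance weights

    def forward(self, x):
        s = self.score(x)                                   # (B, C, H, W)
        w = torch.softmax(self.attn(x).flatten(2), dim=-1)  # (B, 1, HW)
        return (s.flatten(2) * w).sum(-1)                   # (B, C) bag logits

class AGOSHead(nn.Module):
    """Plug-and-play head: MGP -> per-grain MIL branches -> fused logits.
    SSF is approximated here by plain averaging; the paper additionally
    aligns all grains to the same scene scheme during training."""
    def __init__(self, channels, num_classes, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.mgp = MGP(channels, dilations)
        self.branches = nn.ModuleList(
            MILBranch(channels, num_classes) for _ in dilations
        )

    def forward(self, x):
        grain_logits = [br(f) for br, f in zip(self.branches, self.mgp(x))]
        return torch.stack(grain_logits).mean(0)            # fused prediction

# Usage on backbone features, e.g. a ResNet stage output:
head = AGOSHead(channels=512, num_classes=30)
logits = head(torch.randn(2, 512, 14, 14))                  # (2, 30)
```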
Related papers
- Revisiting the Integration of Convolution and Attention for Vision Backbone [59.50256661158862]
Convolutions and multi-head self-attentions (MHSAs) are typically considered alternatives to each other for building vision backbones.
We propose in this work to use MHSAs and Convs in parallel at different granularity levels instead.
We empirically verify the potential of the proposed integration scheme, named GLMix: by offloading the burden of fine-grained features to light-weight Convs, it is sufficient to use MHSAs in a few semantic slots.
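As a rough illustration of that parallel design, here is a hedged sketch: a depthwise convolution handles the fine-grained map while multi-head self-attention runs on a few pooled "semantic slots". The slot count, the pooling, and the additive merge are illustrative assumptions, not GLMix's actual architecture.

```python
# Hedged sketch of parallel Conv/MHSA integration: a light-weight Conv
# mixes the fine-grained map while MHSA runs on a handful of pooled
# "semantic slots". Slot count, pooling, and merge are assumptions.
import torch
import torch.nn as nn

class ConvAttnBlock(nn.Module):
    def __init__(self, dim, num_slots=16, heads=4):
        super().__init__()
        self.local = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # cheap local mixing
        self.pool = nn.AdaptiveAvgPool2d(int(num_slots ** 0.5))     # coarse slots
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                                 # x: (B, C, H, W)
        fine = self.local(x)
        slots = self.pool(x).flatten(2).transpose(1, 2)   # (B, S, C)
        slots, _ = self.attn(slots, slots, slots)         # global mixing on few tokens
        # Broadcast the globally mixed slots back onto the fine-grained map.
        g = slots.mean(1)[:, :, None, None]               # (B, C, 1, 1)
        return fine + g

block = ConvAttnBlock(dim=64)
out = block(torch.randn(2, 64, 56, 56))                   # (2, 64, 56, 56)
```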
arXiv Detail & Related papers (2024-11-21T18:59:08Z)
- Semantic-SAM: Segment and Recognize Anything at Any Granularity [83.64686655044765]
We introduce Semantic-SAM, a universal image segmentation model that can segment and recognize anything at any desired granularity.
We consolidate multiple datasets across three granularities and introduce decoupled classification for objects and parts.
For the multi-granularity capability, we propose a multi-choice learning scheme during training, enabling each click to generate masks at multiple levels.
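The multi-choice idea can be illustrated with a small hedged sketch: the head emits several candidate masks per click, and each available ground-truth granularity trains only its best-matching candidate. The greedy min matching below is our assumption; the paper's actual assignment scheme may differ.

```python
# Hedged sketch of "multi-choice" mask supervision: K candidate masks per
# click; each ground-truth granularity picks the best-matching candidate.
# The matching rule (greedy min) is an assumption.
import torch
import torch.nn.functional as F

def multi_choice_loss(pred_masks, gt_masks):
    """pred_masks: (K, H, W) logits; gt_masks: list of (H, W) {0,1} masks,
    one per available granularity (e.g. part / object / scene)."""
    total = 0.0
    for gt in gt_masks:
        # Loss of every candidate against this granularity...
        losses = torch.stack([
            F.binary_cross_entropy_with_logits(p, gt) for p in pred_masks
        ])
        total = total + losses.min()   # ...but only the best one is trained
    return total / len(gt_masks)

preds = torch.randn(6, 64, 64, requires_grad=True)   # K = 6 candidates
gts = [torch.randint(0, 2, (64, 64)).float() for _ in range(3)]
loss = multi_choice_loss(preds, gts)
loss.backward()
```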
arXiv Detail & Related papers (2023-07-10T17:59:40Z)
- Learning to Fuse Monocular and Multi-view Cues for Multi-frame Depth Estimation in Dynamic Scenes [51.20150148066458]
We propose a novel method that learns to fuse the multi-view and monocular cues, encoded as volumes, without needing heuristically crafted masks.
Experiments on real-world datasets demonstrate the effectiveness of the proposed method.
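A minimal sketch of what such learned fusion could look like: per-pixel, per-depth-bin weights predicted from both cue volumes replace a hand-crafted mask. The network shape below is an assumption for illustration, not the paper's design.

```python
# Hedged sketch of learned volume fusion: weights predicted from both cue
# volumes decide, per location and depth bin, which cue to trust.
import torch
import torch.nn as nn

class VolumeFusion(nn.Module):
    def __init__(self, depth_bins):
        super().__init__()
        # Predict a fusion weight per pixel per depth bin from both volumes.
        self.weight_net = nn.Conv2d(2 * depth_bins, depth_bins, 3, padding=1)

    def forward(self, multi_view_vol, mono_vol):    # each (B, D, H, W)
        w = torch.sigmoid(self.weight_net(torch.cat([multi_view_vol, mono_vol], 1)))
        return w * multi_view_vol + (1 - w) * mono_vol

fuse = VolumeFusion(depth_bins=64)
fused = fuse(torch.randn(1, 64, 60, 80), torch.randn(1, 64, 60, 80))
```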
arXiv Detail & Related papers (2023-04-18T13:55:24Z)
- USER: Unified Semantic Enhancement with Momentum Contrast for Image-Text Retrieval [115.28586222748478]
Image-Text Retrieval (ITR) aims to find the target instances in one modality that are semantically relevant to a given query from the other modality.
Existing approaches typically suffer from two major limitations.
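The summary leaves the two limitations unstated, but the momentum-contrast ingredient named in the title is a known technique; here is a hedged, MoCo-style sketch of the momentum encoder update. The stand-in encoders and momentum value are placeholder assumptions.

```python
# Hedged sketch of momentum contrast: a momentum-updated key encoder
# provides stable targets for contrastive image-text matching.
import copy
import torch

def momentum_update(query_encoder, key_encoder, m=0.999):
    """Key encoder trails the query encoder as an exponential moving average."""
    with torch.no_grad():
        for q, k in zip(query_encoder.parameters(), key_encoder.parameters()):
            k.mul_(m).add_(q, alpha=1 - m)

image_encoder = torch.nn.Linear(512, 256)          # stand-in encoder
key_encoder = copy.deepcopy(image_encoder)
for p in key_encoder.parameters():
    p.requires_grad_(False)                        # keys carry no gradient
momentum_update(image_encoder, key_encoder)
```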
arXiv Detail & Related papers (2023-01-17T12:42:58Z)
- DuAT: Dual-Aggregation Transformer Network for Medical Image Segmentation [21.717520350930705]
Transformer-based models have been widely shown to be successful in computer vision tasks.
However, they are often dominated by the features of large patterns, leading to the loss of local details.
We propose a Dual-Aggregation Transformer Network called DuAT, which is characterized by two innovative designs.
Our proposed model outperforms state-of-the-art methods in the segmentation of skin lesion images and of polyps in colonoscopy images.
arXiv Detail & Related papers (2022-12-21T07:54:02Z)
- MAFormer: A Transformer Network with Multi-scale Attention Fusion for Visual Recognition [45.68567088645708]
We introduce Multi-scale Attention Fusion into the transformer (MAFormer).
MAFormer explores local aggregation and global feature extraction in a dual-stream framework for visual recognition.
Our MAFormer achieves state-of-the-art performance on common vision tasks.
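As a hedged sketch of such a dual-stream block: one stream aggregates local features with convolutions, the other extracts global context with self-attention over a downsampled map, and the two are fused by upsample-and-add. All sizes and the fusion rule are our assumptions, not MAFormer's actual design.

```python
# Hedged dual-stream sketch: conv stream for local aggregation, attention
# stream on a downsampled map for global context, fused additively.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualStreamBlock(nn.Module):
    def __init__(self, dim, heads=4, down=4):
        super().__init__()
        self.down = down
        self.local = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1), nn.GELU(),
            nn.Conv2d(dim, dim, 3, padding=1),
        )
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                       # (B, C, H, W)
        B, C, H, W = x.shape
        local = self.local(x)
        g = F.adaptive_avg_pool2d(x, (H // self.down, W // self.down))
        tokens = g.flatten(2).transpose(1, 2)   # (B, hw, C)
        tokens, _ = self.attn(tokens, tokens, tokens)
        g = tokens.transpose(1, 2).reshape(B, C, H // self.down, W // self.down)
        g = F.interpolate(g, size=(H, W), mode="bilinear", align_corners=False)
        return local + g                        # fuse local detail with global context

blk = DualStreamBlock(dim=64)
y = blk(torch.randn(2, 64, 32, 32))             # (2, 64, 32, 32)
```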
arXiv Detail & Related papers (2022-08-31T06:29:27Z)
- AF$_2$: Adaptive Focus Framework for Aerial Imagery Segmentation [86.44683367028914]
Aerial imagery segmentation has some unique challenges, the most critical of which is foreground-background imbalance.
We propose the Adaptive Focus Framework (AF$_2$), which adopts a hierarchical segmentation procedure and focuses on adaptively utilizing multi-scale representations.
AF$_2$ significantly improves accuracy on three widely used aerial benchmarks while running as fast as the mainstream methods.
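One way to read "adaptively utilizing multi-scale representations" is a per-pixel weighting over scales; the hedged sketch below fuses per-scale logits with learned confidence weights. The softmax weighting is an illustrative assumption, not AF$_2$'s actual procedure.

```python
# Hedged sketch of adaptive multi-scale fusion: per-scale logits are
# combined with learned per-pixel confidence weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveMultiScaleHead(nn.Module):
    def __init__(self, dim, num_classes, num_scales=3):
        super().__init__()
        self.classify = nn.ModuleList(
            nn.Conv2d(dim, num_classes, 1) for _ in range(num_scales)
        )
        self.confidence = nn.ModuleList(
            nn.Conv2d(dim, 1, 1) for _ in range(num_scales)
        )

    def forward(self, feats, out_size):          # feats: list of (B, C, Hi, Wi)
        logits, confs = [], []
        for f, cls, conf in zip(feats, self.classify, self.confidence):
            logits.append(F.interpolate(cls(f), out_size, mode="bilinear",
                                        align_corners=False))
            confs.append(F.interpolate(conf(f), out_size, mode="bilinear",
                                       align_corners=False))
        w = torch.softmax(torch.cat(confs, 1), 1).unsqueeze(2)  # (B, S, 1, H, W)
        return (torch.stack(logits, 1) * w).sum(1)              # weighted fusion

head = AdaptiveMultiScaleHead(dim=64, num_classes=6)
feats = [torch.randn(1, 64, s, s) for s in (64, 32, 16)]
out = head(feats, out_size=(128, 128))                          # (1, 6, 128, 128)
```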
arXiv Detail & Related papers (2022-02-18T10:14:45Z)
- MGML: Multi-Granularity Multi-Level Feature Ensemble Network for Remote Sensing Scene Classification [15.856162817494726]
We propose a Multi-Granularity Multi-Level Feature Ensemble Network (MGML-FENet) to efficiently tackle the RS scene classification task.
We show that our proposed networks achieve better performance than previous state-of-the-art (SOTA) networks.
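A hedged sketch of the feature-ensemble idea: classifiers attached to several backbone levels vote on the scene label. The stage dimensions and the simple logit averaging are illustrative assumptions, not MGML-FENet's exact ensembling.

```python
# Hedged sketch of a multi-level feature ensemble: per-stage classifiers
# whose logits are averaged into the final scene prediction.
import torch
import torch.nn as nn

class FeatureEnsemble(nn.Module):
    def __init__(self, stage_dims, num_classes):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                          nn.Linear(d, num_classes))
            for d in stage_dims
        )

    def forward(self, stage_feats):              # one tensor per backbone stage
        logits = [h(f) for h, f in zip(self.heads, stage_feats)]
        return torch.stack(logits).mean(0)       # ensemble by averaging

net = FeatureEnsemble(stage_dims=[128, 256, 512], num_classes=45)
feats = [torch.randn(2, d, s, s) for d, s in [(128, 28), (256, 14), (512, 7)]]
print(net(feats).shape)                          # torch.Size([2, 45])
```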
arXiv Detail & Related papers (2020-12-29T02:18:11Z)
- Fine-Grained Visual Classification via Progressive Multi-Granularity Training of Jigsaw Patches [67.51747235117]
Fine-grained visual classification (FGVC) is much more challenging than traditional classification tasks.
Recent works mainly tackle this problem by focusing on how to locate the most discriminative parts.
We propose a novel framework for fine-grained visual classification to tackle these problems.
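The jigsaw ingredient named in the title can be sketched as follows: an image's n x n patches are shuffled so each training stage must rely on granularity-specific cues, with n decreasing as training progresses. The generator below is the standard jigsaw-shuffle idea; the progressive schedule and stage wiring are assumptions.

```python
# Hedged sketch of a jigsaw patch generator: split an image into an
# n x n grid, shuffle the cells, and reassemble.
import torch

def jigsaw(images, n):
    """Split (B, C, H, W) images into an n x n grid and shuffle the patches."""
    B, C, H, W = images.shape
    ph, pw = H // n, W // n
    patches = images.unfold(2, ph, ph).unfold(3, pw, pw)       # (B, C, n, n, ph, pw)
    patches = patches.reshape(B, C, n * n, ph, pw)
    patches = patches[:, :, torch.randperm(n * n)]             # shuffle grid cells
    patches = patches.reshape(B, C, n, n, ph, pw)
    return patches.permute(0, 1, 2, 4, 3, 5).reshape(B, C, H, W)

x = torch.randn(2, 3, 224, 224)
for n in (8, 4, 2, 1):            # progressive: fine granularity first
    shuffled = jigsaw(x, n)       # feed to the corresponding training stage
```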
arXiv Detail & Related papers (2020-03-08T19:27:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.