Aerial Scene Understanding in The Wild: Multi-Scene Recognition via
Prototype-based Memory Networks
- URL: http://arxiv.org/abs/2104.11200v1
- Date: Thu, 22 Apr 2021 17:32:14 GMT
- Title: Aerial Scene Understanding in The Wild: Multi-Scene Recognition via
Prototype-based Memory Networks
- Authors: Yuansheng Hua, Lichao Mou, Jianzhe Lin, Konrad Heidler, Xiao Xiang
Zhu
- Abstract summary: We propose a prototype-based memory network to recognize multiple scenes in a single image.
The proposed network consists of three key components: 1) a prototype learning module, 2) a prototype-inhabiting external memory, and 3) a multi-head attention-based memory retrieval module.
To facilitate the progress of aerial scene recognition, we produce a new multi-scene aerial image (MAI) dataset.
- Score: 14.218223473363276
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Aerial scene recognition is a fundamental visual task that has
attracted increasing research interest in recent years. Most current research
focuses on assigning an aerial image a single scene-level label, whereas
real-world images often contain multiple scenes. In this paper, we therefore
take a step toward a more practical and challenging task: multi-scene
recognition in single images. Since manually annotating images for such a
task is extraordinarily time- and labor-intensive, we propose a
prototype-based memory network that recognizes multiple scenes in a single
image by leveraging massive, well-annotated single-scene images. The proposed
network consists of three key components: 1) a prototype learning module, 2) a
prototype-inhabiting external memory, and 3) a multi-head attention-based
memory retrieval module. Specifically, we first learn a prototype
representation of each aerial scene from single-scene aerial image datasets
and store it in an external memory. A multi-head attention-based memory
retrieval module then retrieves the scene prototypes relevant to a query
multi-scene image for the final prediction. Notably, only a limited number of
annotated multi-scene images are needed in the training phase. To facilitate
progress in aerial scene recognition, we also produce a new multi-scene
aerial image (MAI) dataset. Experimental results on various dataset
configurations demonstrate the effectiveness of our network. Our dataset and
code are publicly available.
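The retrieval step the abstract describes can be pictured as multi-head
attention posed over a bank of learned scene prototypes. Below is a minimal
PyTorch sketch of that idea; the feature dimension, head count, class count,
and module names are illustrative assumptions, not the authors' released
implementation.

```python
import torch
import torch.nn as nn

class PrototypeMemoryNet(nn.Module):
    def __init__(self, num_scenes: int, dim: int = 512, heads: int = 8):
        super().__init__()
        # External memory holding one prototype per aerial scene class.
        # The paper learns these from single-scene images first; here they
        # are plain learnable parameters to keep the sketch self-contained.
        self.memory = nn.Parameter(torch.randn(num_scenes, dim))
        # Multi-head attention retrieves prototypes relevant to the query.
        self.retrieval = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(dim, num_scenes)

    def forward(self, query_feat: torch.Tensor) -> torch.Tensor:
        # query_feat: (B, dim) global feature of a multi-scene image,
        # e.g. pooled from any CNN backbone.
        q = query_feat.unsqueeze(1)                   # (B, 1, dim)
        mem = self.memory.unsqueeze(0).expand(q.size(0), -1, -1)
        retrieved, _ = self.retrieval(q, mem, mem)    # attend over prototypes
        return self.classifier(retrieved.squeeze(1))  # multi-label logits

net = PrototypeMemoryNet(num_scenes=36)
scores = torch.sigmoid(net(torch.randn(4, 512)))      # (4, 36) scene scores
```

Because the prediction is multi-label, per-class sigmoid scores (rather than a
softmax) match the multi-scene setting.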
Related papers
- Improving Image Recognition by Retrieving from Web-Scale Image-Text Data [68.63453336523318]
We introduce an attention-based memory module, which learns the importance of each retrieved example from the memory.
Compared to existing approaches, our method removes the influence of the irrelevant retrieved examples, and retains those that are beneficial to the input query.
We show that it achieves state-of-the-art accuracy on the ImageNet-LT, Places-LT, and WebVision datasets.
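The described memory module weights each retrieved example by its relevance to
the query so that irrelevant neighbors are suppressed. A hedged sketch of that
weighting; the parameter-free dot-product scoring and the shapes are
assumptions standing in for the paper's learned importance module:

```python
import torch
import torch.nn.functional as F

def attend_to_retrieved(query, retrieved):
    """query: (B, D) embedding; retrieved: (B, K, D) embeddings of K
    retrieved examples. Returns a (B, D) summary in which irrelevant
    neighbors receive low softmax weight."""
    # Scaled dot-product relevance of each retrieved example to the query.
    scores = torch.einsum('bd,bkd->bk', query, retrieved) / query.size(-1) ** 0.5
    weights = F.softmax(scores, dim=-1)  # per-example importance
    return torch.einsum('bk,bkd->bd', weights, retrieved)

summary = attend_to_retrieved(torch.randn(2, 256), torch.randn(2, 10, 256))
```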
arXiv Detail & Related papers (2023-04-11T12:12:05Z)
- Multi-Modal Masked Autoencoders for Medical Vision-and-Language Pre-Training [62.215025958347105]
We propose a self-supervised learning paradigm with multi-modal masked autoencoders.
We learn cross-modal domain knowledge by reconstructing missing pixels and tokens from randomly masked images and texts.
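The objective described here amounts to masking both modalities at random and
reconstructing what was hidden. The sketch below shows only the masking step;
the mask ratios and shapes are assumptions, and the cross-modal
encoder-decoder itself is omitted.

```python
import torch

def random_mask(x: torch.Tensor, ratio: float):
    """x: (B, N, D) patch or token embeddings. Zeroes a random subset of
    positions and returns the boolean mask (True = masked), so the
    reconstruction loss can be restricted to masked positions."""
    mask = torch.rand(x.shape[:2]) < ratio
    return x.masked_fill(mask.unsqueeze(-1), 0.0), mask

patches, tokens = torch.randn(2, 196, 768), torch.randn(2, 32, 768)
patches_m, pmask = random_mask(patches, 0.75)  # heavy masking for image patches
tokens_m, tmask = random_mask(tokens, 0.15)    # lighter masking for text tokens
# A cross-modal encoder-decoder would then reconstruct the masked entries:
# MSE on pixels for images, cross-entropy on vocabulary ids for text.
```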
arXiv Detail & Related papers (2022-09-15T07:26:43Z)
- A Simple Baseline for Multi-Camera 3D Object Detection [94.63944826540491]
3D object detection with surrounding cameras has been a promising direction for autonomous driving.
We present SimMOD, a Simple baseline for Multi-camera Object Detection.
We conduct extensive experiments on the 3D object detection benchmark of nuScenes to demonstrate the effectiveness of SimMOD.
arXiv Detail & Related papers (2022-08-22T03:38:01Z) - Self-attention on Multi-Shifted Windows for Scene Segmentation [14.47974086177051]
We explore the effective use of self-attention within multi-scale image windows to learn descriptive visual features.
We propose three different strategies to aggregate these feature maps to decode the feature representation for dense prediction.
Our models achieve very promising performance on four public scene segmentation datasets.
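The core operation, self-attention restricted to local windows, can be
sketched as follows. This simplifies the paper's multi-shifted, multi-size
windows to a single unshifted partition; all shapes are assumptions.

```python
import torch
import torch.nn as nn

def window_partition(x: torch.Tensor, ws: int) -> torch.Tensor:
    """x: (B, H, W, C) -> (num_windows*B, ws*ws, C); H, W divisible by ws."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

attn = nn.MultiheadAttention(embed_dim=96, num_heads=3, batch_first=True)
feat = torch.randn(2, 56, 56, 96)         # an intermediate feature map
windows = window_partition(feat, ws=7)    # (2*64, 49, 96)
out, _ = attn(windows, windows, windows)  # self-attention within each window
# Shifting the partition grid and varying ws yields the multi-shifted variants
# whose feature maps are then aggregated for dense prediction.
```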
arXiv Detail & Related papers (2022-07-10T07:36:36Z) - Diverse Instance Discovery: Vision-Transformer for Instance-Aware
Multi-Label Image Recognition [24.406654146411682]
This work builds on the Vision Transformer (ViT).
Our goal is to leverage ViT's patch tokens and self-attention mechanism to mine rich instances in multi-label images.
We propose a weakly supervised object localization-based approach to extract multi-scale local features.
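One plausible reading of mining instances from patch tokens is to pool only
the patches the class token attends to most. The keep ratio and mean pooling
below are assumptions for illustration, not the paper's exact method.

```python
import torch

def mine_instances(patch_tokens, cls_attn, keep_ratio=0.25):
    """patch_tokens: (B, N, D) from a ViT; cls_attn: (B, N) attention of the
    CLS token over patches. Pools only the most-attended patches."""
    k = max(1, int(cls_attn.size(1) * keep_ratio))
    idx = cls_attn.topk(k, dim=1).indices  # indices of salient patches
    picked = torch.gather(
        patch_tokens, 1,
        idx.unsqueeze(-1).expand(-1, -1, patch_tokens.size(-1)))
    return picked.mean(dim=1)              # (B, D) instance-aware local feature

feat = mine_instances(torch.randn(2, 196, 768), torch.rand(2, 196))
```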
arXiv Detail & Related papers (2022-04-22T14:38:40Z) - Rectifying the Shortcut Learning of Background: Shared Object
Concentration for Few-Shot Image Recognition [101.59989523028264]
Few-Shot image classification aims to utilize pretrained knowledge learned from a large-scale dataset to tackle a series of downstream classification tasks.
We propose COSOC, a novel Few-Shot Learning framework, to automatically figure out foreground objects at both the pretraining and evaluation stages.
arXiv Detail & Related papers (2021-07-16T07:46:41Z) - DeepMultiCap: Performance Capture of Multiple Characters Using Sparse
Multiview Cameras [63.186486240525554]
DeepMultiCap is a novel method for multi-person performance capture using sparse multi-view cameras.
Our method can capture time-varying surface details without the need for pre-scanned template models.
arXiv Detail & Related papers (2021-05-01T14:32:13Z) - MultiScene: A Large-scale Dataset and Benchmark for Multi-scene
Recognition in Single Aerial Images [17.797726722637634]
We create a large-scale dataset, called MultiScene, composed of 100,000 high-resolution aerial images.
We visually inspect 14,000 images and correct their scene labels, yielding a subset of cleanly-annotated images, named MultiScene-Clean.
We conduct experiments with extensive baseline models on both MultiScene-Clean and MultiScene to offer benchmarks for multi-scene recognition in single images.
arXiv Detail & Related papers (2021-04-07T01:09:12Z) - Cross-Media Keyphrase Prediction: A Unified Framework with
Multi-Modality Multi-Head Attention and Image Wordings [63.79979145520512]
We explore the joint effects of texts and images in predicting the keyphrases for a multimedia post.
We propose a novel Multi-Modality Multi-Head Attention (M3H-Att) to capture the intricate cross-media interactions.
Our model significantly outperforms the previous state of the art based on traditional attention networks.
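The cross-media interaction can be pictured as cross-modal multi-head
attention, e.g. text tokens attending to image regions; the full M3H-Att
fuses several such streams. The dimensions and the single direction shown
here are assumptions.

```python
import torch
import torch.nn as nn

cross_attn = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
text = torch.randn(2, 20, 256)   # token embeddings of the post's text
image = torch.randn(2, 49, 256)  # region/grid features of the attached image
# Each text token gathers visual context from the image regions.
text_aware_of_image, _ = cross_attn(query=text, key=image, value=image)
# The fused representation then feeds the keyphrase prediction head.
```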
arXiv Detail & Related papers (2020-11-03T08:44:18Z) - Multiple instance learning on deep features for weakly supervised object
detection with extreme domain shifts [1.9336815376402716]
Weakly supervised object detection (WSOD) using only image-level annotations has attracted growing attention over the past few years.
We show that a simple multiple instance approach applied to pre-trained deep features yields excellent performance on non-photographic datasets.
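A standard way to realize a simple multiple-instance approach on frozen deep
features is to score each region proposal and max-pool to the image level:
the image is positive for a class if any region is. The sketch follows that
classic recipe and is not necessarily the paper's exact formulation.

```python
import torch
import torch.nn as nn

class MILHead(nn.Module):
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.score = nn.Linear(dim, num_classes)  # per-instance class scores

    def forward(self, bag: torch.Tensor) -> torch.Tensor:
        # bag: (num_regions, dim) frozen deep features of one image's
        # proposals. Max over instances gives the image-level logits,
        # which the image-level labels supervise directly.
        return self.score(bag).max(dim=0).values  # (num_classes,)

head = MILHead(dim=2048, num_classes=20)
image_logits = head(torch.randn(300, 2048))       # 300 region proposals
```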
arXiv Detail & Related papers (2020-08-03T20:36:01Z) - AiRound and CV-BrCT: Novel Multi-View Datasets for Scene Classification [2.931113769364182]
We present two new publicly available datasets named AiRound and CV-BrCT.
The first contains triplets of images from the same geographic coordinate, captured from different viewing perspectives at various places around the world.
The second contains pairs of aerial and street-level images from southeast Brazil.
arXiv Detail & Related papers (2020-08-03T18:55:46Z)