Semantic Reinforced Attention Learning for Visual Place Recognition
- URL: http://arxiv.org/abs/2108.08443v1
- Date: Thu, 19 Aug 2021 02:14:36 GMT
- Title: Semantic Reinforced Attention Learning for Visual Place Recognition
- Authors: Guohao Peng, Yufeng Yue, Jun Zhang, Zhenyu Wu, Xiaoyu Tang and Danwei
Wang
- Abstract summary: Large-scale visual place recognition (VPR) is inherently challenging because not all visual cues in the image are beneficial to the task.
We propose a novel Semantic Reinforced Attention Learning Network (SRALNet), in which the inferred attention can benefit from both semantic priors and data-driven fine-tuning.
Experiments demonstrate that our method outperforms state-of-the-art techniques on city-scale VPR benchmark datasets.
- Score: 15.84086970453363
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large-scale visual place recognition (VPR) is inherently challenging because
not all visual cues in the image are beneficial to the task. In order to
highlight the task-relevant visual cues in the feature embedding, the existing
attention mechanisms are either based on artificial rules or trained in a
thorough data-driven manner. To fill the gap between the two types, we propose
a novel Semantic Reinforced Attention Learning Network (SRALNet), in which the
inferred attention can benefit from both semantic priors and data-driven
fine-tuning. The contribution is two-fold. (1) To suppress misleading
local features, an interpretable local weighting scheme is proposed based on
hierarchical feature distribution. (2) By exploiting the interpretability of
the local weighting scheme, a semantic constrained initialization is proposed
so that the local attention can be reinforced by semantic priors. Experiments
demonstrate that our method outperforms state-of-the-art techniques on
city-scale VPR benchmark datasets.
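
For readers who want a concrete picture of attention-weighted aggregation, the sketch below illustrates the general idea in PyTorch: a 1x1 convolution scores each local feature, and the scores weight the features before pooling into a global descriptor. The module name, the scoring head, and the pooling choice are illustrative assumptions, not the authors' released implementation; SRALNet additionally reinforces the attention with semantic priors, which is only indicated by a comment here.

```python
# Minimal sketch of attention-weighted local feature aggregation for VPR.
# Names and the 1x1-conv scoring head are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalAttentionAggregation(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        # One attention score per spatial location; in SRALNet this scoring is
        # additionally reinforced by semantic priors through a
        # semantic-constrained initialization (not reproduced in this sketch).
        self.score = nn.Conv2d(dim, 1, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) local features from a CNN backbone
        attn = torch.softmax(self.score(feats).flatten(2), dim=-1)  # (B, 1, H*W)
        flat = feats.flatten(2)                                     # (B, C, H*W)
        desc = (flat * attn).sum(dim=-1)                            # (B, C)
        return F.normalize(desc, dim=-1)                            # unit-norm global descriptor


feats = torch.randn(2, 512, 30, 40)                 # e.g. conv5 features of two images
descriptor = LocalAttentionAggregation(512)(feats)  # shape: (2, 512)
```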
Related papers
- ResCLIP: Residual Attention for Training-free Dense Vision-language Inference [27.551367463011008]
Cross-correlation of self-attention in CLIP's non-final layers also exhibits localization properties.
We propose the Residual Cross-correlation Self-attention (RCS) module, which leverages the cross-correlation self-attention from intermediate layers to remold the attention in the final block.
The RCS module effectively reorganizes spatial information, unleashing the localization potential within CLIP for dense vision-language inference.
arXiv Detail & Related papers (2024-11-24T14:14:14Z)
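
As a rough illustration of reusing intermediate-layer attention, the sketch below averages attention maps from earlier layers and blends them into the final block's attention before applying it to the value tokens. The averaging and the blending weight are assumptions for illustration, not ResCLIP's exact RCS formulation.

```python
# Illustrative sketch: aggregate better-localized intermediate attention and
# blend it residually into the final block; `alpha` is an assumed weight.
import torch

def residual_cross_correlation_attention(inter_attns, final_attn, final_v, alpha=0.5):
    """
    inter_attns: list of (B, heads, N, N) attention maps from intermediate layers
    final_attn:  (B, heads, N, N) attention map of the final block
    final_v:     (B, heads, N, D) value tokens of the final block
    """
    remolded = torch.stack(inter_attns).mean(dim=0)     # aggregate localization cues
    remolded = alpha * remolded + (1.0 - alpha) * final_attn
    return remolded @ final_v                            # (B, heads, N, D) re-weighted tokens
```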
- Instructing Prompt-to-Prompt Generation for Zero-Shot Learning [116.33775552866476]
We propose a Prompt-to-Prompt generation methodology (P2P) to distill instructive visual prompts for transferable knowledge discovery.
The core of P2P is to mine semantic-related instruction from prompt-conditioned visual features and text instruction on modal-sharing semantic concepts.
arXiv Detail & Related papers (2024-06-05T07:59:48Z)
- Dual Relation Mining Network for Zero-Shot Learning [48.89161627050706]
We propose a Dual Relation Mining Network (DRMN) to enable effective visual-semantic interactions and learn semantic relationship among attributes for knowledge transfer.
Specifically, we introduce a Dual Attention Block (DAB) for visual-semantic relationship mining, which enriches visual information by multi-level feature fusion.
For semantic relationship modeling, we utilize a Semantic Interaction Transformer (SIT) to enhance the generalization of attribute representations among images.
arXiv Detail & Related papers (2024-05-06T16:31:19Z)
- Improving Weakly-Supervised Object Localization Using Adversarial Erasing and Pseudo Label [7.400926717561454]
This paper investigates a framework for weakly-supervised object localization.
It aims to train a neural network capable of predicting both the object class and its location using only images and their image-level class labels.
arXiv Detail & Related papers (2024-04-15T06:02:09Z)
- Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767]
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks.
We introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception.
Our approach achieves new performance records in depth estimation on NYU Depth V2 and KITTI, and in semantic segmentation on Cityscapes.
arXiv Detail & Related papers (2023-12-22T14:40:55Z)
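
The notion of learnable meta prompts can be sketched as a small set of trainable embeddings that read out features from a frozen backbone via cross-attention. The prompt count, dimensions, and use of nn.MultiheadAttention below are illustrative assumptions rather than the paper's architecture.

```python
# Sketch only: trainable "meta prompt" embeddings query a frozen backbone's
# tokens via cross-attention to produce task-adapted features.
import torch
import torch.nn as nn

class MetaPromptReadout(nn.Module):
    def __init__(self, dim: int = 256, num_prompts: int = 16):
        super().__init__()
        # A small bank of trainable meta-prompt embeddings.
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, dim) tokens from a frozen (e.g. diffusion) backbone
        q = self.prompts.unsqueeze(0).expand(feats.size(0), -1, -1)
        out, _ = self.attn(q, feats, feats)  # prompts attend to backbone tokens
        return out                           # (B, num_prompts, dim) task features
```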
- Robust Saliency-Aware Distillation for Few-shot Fine-grained Visual Recognition [57.08108545219043]
Recognizing novel sub-categories with scarce samples is an essential and challenging research topic in computer vision.
Existing literature addresses this challenge by employing local-based representation approaches.
This article proposes a novel model, Robust Saliency-aware Distillation (RSaD), for few-shot fine-grained visual recognition.
arXiv Detail & Related papers (2023-05-12T00:13:17Z)
- Cross-modal Representation Learning for Zero-shot Action Recognition [67.57406812235767]
We present a cross-modal Transformer-based framework, which jointly encodes video data and text labels for zero-shot action recognition (ZSAR).
Our model employs a conceptually new pipeline by which visual representations are learned in conjunction with visual-semantic associations in an end-to-end manner.
Experiment results show our model considerably improves upon the state of the art in ZSAR, reaching encouraging top-1 accuracy on the UCF101, HMDB51, and ActivityNet benchmark datasets.
arXiv Detail & Related papers (2022-05-03T17:39:27Z)
- Learning Semantics for Visual Place Recognition through Multi-Scale Attention [14.738954189759156]
We present the first VPR algorithm that learns robust global embeddings from both visual appearance and semantic content of the data.
Experiments on various scenarios validate this new approach and demonstrate its performance against state-of-the-art methods.
arXiv Detail & Related papers (2022-01-24T14:13:12Z)
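
A minimal sketch of combining appearance and semantic cues over multiple scales into one place descriptor is given below; simple concatenation and average pooling stand in for the paper's learned multi-scale attention, which is not reproduced here.

```python
# Illustrative sketch only: pool appearance and semantic feature maps at
# several scales into one global place descriptor (no learned attention).
import torch
import torch.nn.functional as F

def multiscale_place_descriptor(appearance_maps, semantic_maps):
    """appearance_maps, semantic_maps: lists of (B, C, H, W) maps, one pair per scale."""
    pooled = []
    for app, sem in zip(appearance_maps, semantic_maps):
        fused = torch.cat([app, sem], dim=1)                        # join the two cues
        pooled.append(F.adaptive_avg_pool2d(fused, 1).flatten(1))   # (B, C_app + C_sem)
    return F.normalize(torch.cat(pooled, dim=1), dim=-1)            # unit-norm global descriptor
```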
- Variational Structured Attention Networks for Deep Visual Representation Learning [49.80498066480928]
We propose a unified deep framework to jointly learn both spatial attention maps and channel attention in a principled manner.
Specifically, we integrate the estimation and the interaction of the attentions within a probabilistic representation learning framework.
We implement the inference rules within the neural network, thus allowing for end-to-end learning of the probabilistic and the CNN front-end parameters.
arXiv Detail & Related papers (2021-03-05T07:37:24Z)
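
The sketch below shows a deterministic simplification of joint spatial and channel attention: a per-location gate and a per-channel gate are learned and applied together. The paper estimates these attentions within a variational, probabilistic framework, which this illustration does not reproduce.

```python
# Deterministic simplification for illustration; the paper's variational
# inference over the attention variables is not reproduced here.
import torch
import torch.nn as nn

class JointSpatialChannelAttention(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.spatial = nn.Conv2d(dim, 1, kernel_size=1)  # per-location gate
        self.channel = nn.Linear(dim, dim)               # per-channel gate

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map
        s = torch.sigmoid(self.spatial(x))                   # (B, 1, H, W) spatial attention
        c = torch.sigmoid(self.channel(x.mean(dim=(2, 3))))  # (B, C) channel attention
        return x * s * c[:, :, None, None]                   # jointly attended features
```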
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.