Glimpse-Attend-and-Explore: Self-Attention for Active Visual Exploration
- URL: http://arxiv.org/abs/2108.11717v1
- Date: Thu, 26 Aug 2021 11:41:03 GMT
- Title: Glimpse-Attend-and-Explore: Self-Attention for Active Visual Exploration
- Authors: Soroush Seifi, Abhishek Jha, Tinne Tuytelaars
- Abstract summary: Active visual exploration aims to assist an agent with a limited field of view to understand its environment based on partial observations.
We propose the Glimpse-Attend-and-Explore model which employs self-attention to guide the visual exploration instead of task-specific uncertainty maps.
Our model provides encouraging results while being less dependent on dataset bias in driving the exploration.
- Score: 47.01485765231528
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Active visual exploration aims to assist an agent with a limited field of
view to understand its environment based on partial observations made by
choosing the best viewing directions in the scene. Recent methods have tried to
address this problem either by using reinforcement learning, which is difficult
to train, or by uncertainty maps, which are task-specific and can only be
implemented for dense prediction tasks. In this paper, we propose the
Glimpse-Attend-and-Explore model which: (a) employs self-attention to guide the
visual exploration instead of task-specific uncertainty maps; (b) can be used
for both dense and sparse prediction tasks; and (c) uses a contrastive stream
to further improve the representations learned. Unlike previous works, we show
the application of our model on multiple tasks like reconstruction,
segmentation and classification. Our model provides encouraging results while
being less dependent on dataset bias in driving the exploration. We further
perform an ablation study to investigate the features and attention learned by
our model. Finally, we show that our self-attention module learns to attend
different regions of the scene by minimizing the loss on the downstream task.
Code: https://github.com/soroushseifi/glimpse-attend-explore.
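To make the idea concrete, the following is a minimal, hypothetical PyTorch sketch of how self-attention over glimpse embeddings could be used to score candidate viewing directions; the module name, feature dimensions, and scoring head are illustrative assumptions rather than the authors' implementation (see the linked repository for that).

import torch
import torch.nn as nn

class GlimpseAttentionScorer(nn.Module):
    # Hypothetical sketch, not the authors' code: self-attention over the
    # joint sequence of observed-glimpse features and candidate-location
    # tokens, followed by a linear head that scores each candidate view.
    def __init__(self, feat_dim=256, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, glimpse_feats, candidate_feats):
        # glimpse_feats:   (B, N_seen, D) features of glimpses taken so far
        # candidate_feats: (B, N_cand, D) embeddings of unexplored locations
        tokens = torch.cat([glimpse_feats, candidate_feats], dim=1)
        attended, _ = self.attn(tokens, tokens, tokens)   # self-attention over all tokens
        cand_out = attended[:, glimpse_feats.size(1):]    # keep candidate tokens only
        return self.score(cand_out).squeeze(-1)           # (B, N_cand) exploration scores

# Usage: pick the highest-scoring candidate as the next viewing direction.
scorer = GlimpseAttentionScorer()
seen = torch.randn(2, 3, 256)    # 3 glimpses already observed
cands = torch.randn(2, 8, 256)   # 8 candidate viewing directions
next_view = scorer(seen, cands).argmax(dim=-1)            # (B,) indices

According to the abstract, the attention is trained by minimizing the loss on the downstream task (reconstruction, segmentation or classification) rather than relying on task-specific uncertainty maps.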
Related papers
- Semantic-Based Active Perception for Humanoid Visual Tasks with Foveal Sensors [49.99728312519117]
The aim of this work is to establish how accurately a recent semantic-based active perception model is able to complete visual tasks that are regularly performed by humans.
This model exploits the ability of current object detectors to localize and classify a large number of object classes and to update a semantic description of a scene across multiple fixations.
In the task of scene exploration, the semantic-based method demonstrates superior performance compared to the traditional saliency-based model.
arXiv Detail & Related papers (2024-04-16T18:15:57Z)
- Active Sensing with Predictive Coding and Uncertainty Minimization [0.0]
We present an end-to-end procedure for embodied exploration inspired by two biological computations.
We first demonstrate our approach in a maze navigation task and show that it can discover the underlying transition distributions and spatial features of the environment.
We show that our model builds unsupervised representations through exploration that allow it to efficiently categorize visual scenes.
arXiv Detail & Related papers (2023-07-02T21:14:49Z)
- Learning to Explore Informative Trajectories and Samples for Embodied Perception [24.006056116516618]
Generalizing perception models to unseen embodied tasks is insufficiently studied.
We build a 3D semantic distribution map to train the exploration policy in a self-supervised manner.
With the explored informative trajectories, we propose to select hard samples on trajectories based on the semantic distribution uncertainty.
Experiments show that the perception model fine-tuned with our method outperforms the baselines trained with other exploration policies.
arXiv Detail & Related papers (2023-03-20T08:20:04Z)
- ALSO: Automotive Lidar Self-supervision by Occupancy estimation [70.70557577874155]
We propose a new self-supervised method for pre-training the backbone of deep perception models operating on point clouds.
The core idea is to train the model on a pretext task which is the reconstruction of the surface on which the 3D points are sampled.
The intuition is that if the network is able to reconstruct the scene surface, given only sparse input points, then it probably also captures some fragments of semantic information.
arXiv Detail & Related papers (2022-12-12T13:10:19Z)
- Task Formulation Matters When Learning Continually: A Case Study in Visual Question Answering [58.82325933356066]
Continual learning aims to train a model incrementally on a sequence of tasks without forgetting previous knowledge.
We present a detailed study of how different settings affect performance for Visual Question Answering.
arXiv Detail & Related papers (2022-09-30T19:12:58Z)
- Revisiting Contrastive Methods for Unsupervised Learning of Visual Representations [78.12377360145078]
Contrastive self-supervised learning has outperformed supervised pretraining on many downstream tasks like segmentation and object detection.
In this paper, we first study how biases in the dataset affect existing methods.
We show that current contrastive approaches work surprisingly well across: (i) object- versus scene-centric, (ii) uniform versus long-tailed and (iii) general versus domain-specific datasets.
arXiv Detail & Related papers (2021-06-10T17:59:13Z)
- Embodied Visual Active Learning for Semantic Segmentation [33.02424587900808]
We study the task of embodied visual active learning, where an agent is set to explore a 3D environment with the goal of acquiring visual scene understanding.
We develop a battery of agents - both learnt and pre-specified - with different levels of knowledge of the environment.
We extensively evaluate the proposed models using the Matterport3D simulator and show that a fully learnt method outperforms comparable pre-specified counterparts.
arXiv Detail & Related papers (2020-12-17T11:02:34Z)
- Latent World Models For Intrinsically Motivated Exploration [140.21871701134626]
We present a self-supervised representation learning method for image-based observations.
We consider episodic and life-long uncertainties to guide the exploration of partially observable environments.
arXiv Detail & Related papers (2020-10-05T19:47:04Z)
- Analyzing Visual Representations in Embodied Navigation Tasks [45.35107294831313]
We use the recently proposed projection weighted Canonical Correlation Analysis (PWCCA) to measure the similarity of visual representations learned in the same environment by performing different tasks.
We then empirically demonstrate that visual representations learned on one task can be effectively transferred to a different task.
arXiv Detail & Related papers (2020-03-12T19:43:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.