VisualEchoes: Spatial Image Representation Learning through Echolocation
- URL: http://arxiv.org/abs/2005.01616v2
- Date: Fri, 17 Jul 2020 17:13:38 GMT
- Title: VisualEchoes: Spatial Image Representation Learning through Echolocation
- Authors: Ruohan Gao, Changan Chen, Ziad Al-Halah, Carl Schissler, Kristen
Grauman
- Abstract summary: Several animal species (e.g., bats, dolphins, and whales) and even visually impaired humans have the remarkable ability to perform echolocation.
We propose a novel interaction-based representation learning framework that learns useful visual features via echolocation.
Our work opens a new path for representation learning for embodied agents, where supervision comes from interacting with the physical world.
- Score: 97.23789910400387
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Several animal species (e.g., bats, dolphins, and whales) and even visually
impaired humans have the remarkable ability to perform echolocation: a
biological sonar used to perceive spatial layout and locate objects in the
world. We explore the spatial cues contained in echoes and how they can benefit
vision tasks that require spatial reasoning. First we capture echo responses in
photo-realistic 3D indoor scene environments. Then we propose a novel
interaction-based representation learning framework that learns useful visual
features via echolocation. We show that the learned image features are useful
for multiple downstream vision tasks requiring spatial reasoning (monocular
depth estimation, surface normal estimation, and visual navigation), with
results comparable to or even better than those of heavily supervised
pre-training. Our
work opens a new path for representation learning for embodied agents, where
supervision comes from interacting with the physical world.
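To make the pretext task concrete, here is a minimal sketch of echo-based pretraining: an RGB encoder and a binaural echo-spectrogram encoder are trained jointly to pick out which of several candidate echoes was recorded from the camera's viewpoint. The module name (EchoMatchNet), the tensor shapes, and the matching loss are assumptions for illustration, not the authors' exact architecture.

# Minimal sketch (assumed architecture, not the paper's exact model) of
# echolocation-based pretraining: match an RGB view to the binaural echo
# recorded from the same viewpoint, out of K candidate echoes.
import torch
import torch.nn as nn
import torch.nn.functional as F


class EchoMatchNet(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, feat_dim=128):
        super().__init__()
        # Image branch: a small CNN standing in for the visual encoder.
        self.img_enc = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Echo branch: encodes a 2-channel (binaural) echo spectrogram.
        self.echo_enc = nn.Sequential(
            nn.Conv2d(2, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )

    def forward(self, image, echoes):
        # image:  (B, 3, H, W) RGB view
        # echoes: (B, K, 2, F, T) candidate binaural echo spectrograms
        B, K = echoes.shape[:2]
        v = F.normalize(self.img_enc(image), dim=-1)                  # (B, D)
        e = F.normalize(self.echo_enc(echoes.flatten(0, 1)), dim=-1)
        e = e.view(B, K, -1)                                          # (B, K, D)
        # Score each candidate echo against the image feature.
        return torch.einsum('bd,bkd->bk', v, e)                       # (B, K)


# Toy usage: classify which of 4 candidate echoes matches each image.
model = EchoMatchNet()
image = torch.randn(4, 3, 128, 128)
echoes = torch.randn(4, 4, 2, 64, 32)
target = torch.randint(0, 4, (4,))   # index of the matching echo
loss = F.cross_entropy(model(image, echoes), target)
loss.backward()

After pretraining of this kind, the image encoder can be reused as the initialization for the downstream depth, surface-normal, and navigation models described in the abstract.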
Related papers
- Learning 3D object-centric representation through prediction [12.008668555280668]
We develop a novel network architecture that learns to 1) segment objects from discrete images, 2) infer their 3D locations, and 3) perceive depth.
The core idea is treating objects as latent causes of visual input which the brain uses to make efficient predictions of future scenes.
arXiv Detail & Related papers (2024-03-06T14:19:11Z)
- Neural feels with neural fields: Visuo-tactile perception for in-hand manipulation [57.60490773016364]
We combine vision and touch sensing on a multi-fingered hand to estimate an object's pose and shape during in-hand manipulation.
Our method, NeuralFeels, encodes object geometry by learning a neural field online and jointly tracks it by optimizing a pose graph problem.
Our results demonstrate that touch, at the very least, refines and, at the very best, disambiguates visual estimates during in-hand manipulation.
arXiv Detail & Related papers (2023-12-20T22:36:37Z)
- Learning Navigational Visual Representations with Semantic Map Supervision [85.91625020847358]
We propose a navigation-specific visual representation learning method that contrasts the agent's egocentric views with semantic maps.
Ego²-Map learning transfers the compact and rich information from a map, such as objects, structure, and transitions, to the agent's egocentric representations for navigation.
arXiv Detail & Related papers (2023-07-23T14:01:05Z)
- The Psychophysics of Human Three-Dimensional Active Visuospatial Problem-Solving [12.805267089186533]
Are two physical 3D objects visually the same?
Humans are remarkably good at this task without any training, with a mean accuracy of 93.82%.
No learning effect was observed on accuracy after many trials, but some effect was seen for response time, number of fixations and extent of head movement.
arXiv Detail & Related papers (2023-06-19T19:36:42Z)
- Pathdreamer: A World Model for Indoor Navigation [62.78410447776939]
We introduce Pathdreamer, a visual world model for agents navigating in novel indoor environments.
Given one or more previous visual observations, Pathdreamer generates plausible high-resolution 360° visual observations.
In regions of high uncertainty, Pathdreamer can predict diverse scenes, allowing an agent to sample multiple realistic outcomes.
arXiv Detail & Related papers (2021-05-18T18:13:53Z)
- What Can You Learn from Your Muscles? Learning Visual Representation from Human Interactions [50.435861435121915]
We use human interaction and attention cues to investigate whether we can learn better representations than visual-only ones.
Our experiments show that our "muscly-supervised" representation outperforms MoCo, a state-of-the-art visual-only method.
arXiv Detail & Related papers (2020-10-16T17:46:53Z)
- Active Perception and Representation for Robotic Manipulation [0.8315801422499861]
We present a framework that leverages the benefits of active perception to accomplish manipulation tasks.
Our agent uses viewpoint changes to localize objects, to learn state representations in a self-supervised manner, and to perform goal-directed actions.
Compared to vanilla deep Q-learning algorithms, our model is at least four times more sample-efficient.
arXiv Detail & Related papers (2020-03-15T01:43:51Z)
- Learning Depth With Very Sparse Supervision [57.911425589947314]
This paper explores the idea that perception gets coupled to 3D properties of the world via interaction with the environment.
We train a specialized global-local network architecture with what would be available to a robot interacting with the environment.
Experiments on several datasets show that, when ground truth is available even for just one of the image pixels, the proposed network can learn monocular dense depth estimation up to 22.5% more accurately than state-of-the-art approaches.
arXiv Detail & Related papers (2020-03-02T10:44:13Z)
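To make the sparse-supervision setting in the last entry concrete, the sketch below masks a standard depth-regression loss so that gradients come only from the few annotated pixels. The function name and shapes are assumptions for illustration, not the paper's implementation.

# Hedged sketch (not the paper's implementation): train dense depth
# prediction from ground truth available at only a handful of pixels by
# evaluating the regression loss only where annotations exist.
import torch
import torch.nn.functional as F


def sparse_depth_loss(pred, gt, valid_mask):
    # pred, gt: (B, 1, H, W); valid_mask: (B, 1, H, W) bool, True at the
    # few pixels that carry ground-truth depth.
    diff = F.smooth_l1_loss(pred, gt, reduction='none')
    # Average the per-pixel loss over supervised pixels only.
    return (diff * valid_mask).sum() / valid_mask.sum().clamp(min=1)


# Example: a single supervised pixel per image still yields a training signal.
pred = torch.rand(2, 1, 64, 64, requires_grad=True)
gt = torch.rand(2, 1, 64, 64)
mask = torch.zeros(2, 1, 64, 64, dtype=torch.bool)
mask[:, :, 32, 32] = True
sparse_depth_loss(pred, gt, mask).backward()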