Agents that Listen: High-Throughput Reinforcement Learning with Multiple Sensory Systems
- URL: http://arxiv.org/abs/2107.02195v1
- Date: Mon, 5 Jul 2021 18:00:50 GMT
- Title: Agents that Listen: High-Throughput Reinforcement Learning with Multiple Sensory Systems
- Authors: Shashank Hegde, Anssi Kanervisto, Aleksei Petrenko
- Abstract summary: We introduce a new version of the ViZDoom simulator to create a highly efficient learning environment that provides raw audio observations.
We train our agent to play the full game of Doom and find that it can consistently defeat a traditional vision-based adversary.
- Score: 6.952659395337689
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humans and other intelligent animals evolved highly sophisticated perception
systems that combine multiple sensory modalities. On the other hand,
state-of-the-art artificial agents rely mostly on visual inputs or structured
low-dimensional observations provided by instrumented environments. Learning to
act based on combined visual and auditory inputs is still a new topic of
research that has not been explored beyond simple scenarios. To facilitate
progress in this area, we introduce a new version of the ViZDoom simulator to
create a highly efficient learning environment that provides raw audio
observations.
We study the performance of different model architectures in a series of tasks
that require the agent to recognize sounds and execute instructions given in
natural language. Finally, we train our agent to play the full game of Doom and
find that it can consistently defeat a traditional vision-based adversary. We
are currently in the process of merging the augmented simulator with the main
ViZDoom code repository. Video demonstrations and experiment code can be found
at https://sites.google.com/view/sound-rl.
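Since the augmented simulator exposes raw audio alongside the regular screen buffer, a rough illustration of how an agent would consume such multi-modal observations is sketched below. This is a minimal sketch, not the authors' experiment code, and it assumes the audio API that was later merged into mainline ViZDoom (names such as set_audio_buffer_enabled, SamplingRate, and audio_buffer come from that merged release):

```python
# Minimal sketch (not the authors' released code): reading combined visual and
# audio observations from ViZDoom. Assumes an audio-enabled ViZDoom build,
# i.e. the API that was later merged upstream.
import os
import vizdoom as vzd

game = vzd.DoomGame()
# Any bundled scenario works for illustration; "basic.cfg" ships with ViZDoom.
game.load_config(os.path.join(vzd.scenarios_path, "basic.cfg"))
game.set_screen_format(vzd.ScreenFormat.RGB24)

# Turn on raw audio observations alongside the usual screen buffer.
game.set_audio_buffer_enabled(True)
game.set_audio_sampling_rate(vzd.SamplingRate.SR_22050)
game.set_audio_buffer_size(4)  # audio is returned for the last 4 game tics

game.init()
game.new_episode()

while not game.is_episode_finished():
    state = game.get_state()
    frame = state.screen_buffer  # RGB image array
    audio = state.audio_buffer   # raw stereo PCM samples for the buffered tics
    # A learning agent would feed (frame, audio) into a two-branch policy
    # network; here we simply fire (the third button in basic.cfg is ATTACK).
    game.make_action([0, 0, 1])

game.close()
```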
Related papers
- ViSaRL: Visual Reinforcement Learning Guided by Human Saliency [6.969098096933547]
We introduce Visual Saliency-Guided Reinforcement Learning (ViSaRL).
Using ViSaRL to learn visual representations significantly improves the success rate, sample efficiency, and generalization of an RL agent.
We show that visual representations learned using ViSaRL are robust to various sources of visual perturbations including perceptual noise and scene variations.
arXiv Detail & Related papers (2024-03-16T14:52:26Z)
- Scaling Instructable Agents Across Many Simulated Worlds [70.97268311053328]
Our goal is to develop an agent that can accomplish anything a human can do in any simulated 3D environment.
Our approach focuses on language-driven generality while imposing minimal assumptions.
Our agents interact with environments in real-time using a generic, human-like interface.
arXiv Detail & Related papers (2024-03-13T17:50:32Z)
- Learning of Generalizable and Interpretable Knowledge in Grid-Based Reinforcement Learning Environments [5.217870815854702]
We propose using program synthesis to imitate reinforcement learning policies.
We adapt the state-of-the-art program synthesis system DreamCoder for learning concepts in grid-based environments.
arXiv Detail & Related papers (2023-09-07T11:46:57Z)
- Sonicverse: A Multisensory Simulation Platform for Embodied Household Agents that See and Hear [65.33183123368804]
Sonicverse is a multisensory simulation platform with integrated audio-visual simulation.
It enables embodied AI tasks that need audio-visual perception.
An agent trained in Sonicverse can successfully perform audio-visual navigation in real-world environments.
arXiv Detail & Related papers (2023-06-01T17:24:01Z)
- Evaluating Continual Learning Algorithms by Generating 3D Virtual Environments [66.83839051693695]
Continual learning refers to the ability of humans and animals to incrementally learn over time in a given environment.
We propose to leverage recent advances in 3D virtual environments in order to approach the automatic generation of potentially life-long dynamic scenes with photo-realistic appearance.
A novel element of this paper is that scenes are described in a parametric way, thus allowing the user to fully control the visual complexity of the input stream the agent perceives.
arXiv Detail & Related papers (2021-09-16T10:37:21Z)
- Imitation Learning with Human Eye Gaze via Multi-Objective Prediction [3.5779268406205618]
We propose Gaze Regularized Imitation Learning (GRIL), a novel context-aware imitation learning architecture.
GRIL learns concurrently from both human demonstrations and eye gaze to solve tasks where visual attention provides important context.
We show that GRIL outperforms several state-of-the-art gaze-based imitation learning algorithms, simultaneously learns to predict human visual attention, and generalizes to scenarios not present in the training data.
arXiv Detail & Related papers (2021-02-25T17:13:13Z)
- What Can You Learn from Your Muscles? Learning Visual Representation from Human Interactions [50.435861435121915]
We use human interaction and attention cues to investigate whether we can learn better representations compared to visual-only representations.
Our experiments show that our "muscly-supervised" representation outperforms MoCo, a state-of-the-art visual-only method.
arXiv Detail & Related papers (2020-10-16T17:46:53Z)
- Noisy Agents: Self-supervised Exploration by Predicting Auditory Events [127.82594819117753]
We propose a novel type of intrinsic motivation for Reinforcement Learning (RL) that encourages the agent to understand the causal effect of its actions.
We train a neural network to predict the auditory events and use the prediction errors as intrinsic rewards to guide RL exploration.
Experimental results on Atari games show that our new intrinsic motivation significantly outperforms several state-of-the-art baselines.
arXiv Detail & Related papers (2020-07-27T17:59:08Z)
- See, Hear, Explore: Curiosity via Audio-Visual Association [46.86865495827888]
A common formulation of curiosity-driven exploration uses the difference between the real future and the future predicted by a learned model.
In this paper, we introduce an alternative form of curiosity that rewards novel associations between different senses.
Our approach exploits multiple modalities to provide a stronger signal for more efficient exploration.
arXiv Detail & Related papers (2020-07-07T17:56:35Z)
- VisualEchoes: Spatial Image Representation Learning through Echolocation [97.23789910400387]
Several animal species (e.g., bats, dolphins, and whales) and even visually impaired humans have the remarkable ability to perform echolocation.
We propose a novel interaction-based representation learning framework that learns useful visual features via echolocation.
Our work opens a new path for representation learning for embodied agents, where supervision comes from interacting with the physical world.
arXiv Detail & Related papers (2020-05-04T16:16:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all of its content) and is not responsible for any consequences.