Semantic Audio-Visual Navigation
- URL: http://arxiv.org/abs/2012.11583v2
- Date: Wed, 7 Apr 2021 01:59:26 GMT
- Title: Semantic Audio-Visual Navigation
- Authors: Changan Chen, Ziad Al-Halah, Kristen Grauman
- Abstract summary: We introduce semantic audio-visual navigation, where objects in the environment make sounds consistent with their semantic meaning.
We propose a transformer-based model to tackle this new semantic AudioGoal task.
Our method strongly outperforms existing audio-visual navigation methods by learning to associate semantic, acoustic, and visual cues.
- Score: 93.12180578267186
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent work on audio-visual navigation assumes a constantly-sounding target
and restricts the role of audio to signaling the target's position. We
introduce semantic audio-visual navigation, where objects in the environment
make sounds consistent with their semantic meaning (e.g., toilet flushing, door
creaking) and acoustic events are sporadic or short in duration. We propose a
transformer-based model to tackle this new semantic AudioGoal task,
incorporating an inferred goal descriptor that captures both spatial and
semantic properties of the target. Our model's persistent multimodal memory
enables it to reach the goal even long after the acoustic event stops. In
support of the new task, we also expand the SoundSpaces audio simulations to
provide semantically grounded sounds for an array of objects in Matterport3D.
Our method strongly outperforms existing audio-visual navigation methods by
learning to associate semantic, acoustic, and visual cues.
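For readers who want a concrete picture of the kind of model the abstract describes, here is a minimal sketch in PyTorch. It is not the authors' implementation: the class name, feature dimensions, memory size, and the single fused goal head are all illustrative assumptions; per the abstract, the real goal descriptor captures both the spatial and semantic properties of the sounding target, which the sketch collapses into one vector for brevity.

```python
import torch
import torch.nn as nn

class SemanticAudioGoalPolicy(nn.Module):
    """Illustrative sketch (not the paper's code): fuse visual and audio
    embeddings with an inferred goal descriptor, keep them in a persistent
    memory, and decode the next action with a transformer encoder."""

    def __init__(self, feat_dim=128, n_actions=4, n_heads=4, n_layers=2, mem_size=150):
        super().__init__()
        self.mem_size = mem_size
        # Stand-ins for the visual/audio encoders; 512-d input features are assumed.
        self.visual_enc = nn.Linear(512, feat_dim)
        self.audio_enc = nn.Linear(512, feat_dim)
        # Goal descriptor inferred from the current audio feature.
        self.goal_head = nn.Linear(feat_dim, feat_dim)
        # Persistent multimodal memory attended to by a transformer encoder.
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=n_heads,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.action_head = nn.Linear(feat_dim, n_actions)
        self.register_buffer("memory", torch.zeros(1, 0, feat_dim))

    def forward(self, visual_feat, audio_feat):
        v = self.visual_enc(visual_feat)      # (1, feat_dim)
        a = self.audio_enc(audio_feat)        # (1, feat_dim)
        goal = self.goal_head(a)              # inferred goal descriptor
        step = (v + a + goal).unsqueeze(1)    # (1, 1, feat_dim)
        # Append to the persistent memory: past steps keep informing the
        # policy even after the sound has stopped. (A real training loop
        # would detach or manage this memory explicitly.)
        self.memory = torch.cat([self.memory, step], dim=1)[:, -self.mem_size:]
        ctx = self.transformer(self.memory)   # attend over the whole memory
        return self.action_head(ctx[:, -1])   # logits for the next action


# Usage: one decision step with random stand-in features.
policy = SemanticAudioGoalPolicy()
action_logits = policy(torch.randn(1, 512), torch.randn(1, 512))
print(action_logits.shape)  # torch.Size([1, 4])
```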
Related papers
- Label-anticipated Event Disentanglement for Audio-Visual Video Parsing [61.08434062821899]
We introduce a new decoding paradigm, label semantic-based projection (LEAP).
LEAP works by iteratively projecting encoded latent features of audio/visual segments onto semantically independent label embeddings.
To facilitate the LEAP paradigm, we propose a semantic-aware optimization strategy, which includes a novel audio-visual semantic similarity loss function.
arXiv Detail & Related papers (2024-07-11T01:57:08Z)
- CAVEN: An Embodied Conversational Agent for Efficient Audio-Visual Navigation in Noisy Environments [41.21509045214965]
CAVEN is a framework in which the agent may interact with a human/oracle for solving the task of navigating to an audio goal.
Our results show that our fully-conversational approach leads to nearly an order-of-magnitude improvement in success rate.
arXiv Detail & Related papers (2023-06-06T22:32:49Z)
- Pay Self-Attention to Audio-Visual Navigation [24.18976027602831]
We propose an end-to-end framework to learn chasing after a moving audio target using a context-aware audio-visual fusion strategy.
Our thorough experiments validate the superior performance of FSAAVN in comparison with the state of the art.
arXiv Detail & Related papers (2022-10-04T03:42:36Z)
- Visual Acoustic Matching [92.91522122739845]
We introduce the visual acoustic matching task, in which an audio clip is transformed to sound like it was recorded in a target environment.
Given an image of the target environment and a waveform for the source audio, the goal is to re-synthesize the audio to match the target room acoustics as suggested by its visible geometry and materials.
arXiv Detail & Related papers (2022-02-14T17:05:22Z)
- Active Audio-Visual Separation of Dynamic Sound Sources [93.97385339354318]
We propose a reinforcement learning agent equipped with a novel transformer memory that learns motion policies to control its camera and microphone.
We show that our model is able to learn efficient behavior to carry out continuous separation of a time-varying audio target.
arXiv Detail & Related papers (2022-02-02T02:03:28Z)
- Dynamical Audio-Visual Navigation: Catching Unheard Moving Sound Sources in Unmapped 3D Environments [0.0]
We introduce the novel dynamic audio-visual navigation benchmark in which an embodied AI agent must catch a moving sound source in an unmapped environment in the presence of distractors and noisy sounds.
Our approach outperforms the current state-of-the-art with better generalization to unheard sounds and better robustness to noisy scenarios.
arXiv Detail & Related papers (2022-01-12T03:08:03Z)
- Catch Me If You Hear Me: Audio-Visual Navigation in Complex Unmapped Environments with Moving Sounds [5.002862602915434]
Audio-visual navigation combines sight and hearing to navigate to a sound-emitting source in an unmapped environment.
We propose the novel dynamic audio-visual navigation benchmark, which requires the agent to catch a moving sound source in an environment with noisy and distracting sounds.
We demonstrate that our approach consistently outperforms the current state-of-the-art by a large margin across all tasks of moving sounds, unheard sounds, and noisy environments.
arXiv Detail & Related papers (2021-11-29T15:17:46Z)
- Move2Hear: Active Audio-Visual Source Separation [90.16327303008224]
We introduce the active audio-visual source separation problem, where an agent must move intelligently in order to better isolate the sounds coming from an object of interest.
We introduce a reinforcement learning approach that trains movement policies controlling the agent's camera and microphone placement over time.
We demonstrate our model's ability to find minimal movement sequences with maximal payoff for audio source separation.
arXiv Detail & Related papers (2021-05-15T04:58:08Z)
- Learning to Set Waypoints for Audio-Visual Navigation [89.42192208471735]
In audio-visual navigation, an agent intelligently travels through a complex, unmapped 3D environment using both sight and sound to find a sound source.
Existing models learn to act at a fixed granularity of agent motion and rely on simple recurrent aggregations of the audio observations.
We introduce a reinforcement learning approach to audio-visual navigation with two key novel elements.
arXiv Detail & Related papers (2020-08-21T18:00:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.