Towards Generalisable Audio Representations for Audio-Visual Navigation
- URL: http://arxiv.org/abs/2206.00393v1
- Date: Wed, 1 Jun 2022 11:00:07 GMT
- Title: Towards Generalisable Audio Representations for Audio-Visual Navigation
- Authors: Shunqi Mao, Chaoyi Zhang, Heng Wang, Weidong Cai
- Abstract summary: In audio-visual navigation (AVN), an intelligent agent needs to navigate to a constantly sound-making object in complex 3D environments.
We propose a contrastive learning-based method to tackle this challenge by regularising the audio encoder.
- Score: 18.738943602529805
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In audio-visual navigation (AVN), an intelligent agent needs to navigate to a
constantly sound-making object in complex 3D environments based on its audio
and visual perceptions. While existing methods attempt to improve navigation
performance with carefully designed path planning or intricate task settings,
none has improved the model's generalisation to unheard sounds while keeping
the task settings unchanged. We thus propose a contrastive learning-based
method to tackle this challenge by regularising the audio encoder, so that
sound-agnostic, goal-driven latent representations can be learnt from audio
signals of different classes. In addition, we consider two data
augmentation strategies to enrich the training sounds. We demonstrate that our
designs can easily be plugged into existing AVN frameworks to obtain an
immediate performance gain (13.4%$\uparrow$ in SPL on Replica and
12.2%$\uparrow$ in SPL on MP3D). Our project is available at
https://AV-GeN.github.io/.
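To make the idea above concrete, the following is a minimal sketch (not the authors' code) of how a contrastive regulariser could be attached to an AVN audio encoder. It assumes PyTorch, a toy binaural-spectrogram encoder, and an InfoNCE loss in which two clips rendered from the same agent-to-goal displacement but with different sound classes form a positive pair; the encoder architecture, batch construction, and temperature are illustrative assumptions.

```python
# Hypothetical sketch of contrastive regularisation for an AVN audio encoder.
# Assumptions (not from the paper): binaural spectrograms of shape (B, 2, 65, 26),
# a small CNN encoder, and an InfoNCE loss whose positives are same-goal-direction
# clips of *different* sound classes, pushing the latent to be sound-agnostic
# but goal-driven.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioEncoder(nn.Module):
    """Toy binaural-spectrogram encoder producing a unit-norm latent vector."""

    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(2, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, embed_dim)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        h = self.conv(spec).flatten(1)
        return F.normalize(self.proj(h), dim=-1)


def info_nce(anchor: torch.Tensor, positive: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """InfoNCE: each anchor's positive is the same-index row of `positive`;
    every other row in the batch acts as a negative."""
    logits = anchor @ positive.t() / tau          # (B, B) similarity matrix
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    encoder = AudioEncoder()
    # Two clips per batch item: different sound classes, same (hypothetical)
    # agent-to-goal displacement -> they should map to nearby latents.
    view_a = torch.randn(16, 2, 65, 26)
    view_b = torch.randn(16, 2, 65, 26)
    loss = info_nce(encoder(view_a), encoder(view_b))
    loss.backward()
    print(f"contrastive regulariser: {loss.item():.3f}")
```

In practice a term like this would be added with a small weight to the navigation policy's reinforcement-learning objective, together with the two sound-augmentation strategies mentioned in the abstract; the exact pairing and augmentation details are given on the project page linked above.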
Related papers
- Stepping Stones: A Progressive Training Strategy for Audio-Visual Semantic Segmentation [7.124066540020968]
Audio-Visual Segmentation (AVS) aims to achieve pixel-level localization of sound sources in videos, while Audio-Visual Semantic Segmentation (AVSS) pursues semantic understanding of audio-visual scenes.
Previous methods have struggled to handle this mixture of objectives in end-to-end training, resulting in insufficient learning and sub-optimal results.
We propose a two-stage training strategy called Stepping Stones, which decomposes the AVSS task into two simpler subtasks, from localization to semantic understanding, each fully optimized in its own stage to achieve step-by-step global optimization (see the two-stage training sketch after this related-papers list).
arXiv Detail & Related papers (2024-07-16T15:08:30Z)
- Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time [73.7845280328535]
We present Meerkat, an audio-visual LLM equipped with a fine-grained understanding of image and audio.
Meerkat can tackle challenging tasks such as audio referred image grounding, image guided audio temporal localization, and audio-visual fact-checking.
We achieve state-of-the-art performance on all these downstream tasks with a relative improvement of up to 37.12%.
arXiv Detail & Related papers (2024-07-01T23:32:25Z)
- Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues [75.73217916395386]
We propose a Bidirectional Audio-Visual Decoder (BAVD) with integrated bidirectional bridges.
This interaction narrows the modality imbalance, facilitating more effective learning of integrated audio-visual representations.
We also present a strategy for audio-visual frame-wise synchrony as fine-grained guidance of BAVD.
arXiv Detail & Related papers (2024-02-04T03:02:35Z)
- AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR [79.21857972093332]
We present AVFormer, a method for augmenting audio-only models with visual information, at the same time performing lightweight domain adaptation.
We show that these can be trained on a small amount of weakly labelled video data with minimum additional training time and parameters.
We also introduce a simple curriculum scheme during training which we show is crucial to enable the model to jointly process audio and visual information effectively.
arXiv Detail & Related papers (2023-03-29T07:24:28Z)
- LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders [53.30016986953206]
We propose LA-VocE, a new two-stage approach that predicts mel-spectrograms from noisy audio-visual speech via a transformer-based architecture.
We train and evaluate our framework on thousands of speakers and 11+ different languages, and study our model's ability to adapt to different levels of background noise and speech interference.
arXiv Detail & Related papers (2022-11-20T15:27:55Z)
- AVLEN: Audio-Visual-Language Embodied Navigation in 3D Environments [60.98664330268192]
We present AVLEN -- an interactive agent for Audio-Visual-Language Embodied Navigation.
The goal of AVLEN is to localize an audio event via navigating the 3D visual world.
To realize these abilities, AVLEN uses a multimodal hierarchical reinforcement learning backbone.
arXiv Detail & Related papers (2022-10-14T16:35:06Z)
- Dynamical Audio-Visual Navigation: Catching Unheard Moving Sound Sources in Unmapped 3D Environments [0.0]
We introduce the novel dynamic audio-visual navigation benchmark in which an embodied AI agent must catch a moving sound source in an unmapped environment in the presence of distractors and noisy sounds.
Our approach outperforms the current state-of-the-art with better generalization to unheard sounds and better robustness to noisy scenarios.
arXiv Detail & Related papers (2022-01-12T03:08:03Z)
- Catch Me If You Hear Me: Audio-Visual Navigation in Complex Unmapped Environments with Moving Sounds [5.002862602915434]
Audio-visual navigation combines sight and hearing to navigate to a sound-emitting source in an unmapped environment.
We propose a novel dynamic audio-visual navigation benchmark that requires the agent to catch a moving sound source in an environment with noisy and distracting sounds.
We demonstrate that our approach consistently outperforms the current state-of-the-art by a large margin across all tasks of moving sounds, unheard sounds, and noisy environments.
arXiv Detail & Related papers (2021-11-29T15:17:46Z)
- Learning to Set Waypoints for Audio-Visual Navigation [89.42192208471735]
In audio-visual navigation, an agent intelligently travels through a complex, unmapped 3D environment using both sight and sound to find a sound source.
Existing models learn to act at a fixed granularity of agent motion and rely on simple recurrent aggregations of the audio observations.
We introduce a reinforcement learning approach to audio-visual navigation with two key novel elements.
arXiv Detail & Related papers (2020-08-21T18:00:33Z)
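As referenced in the Stepping Stones entry above, below is a minimal, hypothetical sketch of a generic two-stage schedule of that kind: stage one optimises a localisation head alone, and stage two freezes it and trains a semantic head conditioned on its masks. The module shapes, losses, and freezing policy are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical two-stage schedule in the spirit of the "Stepping Stones" summary:
# stage 1 learns binary sound-source localisation, stage 2 reuses those masks as
# a prior for semantic prediction. All shapes and losses are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 8  # hypothetical number of semantic categories


class Localiser(nn.Module):
    def __init__(self):
        super().__init__()
        self.head = nn.Conv2d(16, 1, kernel_size=1)  # fused AV features -> mask logit

    def forward(self, feats):
        return self.head(feats)                      # (B, 1, H, W)


class SemanticHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.head = nn.Conv2d(16 + 1, NUM_CLASSES, kernel_size=1)

    def forward(self, feats, mask_logit):
        # Condition the semantic head on the stage-1 localisation mask.
        return self.head(torch.cat([feats, mask_logit.sigmoid()], dim=1))


feats = torch.randn(4, 16, 32, 32)                   # stand-in fused AV features
mask_gt = (torch.rand(4, 1, 32, 32) > 0.5).float()   # toy localisation targets
sem_gt = torch.randint(0, NUM_CLASSES, (4, 32, 32))  # toy semantic targets

loc, sem = Localiser(), SemanticHead()

# Stage 1: optimise localisation only.
opt1 = torch.optim.Adam(loc.parameters(), lr=1e-3)
for _ in range(5):
    opt1.zero_grad()
    F.binary_cross_entropy_with_logits(loc(feats), mask_gt).backward()
    opt1.step()

# Stage 2: freeze the localiser, optimise the semantic head on top of it.
for p in loc.parameters():
    p.requires_grad_(False)
opt2 = torch.optim.Adam(sem.parameters(), lr=1e-3)
for _ in range(5):
    opt2.zero_grad()
    F.cross_entropy(sem(feats, loc(feats)), sem_gt).backward()
    opt2.step()
```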
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.