Pay Self-Attention to Audio-Visual Navigation
- URL: http://arxiv.org/abs/2210.01353v2
- Date: Wed, 5 Oct 2022 06:23:53 GMT
- Title: Pay Self-Attention to Audio-Visual Navigation
- Authors: Yinfeng Yu, Lele Cao, Fuchun Sun, Xiaohong Liu and Liejun Wang
- Abstract summary: We propose an end-to-end framework that learns to chase a moving audio target using a context-aware audio-visual fusion strategy.
Our thorough experiments validate the superior performance of FSAAVN in comparison with state-of-the-art methods.
- Score: 24.18976027602831
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Audio-visual embodied navigation, a topic of growing research interest, aims to train a robot to reach an audio target using egocentric visual input (from sensors mounted on the robot) and audio input (emitted from the target). The audio-visual fusion strategy is naturally important to navigation performance, yet state-of-the-art methods still simply concatenate the visual and audio features, potentially ignoring the direct impact of context. Moreover, existing approaches require either phase-wise training or additional aids (e.g., a topology graph or sound semantics). To date, work addressing the more challenging setup with moving target(s) remains rare. We therefore propose an end-to-end framework, FSAAVN (feature self-attention audio-visual navigation), that learns to chase a moving audio target using a context-aware audio-visual fusion strategy implemented as a self-attention module. Our thorough experiments validate the superior performance (both quantitative and qualitative) of FSAAVN compared with state-of-the-art methods, and also provide unique insights into the choice of visual modalities, visual/audio encoder backbones, and fusion patterns.
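The core idea is to replace plain concatenation of visual and audio features with a context-aware fusion implemented as a self-attention module. The abstract does not spell out the exact architecture, so the following is a minimal PyTorch sketch of that general idea; the feature dimensions, the class name SelfAttentionFusion, and the residual/LayerNorm details are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the paper's exact architecture): fusing audio and visual
# features with multi-head self-attention instead of plain concatenation.
# Dimensions and module names are illustrative assumptions.
import torch
import torch.nn as nn


class SelfAttentionFusion(nn.Module):
    """Fuse per-modality feature vectors by letting them attend to each other."""

    def __init__(self, feat_dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(feat_dim)

    def forward(self, visual_feat: torch.Tensor, audio_feat: torch.Tensor) -> torch.Tensor:
        # Stack the two modalities as a length-2 "token" sequence: (batch, 2, feat_dim).
        tokens = torch.stack([visual_feat, audio_feat], dim=1)
        attended, _ = self.attn(tokens, tokens, tokens)  # context-aware cross-modal mixing
        fused = self.norm(tokens + attended)             # residual connection + layer norm
        # Flatten back to a single state vector for a downstream navigation policy.
        return fused.flatten(start_dim=1)                # (batch, 2 * feat_dim)


if __name__ == "__main__":
    fusion = SelfAttentionFusion(feat_dim=512, num_heads=8)
    v = torch.randn(4, 512)    # e.g. output of a visual encoder (CNN backbone)
    a = torch.randn(4, 512)    # e.g. output of an audio (spectrogram) encoder
    print(fusion(v, a).shape)  # torch.Size([4, 1024])
```

Compared with direct concatenation, this lets each modality reweight the other based on context (e.g., relying more on audio when the target is out of view) before the fused vector is passed to the policy.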
Related papers
- Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues [75.73217916395386]
We propose a Bidirectional Audio-Visual Decoder (BAVD) with integrated bidirectional bridges.
This interaction narrows the modality imbalance, facilitating more effective learning of integrated audio-visual representations.
We also present a strategy for audio-visual frame-wise synchrony as fine-grained guidance of BAVD.
arXiv Detail & Related papers (2024-02-04T03:02:35Z) - Prompting Segmentation with Sound Is Generalizable Audio-Visual Source Localizer [22.846623384472377]
We introduce the encoder-prompt-decoder paradigm to decode localization from the fused audio-visual feature.
Specifically, we first propose to construct Semantic-aware Audio Prompt (SAP) to help the visual foundation model focus on sounding objects.
We develop a Correlation Adapter (ColA) to keep minimal training efforts as well as maintain adequate knowledge of the visual foundation model.
arXiv Detail & Related papers (2023-09-13T05:43:35Z) - Language-Guided Audio-Visual Source Separation via Trimodal Consistency [64.0580750128049]
A key challenge in this task is learning to associate the linguistic description of a sound-emitting object to its visual features and the corresponding components of the audio waveform.
We adapt off-the-shelf vision-language foundation models to provide pseudo-target supervision via two novel loss functions.
We demonstrate the effectiveness of our self-supervised approach on three audio-visual separation datasets.
arXiv Detail & Related papers (2023-03-28T22:45:40Z) - Self-supervised Contrastive Learning for Audio-Visual Action Recognition [7.188231323934023]
The underlying correlation between audio and visual modalities can be exploited as supervisory information for unlabeled videos.
We propose an end-to-end self-supervised framework, Audio-Visual Contrastive Learning, to learn discriminative audio-visual representations for action recognition.
arXiv Detail & Related papers (2022-04-28T10:01:36Z) - Bio-Inspired Audio-Visual Cues Integration for Visual Attention Prediction [15.679379904130908]
Visual Attention Prediction (VAP) methods simulate the human selective attention mechanism to perceive the scene.
A bio-inspired audio-visual cues integration method is proposed for the VAP task, which explores the audio modality to better predict the visual attention map.
Experiments are conducted on six challenging audiovisual eye-tracking datasets, including DIEM, AVAD, Coutrot1, Coutrot2, SumMe, and ETMD.
arXiv Detail & Related papers (2021-09-17T06:49:43Z) - Semantic Audio-Visual Navigation [93.12180578267186]
We introduce semantic audio-visual navigation, where objects in the environment make sounds consistent with their semantic meaning.
We propose a transformer-based model to tackle this new semantic AudioGoal task.
Our method strongly outperforms existing audio-visual navigation methods by learning to associate semantic, acoustic, and visual cues.
arXiv Detail & Related papers (2020-12-21T18:59:04Z) - Learning to Set Waypoints for Audio-Visual Navigation [89.42192208471735]
In audio-visual navigation, an agent intelligently travels through a complex, unmapped 3D environment using both sights and sounds to find a sound source.
Existing models learn to act at a fixed granularity of agent motion and rely on simple recurrent aggregations of the audio observations.
We introduce a reinforcement learning approach to audio-visual navigation with two key novel elements.
arXiv Detail & Related papers (2020-08-21T18:00:33Z) - Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning [17.6311804187027]
An underlying correlation between audio and visual events can be utilized as free supervision to train a neural network.
We propose a novel self-supervised framework with co-attention mechanism to learn generic cross-modal representations from unlabelled videos.
Experiments show that our model achieves state-of-the-art performance on the pretext task while having fewer parameters compared with existing methods.
arXiv Detail & Related papers (2020-08-13T10:08:12Z) - Self-Supervised Learning of Audio-Visual Objects from Video [108.77341357556668]
We introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time.
We demonstrate the effectiveness of the audio-visual object embeddings that our model learns by using them for four downstream speech-oriented tasks.
arXiv Detail & Related papers (2020-08-10T16:18:01Z) - Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision [63.564385139097624]
We propose a method to learn self-supervised speech representations from the raw audio waveform.
We train a raw audio encoder by combining audio-only self-supervision (by predicting informative audio attributes) with visual self-supervision (by generating talking faces from audio)
Our results demonstrate the potential of multimodal self-supervision in audiovisual speech for learning good audio representations.
arXiv Detail & Related papers (2020-07-08T14:07:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.