Audio-Guided Visual Perception for Audio-Visual Navigation
- URL: http://arxiv.org/abs/2510.11760v1
- Date: Mon, 13 Oct 2025 05:06:45 GMT
- Title: Audio-Guided Visual Perception for Audio-Visual Navigation
- Authors: Yi Wang, Yinfeng Yu, Fuchun Sun, Liejun Wang, Wendong Zheng
- Abstract summary: Audio-Visual Embodied Navigation aims to enable agents to autonomously navigate to sound sources in unknown 3D environments using auditory cues. Current methods generalize poorly to unheard sound sources, a limitation that stems from the lack of explicit alignment mechanisms between auditory signals and corresponding visual regions. We propose the AGVP framework, which transforms sound from policy-memorable acoustic fingerprint cues into spatial guidance.
- Score: 38.672625073122276
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Audio-Visual Embodied Navigation aims to enable agents to autonomously navigate to sound sources in unknown 3D environments using auditory cues. While current audio-visual navigation (AVN) methods excel on in-distribution sound sources, they exhibit poor cross-source generalization: navigation success rates plummet and search paths become excessively long when agents encounter unheard sounds or unseen environments. This limitation stems from the lack of explicit alignment mechanisms between auditory signals and corresponding visual regions. Policies tend to memorize spurious "acoustic fingerprint-scenario" correlations during training, leading to blind exploration when exposed to novel sound sources. To address this, we propose the AGVP framework, which transforms sound from policy-memorable acoustic fingerprint cues into spatial guidance. The framework first extracts global auditory context via audio self-attention, then uses this context as queries to guide visual feature attention, highlighting sound-source-related regions at the feature level. Subsequent temporal modeling and policy optimization are then performed. This design, centered on interpretable cross-modal alignment and region reweighting, reduces dependency on specific acoustic fingerprints. Experimental results demonstrate that AGVP improves both navigation efficiency and robustness while achieving superior cross-scenario generalization on previously unheard sounds.
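The audio-guided attention step the abstract describes can be sketched in a few lines of PyTorch: audio self-attention pools a global auditory context, that context serves as the query in a cross-attention over visual region features, and the resulting attention weights reweight the regions. This is a minimal sketch based only on the abstract; all dimensions, layer choices, and names (e.g., `AudioGuidedVisualAttention`) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class AudioGuidedVisualAttention(nn.Module):
    """Sketch of audio-guided visual perception: audio self-attention builds
    a global auditory context that queries visual region features, and the
    resulting attention weights reweight sound-source-related regions."""

    def __init__(self, audio_dim=128, visual_dim=512, embed_dim=256, num_heads=4):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, embed_dim)
        self.visual_proj = nn.Linear(visual_dim, embed_dim)
        # Self-attention over audio tokens (e.g., spectrogram patches).
        self.audio_self_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        # Cross-attention: pooled audio context queries the visual feature map.
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, audio_tokens, visual_tokens):
        # audio_tokens: (B, T_a, audio_dim); visual_tokens: (B, N_v, visual_dim)
        a = self.audio_proj(audio_tokens)
        v = self.visual_proj(visual_tokens)
        # 1) Global auditory context via audio self-attention, mean-pooled
        #    into a single query token.
        a_ctx, _ = self.audio_self_attn(a, a, a)
        query = a_ctx.mean(dim=1, keepdim=True)            # (B, 1, embed_dim)
        # 2) Audio context as the query over visual regions; the attention
        #    weights act as a per-region relevance map.
        attended, weights = self.cross_attn(query, v, v)   # weights: (B, 1, N_v)
        # 3) Reweight visual regions by their audio relevance before the
        #    downstream temporal model and navigation policy (not shown).
        v_reweighted = v * weights.transpose(1, 2)         # (B, N_v, embed_dim)
        return attended, v_reweighted


# Example usage with dummy tensors:
model = AudioGuidedVisualAttention()
audio = torch.randn(2, 16, 128)   # e.g., binaural spectrogram tokens
visual = torch.randn(2, 49, 512)  # e.g., a 7x7 grid of visual region features
context, regions = model(audio, visual)
print(context.shape, regions.shape)  # (2, 1, 256) and (2, 49, 256)
```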
Related papers
- Robust Audio-Visual Segmentation via Audio-Guided Visual Convergent Alignment [26.399212357764576]
Accurately localizing audible objects based on audio-visual cues is the core objective of audio-visual segmentation. We propose a novel framework with two primary components: an audio-guided modality alignment (AMA) module and an uncertainty estimation (UE) module. AMA performs audio-visual interactions within multiple groups and consolidates group features into compact representations based on their responsiveness to audio cues. UE integrates spatial and temporal information to identify high-uncertainty regions caused by frequent changes in sound state.
arXiv Detail & Related papers (2025-03-17T05:48:22Z)
- Locality-aware Cross-modal Correspondence Learning for Dense Audio-Visual Events Localization [50.122441710500055]
We present LoCo, a Locality-aware cross-modal Correspondence learning framework for Dense Audio-Visual Event localization (DAVE). LoCo applies Local Correspondence Feature (LCF) Modulation, which encourages unimodal encoders to focus on modality-shared semantics. We further customize Local Adaptive Cross-modal (LAC) Interaction, which dynamically adjusts attention regions in a data-driven manner.
arXiv Detail & Related papers (2024-09-12T11:54:25Z)
- Visual Sound Localization in the Wild by Cross-Modal Interference Erasing [90.21476231683008]
In real-world scenarios, audio is usually contaminated by off-screen sound and background noise.
We propose the Interference Eraser (IEr) framework, which tackles the problem of audio-visual sound source localization in the wild.
arXiv Detail & Related papers (2022-02-13T21:06:19Z)
- Catch Me If You Hear Me: Audio-Visual Navigation in Complex Unmapped Environments with Moving Sounds [5.002862602915434]
Audio-visual navigation combines sight and hearing to navigate to a sound-emitting source in an unmapped environment.
We propose a novel dynamic audio-visual navigation benchmark that requires the agent to catch a moving sound source in an environment with noisy and distracting sounds.
We demonstrate that our approach consistently outperforms the current state of the art by a large margin across all tasks: moving sounds, unheard sounds, and noisy environments.
arXiv Detail & Related papers (2021-11-29T15:17:46Z)
- Joint Learning of Visual-Audio Saliency Prediction and Sound Source Localization on Multi-face Videos [101.83513408195692]
We propose a multitask learning method for visual-audio saliency prediction and sound source localization on multi-face videos.
The proposed method outperforms 12 state-of-the-art saliency prediction methods, and achieves competitive results in sound source localization.
arXiv Detail & Related papers (2021-11-05T14:35:08Z)
- Unsupervised Sound Localization via Iterative Contrastive Learning [106.56167882750792]
We propose an iterative contrastive learning framework that requires no data annotations.
Pseudo-labels are then used to learn the correlation between the visual and audio signals sampled from the same video.
Our iterative strategy gradually encourages the localization of the sounding objects and reduces the correlation between the non-sounding regions and the reference audio.
arXiv Detail & Related papers (2021-04-01T07:48:29Z)
- Exploiting Attention-based Sequence-to-Sequence Architectures for Sound Event Localization [113.19483349876668]
This paper proposes a novel approach to sound event localization using an attention-based sequence-to-sequence model.
It yields superior localization performance compared to state-of-the-art methods in both anechoic and reverberant conditions.
arXiv Detail & Related papers (2021-02-28T07:52:20Z)
- Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain [103.3388198420822]
Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization.
This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions (a minimal sketch of this idea follows the list).
A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models.
arXiv Detail & Related papers (2021-02-23T09:59:31Z)
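To make the last entry's region-wise fusion idea concrete, here is a toy NumPy sketch of dynamic stream weighting: each spatial region gets its own weight that blends audio- and video-based localization scores. The region count, scores, and weight values are fabricated for illustration; in the paper the weights are estimated dynamically rather than sampled at random.

```python
import numpy as np

rng = np.random.default_rng(0)
num_regions = 4  # hypothetical coarse spatial regions around the sensors

# Hypothetical per-region speaker-position log-likelihoods from each modality.
log_lik_audio = rng.normal(size=num_regions)
log_lik_video = rng.normal(size=num_regions)

# Dynamic, region-specific stream weights in [0, 1]; sampled at random here
# purely for illustration.
stream_weights = rng.uniform(size=num_regions)

# Weighted log-linear fusion, one weight per spatial region.
fused = stream_weights * log_lik_audio + (1.0 - stream_weights) * log_lik_video
print("Estimated speaker region:", int(np.argmax(fused)))
```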