Audio-Visual Spatial Integration and Recursive Attention for Robust
Sound Source Localization
- URL: http://arxiv.org/abs/2308.06087v2
- Date: Fri, 18 Aug 2023 03:19:52 GMT
- Title: Audio-Visual Spatial Integration and Recursive Attention for Robust
Sound Source Localization
- Authors: Sung Jin Um, Dongjin Kim, Jung Uk Kim
- Abstract summary: Humans utilize both audio and visual modalities as spatial cues to locate sound sources.
We propose an audio-visual spatial integration network that integrates spatial cues from both modalities.
Our method can perform more robust sound source localization.
- Score: 13.278494654137138
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The objective of the sound source localization task is to enable machines to
detect the location of sound-making objects within a visual scene. While the
audio modality provides spatial cues to locate the sound source, existing
approaches use audio only in an auxiliary role to compare spatial regions of
the visual modality. Humans, on the other hand, utilize both audio and visual
modalities as spatial cues to locate sound sources. In this paper, we propose
an audio-visual spatial integration network that integrates spatial cues from
both modalities to mimic human behavior when detecting sound-making objects.
Additionally, we introduce a recursive attention network to mimic human
behavior of iterative focusing on objects, resulting in more accurate attention
regions. To effectively encode spatial information from both modalities, we
propose audio-visual pair matching loss and spatial region alignment loss. By
utilizing the spatial cues of audio-visual modalities and recursively focusing
on objects, our method achieves more robust sound source localization.
Comprehensive experimental results on the Flickr SoundNet and VGG-Sound Source
datasets demonstrate the superiority of our proposed method over existing
approaches. Our code is available at: https://github.com/VisualAIKHU/SIRA-SSL
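The recursive attention idea described in the abstract (iteratively re-focusing on candidate regions until the attention map sharpens) can be sketched in plain Python. This is a minimal illustration under assumed details, not the authors' SIRA-SSL implementation: the cosine similarity, the query-update rule, and the number of recursion steps are all placeholder choices.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def recursive_attention(audio_emb, visual_regions, steps=3):
    """Iteratively refine an attention map over visual regions.

    Each step: score every region against the current query, softmax
    the scores into an attention map, then mix the attended visual
    context back into the query so the next pass focuses more sharply.
    """
    query = list(audio_emb)
    attn = None
    for _ in range(steps):
        scores = [cosine(query, r) for r in visual_regions]
        attn = softmax(scores)
        # Attended visual context: attention-weighted sum of regions.
        context = [sum(a * r[i] for a, r in zip(attn, visual_regions))
                   for i in range(len(query))]
        # Blend the context into the query for the next refinement step.
        query = [(q + c) / 2 for q, c in zip(query, context)]
    return attn
```

With a visual region whose embedding aligns with the audio embedding, the attention map concentrates on that region across recursion steps; a single step reduces to ordinary audio-to-visual attention.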
Related papers
- AV-GS: Learning Material and Geometry Aware Priors for Novel View Acoustic Synthesis [62.33446681243413]
Novel view acoustic synthesis aims to render audio at any target viewpoint, given mono audio emitted by a sound source in a 3D scene.
Existing methods have proposed NeRF-based implicit models to exploit visual cues as a condition for synthesizing audio.
We propose a novel Audio-Visual Gaussian Splatting (AV-GS) model to characterize the entire scene environment.
Experiments validate the superiority of our AV-GS over existing alternatives on the real-world RWAS and simulation-based SoundSpaces datasets.
arXiv Detail & Related papers (2024-06-13T08:34:12Z)
- Learning to Visually Localize Sound Sources from Mixtures without Prior Source Knowledge [14.801564966406486]
The goal of the multi-sound source localization task is to localize each sound source in a mixture individually.
We present a novel multi-sound source localization method that can perform localization without prior knowledge of the number of sound sources.
arXiv Detail & Related papers (2024-03-26T06:27:50Z)
- LAVSS: Location-Guided Audio-Visual Spatial Audio Separation [52.44052357829296]
We propose a location-guided audio-visual spatial audio separator.
The proposed LAVSS is inspired by the correlation between spatial audio and visual location.
In addition, we adopt a pre-trained monaural separator to transfer knowledge from rich mono sounds to boost spatial audio separation.
arXiv Detail & Related papers (2023-10-31T13:30:24Z)
- Sound Source Localization is All about Cross-Modal Alignment [53.957081836232206]
Cross-modal semantic understanding is essential for genuine sound source localization.
We propose a joint task with sound source localization to better learn the interaction between audio and visual modalities.
Our method outperforms the state-of-the-art approaches in both sound source localization and cross-modal retrieval.
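Cross-modal alignment of this kind is typically trained with a contrastive objective over paired audio and visual embeddings. The sketch below is a generic symmetric InfoNCE-style loss, not the specific loss from this paper; the dot-product similarity and temperature value are assumptions for illustration.

```python
import math

def info_nce(audio_embs, visual_embs, temperature=0.1):
    """Symmetric contrastive alignment loss for paired embeddings.

    The i-th audio embedding should match the i-th visual embedding;
    all other pairings in the batch act as negatives.
    """
    def logits(query, keys):
        # Scaled dot-product similarity of one query against all keys.
        return [sum(q * k for q, k in zip(query, key)) / temperature
                for key in keys]

    def cross_entropy(row, target):
        # -log softmax(row)[target], computed stably via log-sum-exp.
        m = max(row)
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        return log_sum - row[target]

    n = len(audio_embs)
    a2v = sum(cross_entropy(logits(a, visual_embs), i)
              for i, a in enumerate(audio_embs))
    v2a = sum(cross_entropy(logits(v, audio_embs), i)
              for i, v in enumerate(visual_embs))
    return (a2v + v2a) / (2 * n)
```

Correctly paired batches yield a loss near zero, while mismatched pairings are penalized, which is what pushes the two modalities into a shared embedding space.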
arXiv Detail & Related papers (2023-09-19T16:04:50Z)
- Visual Sound Localization in the Wild by Cross-Modal Interference Erasing [90.21476231683008]
In real-world scenarios, audios are usually contaminated by off-screen sound and background noise.
We propose the Interference Eraser (IEr) framework, which tackles the problem of audio-visual sound source localization in the wild.
arXiv Detail & Related papers (2022-02-13T21:06:19Z)
- Space-Time Memory Network for Sounding Object Localization in Videos [40.45443192327351]
We propose a space-time memory network for sounding object localization in videos.
It can simultaneously learn attention over both uni-modal and cross-modal representations.
arXiv Detail & Related papers (2021-11-10T04:40:12Z)
- Joint Learning of Visual-Audio Saliency Prediction and Sound Source Localization on Multi-face Videos [101.83513408195692]
We propose a multitask learning method for visual-audio saliency prediction and sound source localization on multi-face video.
The proposed method outperforms 12 state-of-the-art saliency prediction methods, and achieves competitive results in sound source localization.
arXiv Detail & Related papers (2021-11-05T14:35:08Z)
- Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain [103.3388198420822]
Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization.
This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions.
A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models.
arXiv Detail & Related papers (2021-02-23T09:59:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.