Do We Need Sound for Sound Source Localization?
- URL: http://arxiv.org/abs/2007.05722v1
- Date: Sat, 11 Jul 2020 08:57:58 GMT
- Title: Do We Need Sound for Sound Source Localization?
- Authors: Takashi Oya, Shohei Iwase, Ryota Natsume, Takahiro Itazuri, Shugo
Yamaguchi, Shigeo Morishima
- Abstract summary: We develop an unsupervised learning system that solves sound source localization.
We show that visual information is dominant in "sound" source localization when evaluated with the currently adopted benchmark dataset.
We present an evaluation protocol that requires both visual and aural information to be leveraged.
- Score: 12.512982702508669
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In sound source localization that uses both visual and aural
information, it remains unclear how much each modality contributes to the
result, i.e., do we need both image and sound for sound source localization?
To address this question, we develop an
unsupervised learning system that solves sound source localization by
decomposing this task into two steps: (i) "potential sound source
localization", a step that localizes possible sound sources using only visual
information, and (ii) "object selection", a step that identifies which objects are
actually sounding using aural information. Our overall system achieves
state-of-the-art performance in sound source localization, and more
importantly, we find that despite this constraint on available information, the
results of step (i) alone achieve similar performance. From this observation and further
experiments, we show that visual information is dominant in "sound" source
localization when evaluated with the currently adopted benchmark dataset.
Moreover, we show that the majority of sound-producing objects within the
samples in this dataset can be inherently identified using only visual
information, and thus that the dataset is inadequate to evaluate a system's
capability to leverage aural information. As an alternative, we present an
evaluation protocol that requires both visual and aural information to be
leveraged, and we verify this property through several experiments.
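Below is a minimal, illustrative sketch of the two-step decomposition described in the abstract, written in NumPy. The function names, the random linear scorer, the cosine-similarity selection rule, and the 0.5 threshold are all assumptions made for illustration; the paper's actual network architectures and training objectives are not reproduced here.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def potential_source_localization(region_features, rng):
    """Step (i): score each image region as a potential sound source using
    visual information only (a random linear scorer stands in for a trained one)."""
    w = rng.normal(size=region_features.shape[1])
    logits = region_features @ w
    return 1.0 / (1.0 + np.exp(-logits))  # per-region "could this emit sound?" probability

def object_selection(region_features, audio_embedding, potential_scores, threshold=0.5):
    """Step (ii): keep only the potential sources whose visual embedding agrees
    with the audio embedding, i.e. the objects that are actually sounding."""
    agreement = l2_normalize(region_features) @ l2_normalize(audio_embedding)
    return potential_scores * (agreement > threshold)

# Toy usage: 5 candidate regions with 128-d visual features and one 128-d audio embedding.
rng = np.random.default_rng(0)
regions = rng.normal(size=(5, 128))
audio = rng.normal(size=128)
scores = potential_source_localization(regions, rng)
print(object_selection(regions, audio, scores))
```

The point of the split is that step (i) never sees the audio, so comparing its output against the full system isolates how much the aural modality actually contributes.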
Related papers
- Learning to Visually Localize Sound Sources from Mixtures without Prior Source Knowledge [14.801564966406486]
The goal of the multi-sound source localization task is to localize sound sources from the mixture individually.
We present a novel multi-sound source localization method that can perform localization without prior knowledge of the number of sound sources.
arXiv Detail & Related papers (2024-03-26T06:27:50Z)
- Sound Source Localization is All about Cross-Modal Alignment [53.957081836232206]
Cross-modal semantic understanding is essential for genuine sound source localization.
We propose a joint task with sound source localization to better learn the interaction between audio and visual modalities.
Our method outperforms the state-of-the-art approaches in both sound source localization and cross-modal retrieval.
arXiv Detail & Related papers (2023-09-19T16:04:50Z)
- Audio-Visual Spatial Integration and Recursive Attention for Robust Sound Source Localization [13.278494654137138]
Humans utilize both audio and visual modalities as spatial cues to locate sound sources.
We propose an audio-visual spatial integration network that integrates spatial cues from both modalities.
Our method can perform more robust sound source localization.
arXiv Detail & Related papers (2023-08-11T11:57:58Z)
- Hear The Flow: Optical Flow-Based Self-Supervised Visual Sound Source Localization [11.059590443280726]
Learning to localize the sound source in videos without explicit annotations is a novel area of audio-visual research.
In a video, oftentimes, the objects exhibiting movement are the ones generating the sound.
In this work, we capture this characteristic by modeling the optical flow in a video as a prior to better aid in localizing the sound source (a toy sketch of this flow-as-prior idea appears after this list).
arXiv Detail & Related papers (2022-11-06T03:48:45Z)
- Exploiting Transformation Invariance and Equivariance for Self-supervised Sound Localisation [32.68710772281511]
We present a self-supervised framework for audio-visual representation learning, to localize the sound source in videos.
Our model significantly outperforms previous methods on two sound localization benchmarks, namely, Flickr-SoundNet and VGG-Sound.
This reveals that the proposed framework learns strong multi-modal representations that benefit sound localisation and generalize to further applications.
arXiv Detail & Related papers (2022-06-26T03:00:02Z)
- Visual Sound Localization in the Wild by Cross-Modal Interference Erasing [90.21476231683008]
In real-world scenarios, audio is usually contaminated by off-screen sounds and background noise.
We propose the Interference Eraser (IEr) framework, which tackles the problem of audio-visual sound source localization in the wild.
arXiv Detail & Related papers (2022-02-13T21:06:19Z)
- Visual Scene Graphs for Audio Source Separation [65.47212419514761]
State-of-the-art approaches for visually-guided audio source separation typically assume sources that have characteristic sounds, such as musical instruments.
We propose Audio Visual Scene Graph Segmenter (AVSGS), a novel deep learning model that embeds the visual structure of the scene as a graph and segments this graph into subgraphs.
Our pipeline is trained end-to-end via a self-supervised task consisting of separating audio sources using the visual graph from artificially mixed sounds.
arXiv Detail & Related papers (2021-09-24T13:40:51Z)
- A Review of Sound Source Localization with Deep Learning Methods [71.18444724397486]
This article is a review of deep learning methods for single and multiple sound source localization.
We provide an exhaustive topography of the neural-based localization literature in this context.
Tables summarizing the literature review are provided at the end of the review for a quick search of methods with a given set of target characteristics.
arXiv Detail & Related papers (2021-09-08T07:25:39Z)
- Dual Normalization Multitasking for Audio-Visual Sounding Object Localization [0.0]
We propose a new concept, Sounding Object, to reduce the ambiguity of the visual location of sound.
To tackle this new audio-visual sounding object localization (AVSOL) problem, we propose a novel multitask training strategy and architecture called Dual Normalization Multitasking.
arXiv Detail & Related papers (2021-06-01T02:02:52Z)
- Unsupervised Sound Localization via Iterative Contrastive Learning [106.56167882750792]
We propose an iterative contrastive learning framework that requires no data annotations.
We then use pseudo-labels to learn the correlation between the visual and audio signals sampled from the same video.
Our iterative strategy gradually encourages the localization of the sounding objects and reduces the correlation between the non-sounding regions and the reference audio.
arXiv Detail & Related papers (2021-04-01T07:48:29Z)
- Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain [103.3388198420822]
Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization.
This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions.
A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models.
arXiv Detail & Related papers (2021-02-23T09:59:31Z)
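As a side note on the "Hear The Flow" entry above, which motivates using motion as a cue for where sound comes from, the toy sketch below shows one way an optical-flow magnitude map could be normalized into a spatial prior and used to modulate an audio-visual similarity map. This particular combination is an assumption made for illustration only and does not reproduce that paper's actual architecture.

```python
import numpy as np

def flow_prior(flow, eps=1e-8):
    """Convert a dense optical-flow field of shape (H, W, 2) into a [0, 1] motion
    prior: regions that move more are treated as more likely to be sounding."""
    magnitude = np.linalg.norm(flow, axis=-1)
    return (magnitude - magnitude.min()) / (magnitude.max() - magnitude.min() + eps)

def localize_with_flow(av_similarity, flow):
    """Modulate an audio-visual similarity map of shape (H, W) by the motion prior."""
    return av_similarity * flow_prior(flow)

# Toy usage with random maps standing in for real model outputs.
rng = np.random.default_rng(0)
sim_map = rng.random((14, 14))       # placeholder audio-visual similarity map
flow = rng.normal(size=(14, 14, 2))  # placeholder dense optical flow
print(localize_with_flow(sim_map, flow).shape)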