Learning to Visually Localize Sound Sources from Mixtures without Prior Source Knowledge
- URL: http://arxiv.org/abs/2403.17420v1
- Date: Tue, 26 Mar 2024 06:27:50 GMT
- Title: Learning to Visually Localize Sound Sources from Mixtures without Prior Source Knowledge
- Authors: Dongjin Kim, Sung Jin Um, Sangmin Lee, Jung Uk Kim
- Abstract summary: The goal of the multi-sound source localization task is to localize sound sources from the mixture individually.
We present a novel multi-sound source localization method that can perform localization without prior knowledge of the number of sound sources.
- Score: 14.801564966406486
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The goal of the multi-sound source localization task is to localize sound sources from the mixture individually. While recent multi-sound source localization methods have shown improved performance, they face challenges due to their reliance on prior information about the number of objects to be separated. In this paper, to overcome this limitation, we present a novel multi-sound source localization method that can perform localization without prior knowledge of the number of sound sources. To achieve this goal, we propose an iterative object identification (IOI) module, which can recognize sound-making objects in an iterative manner. After finding the regions of sound-making objects, we devise object similarity-aware clustering (OSC) loss to guide the IOI module to effectively combine regions of the same object while distinguishing between different objects and backgrounds. This enables our method to perform accurate localization of sound-making objects without any prior knowledge. Extensive experimental results on the MUSIC and VGGSound benchmarks show the significant performance improvements of the proposed method over the existing methods for both single- and multi-source localization. Our code is available at: https://github.com/VisualAIKHU/NoPrior_MultiSSL
Related papers
- T-VSL: Text-Guided Visual Sound Source Localization in Mixtures [33.28678401737415]
We develop a framework to disentangle audio-visual source correspondence from multi-source mixtures.
Our framework exhibits promising zero-shot transferability to unseen classes during test time.
Experiments conducted on the MUSIC, VGGSound, and VGGSound-Instruments datasets demonstrate significant performance improvements over state-of-the-art methods.
arXiv Detail & Related papers (2024-04-02T09:07:05Z)
- Sound Source Localization is All about Cross-Modal Alignment [53.957081836232206]
Cross-modal semantic understanding is essential for genuine sound source localization.
We propose a joint task with sound source localization to better learn the interaction between audio and visual modalities.
Our method outperforms the state-of-the-art approaches in both sound source localization and cross-modal retrieval.
arXiv Detail & Related papers (2023-09-19T16:04:50Z)
- Audio-Visual Spatial Integration and Recursive Attention for Robust Sound Source Localization [13.278494654137138]
Humans utilize both audio and visual modalities as spatial cues to locate sound sources.
We propose an audio-visual spatial integration network that integrates spatial cues from both modalities.
Our method can perform more robust sound source localization.
arXiv Detail & Related papers (2023-08-11T11:57:58Z)
- Audio-Visual Grouping Network for Sound Localization from Mixtures [30.756247389435803]
Previous single-source methods mainly used the audio-visual association as clues to localize sounding objects in each image.
We propose a novel audio-visual grouping network, namely AVGN, that can directly learn category-wise semantic features for each source from the input audio mixture and image.
Compared to existing multi-source methods, our new framework can localize a flexible number of sources and disentangle category-aware audio-visual representations for individual sound sources.
arXiv Detail & Related papers (2023-03-29T22:58:55Z)
- Iterative Sound Source Localization for Unknown Number of Sources [57.006589498243336]
We propose an iterative sound source localization approach called ISSL, which can iteratively extract each source's DOA without a threshold until the termination criterion is met.
Our ISSL achieves significant performance improvements in both DOA estimation and source number detection compared with the existing threshold-based algorithms.
arXiv Detail & Related papers (2022-06-24T13:19:44Z)
- Separate What You Describe: Language-Queried Audio Source Separation [53.65665794338574]
We introduce the task of language-queried audio source separation (LASS).
LASS aims to separate a target source from an audio mixture based on a natural language query of the target source.
We propose LASS-Net, an end-to-end neural network that learns to jointly process acoustic and linguistic information.
arXiv Detail & Related papers (2022-03-28T23:47:57Z)
- Self-Supervised Predictive Learning: A Negative-Free Method for Sound Source Localization in Visual Scenes [91.59435809457659]
Self-Supervised Predictive Learning (SSPL) is a negative-free method for sound localization via explicit positive mining.
SSPL achieves significant improvements of 8.6% cIoU and 3.4% AUC on SoundNet-Flickr compared to the previous best.
arXiv Detail & Related papers (2022-03-25T01:42:42Z)
- Visual Sound Localization in the Wild by Cross-Modal Interference Erasing [90.21476231683008]
In real-world scenarios, audio is usually contaminated by off-screen sound and background noise.
We propose the Interference Eraser (IEr) framework, which tackles the problem of audio-visual sound source localization in the wild.
arXiv Detail & Related papers (2022-02-13T21:06:19Z)
- Dual Normalization Multitasking for Audio-Visual Sounding Object Localization [0.0]
We propose a new concept, Sounding Object, to reduce the ambiguity of the visual location of sound.
To tackle this new AVSOL problem, we propose a novel multitask training strategy and architecture called Dual Normalization Multitasking.
arXiv Detail & Related papers (2021-06-01T02:02:52Z)
- Discriminative Sounding Objects Localization via Self-supervised Audiovisual Matching [87.42246194790467]
We propose a two-stage learning framework to perform self-supervised class-aware sounding object localization.
We show that our model is superior in filtering out silent objects and pointing out the location of sounding objects of different classes.
arXiv Detail & Related papers (2020-10-12T05:51:55Z)
- Do We Need Sound for Sound Source Localization? [12.512982702508669]
We develop an unsupervised learning system that solves sound source localization.
We show that visual information is dominant in "sound" source localization when evaluated with the currently adopted benchmark dataset.
We present an evaluation protocol that enforces both visual and aural information to be leveraged.
arXiv Detail & Related papers (2020-07-11T08:57:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.