Multiple Sound Sources Localization from Coarse to Fine
- URL: http://arxiv.org/abs/2007.06355v2
- Date: Tue, 14 Jul 2020 13:38:52 GMT
- Title: Multiple Sound Sources Localization from Coarse to Fine
- Authors: Rui Qian, Di Hu, Heinrich Dinkel, Mengyue Wu, Ning Xu, Weiyao Lin
- Abstract summary: How to visually localize multiple sound sources in unconstrained videos is a formidable problem.
We develop a two-stage audiovisual learning framework that disentangles audio and visual representations of different categories from complex scenes.
Our model achieves state-of-the-art results on a public sound localization dataset.
- Score: 41.56420350529494
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: How to visually localize multiple sound sources in unconstrained videos is a
formidable problem, especially in the absence of pairwise sound-object
annotations. To solve this problem, we develop a two-stage audiovisual learning
framework that disentangles audio and visual representations of different
categories from complex scenes, then performs cross-modal feature alignment in
a coarse-to-fine manner. Our model achieves state-of-the-art results on a public
localization dataset, as well as considerable performance on multi-source
sound localization in complex scenes. We then employ the localization results
for sound separation and obtain performance comparable to existing methods.
These outcomes demonstrate our model's ability to effectively align sounds
with specific visual sources. Code is available at
https://github.com/shvdiwnkozbw/Multi-Source-Sound-Localization
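To make the coarse-to-fine idea concrete, here is a minimal sketch of how category-wise audio features could first be matched against a global visual descriptor (coarse) and then against each spatial location (fine). Shapes, names, and the gating scheme are illustrative assumptions, not the authors' implementation:

```python
# Minimal sketch of coarse-to-fine audio-visual alignment (illustrative only;
# shapes and the gating scheme are assumptions, not the paper's exact method).
import torch
import torch.nn.functional as F

def localize(audio_emb, visual_map):
    """audio_emb: (B, C, D) category-wise audio features;
    visual_map: (B, D, H, W) visual feature map."""
    a = F.normalize(audio_emb, dim=-1)                      # (B, C, D)
    v = F.normalize(visual_map, dim=1)                      # (B, D, H, W)
    # Coarse stage: match each audio category against a global visual descriptor.
    v_global = F.normalize(v.mean(dim=(2, 3)), dim=-1)      # (B, D)
    coarse = torch.einsum('bcd,bd->bc', a, v_global)        # (B, C) category scores
    active = coarse.softmax(dim=-1)                         # soft category presence
    # Fine stage: per-pixel cosine similarity for each category.
    fine = torch.einsum('bcd,bdhw->bchw', a, v)             # (B, C, H, W)
    return fine * active[:, :, None, None]                  # gate by coarse scores

heat = localize(torch.randn(2, 5, 128), torch.randn(2, 128, 14, 14))
print(heat.shape)  # torch.Size([2, 5, 14, 14])
```

The coarse scores gate the fine per-pixel maps, so only categories that plausibly sound in the clip contribute to the final heatmaps.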
Related papers
- Multi-scale Multi-instance Visual Sound Localization and Segmentation [15.624453757710802]
We propose a novel multi-scale visual sound localization framework, namely M2VSL.
M2VSL learns multi-scale semantic features associated with sound sources from the input image to localize sounding objects.
We conduct extensive experiments on the VGGSound-Instruments, VGG-Sound Sources, and AVSBench benchmarks.
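A rough sketch of the multi-scale idea, under the assumption that similarity maps from several visual feature scales are upsampled to a common resolution and fused; `multiscale_heatmap` and all shapes are hypothetical, not M2VSL's actual design:

```python
# Hedged sketch of multi-scale audio-visual localization: similarity maps from
# several visual feature scales are upsampled and averaged. Illustrative only.
import torch
import torch.nn.functional as F

def multiscale_heatmap(audio, feats, out_size=(56, 56)):
    """audio: (B, D); feats: list of (B, D, Hi, Wi) maps from different scales."""
    a = F.normalize(audio, dim=-1)
    maps = []
    for f in feats:
        sim = torch.einsum('bd,bdhw->bhw', a, F.normalize(f, dim=1))
        maps.append(F.interpolate(sim.unsqueeze(1), size=out_size,
                                  mode='bilinear', align_corners=False))
    return torch.stack(maps, dim=0).mean(dim=0)  # (B, 1, H, W) fused map

feats = [torch.randn(2, 128, s, s) for s in (7, 14, 28)]
print(multiscale_heatmap(torch.randn(2, 128), feats).shape)
```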
arXiv Detail & Related papers (2024-08-31T15:43:22Z)
- T-VSL: Text-Guided Visual Sound Source Localization in Mixtures [33.28678401737415]
We develop a framework to disentangle audio-visual source correspondence from multi-source mixtures.
Our framework exhibits promising zero-shot transferability to unseen classes during test time.
Experiments conducted on the MUSIC, VGGSound, and VGGSound-Instruments datasets demonstrate significant performance improvements over state-of-the-art methods.
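One plausible reading of text guidance, sketched below with hypothetical names: class-name text embeddings act as a shared anchor space tying audio and visual features of the same category together, so unseen classes only need new text embeddings (consistent with the zero-shot claim). This is an illustration, not T-VSL's actual architecture:

```python
# Illustrative sketch of text-guided disentanglement (not T-VSL's exact method).
import torch
import torch.nn.functional as F

def text_guided_maps(text_emb, audio_emb, visual_map):
    """text_emb: (K, D), one embedding per class name; audio_emb: (B, D);
    visual_map: (B, D, H, W)."""
    t = F.normalize(text_emb, dim=-1)
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_map, dim=1)
    audio_cls = (a @ t.t()).softmax(dim=-1)            # (B, K): which classes sound
    pix_cls = torch.einsum('kd,bdhw->bkhw', t, v)      # (B, K, H, W) text-visual maps
    # Weight per-class visual maps by how strongly the audio matches each class;
    # unseen classes only require new text embeddings (zero-shot transfer).
    return (audio_cls[:, :, None, None] * pix_cls).sum(dim=1)

out = text_guided_maps(torch.randn(10, 128), torch.randn(2, 128),
                       torch.randn(2, 128, 14, 14))
print(out.shape)  # torch.Size([2, 14, 14])
```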
arXiv Detail & Related papers (2024-04-02T09:07:05Z)
- Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners [69.70590867769408]
Video and audio content creation serves as the core technique for the movie industry and professional users.
Existing diffusion-based methods tackle video and audio generation separately, which hinders transfer of the technique from academia to industry.
In this work, we aim to fill this gap with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation.
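A very rough sketch of optimization-based latent alignment during sampling, with a toy scorer standing in for a real joint audio-visual embedding; every name and shape here is an assumption rather than the paper's method:

```python
# At each denoising step, nudge the video and audio latents along the gradient
# of a cross-modal alignment score. Assumption-laden illustration only.
import torch

def align_step(z_video, z_audio, scorer, lr=0.1):
    z_v = z_video.detach().requires_grad_(True)
    z_a = z_audio.detach().requires_grad_(True)
    score = scorer(z_v, z_a)          # higher = better audio-visual agreement
    score.backward()
    with torch.no_grad():
        z_v += lr * z_v.grad          # gradient ascent on the alignment score
        z_a += lr * z_a.grad
    return z_v.detach(), z_a.detach()

# Toy scorer: cosine similarity of pooled latents, standing in for a real
# pretrained joint embedding model.
scorer = lambda v, a: torch.cosine_similarity(
    v.mean(dim=(2, 3)), a.mean(dim=2), dim=-1).sum()
zv, za = align_step(torch.randn(1, 8, 32, 32), torch.randn(1, 8, 128), scorer)
```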
arXiv Detail & Related papers (2024-02-27T17:57:04Z)
- Sound Source Localization is All about Cross-Modal Alignment [53.957081836232206]
Cross-modal semantic understanding is essential for genuine sound source localization.
We propose a joint task with sound source localization to better learn the interaction between audio and visual modalities.
Our method outperforms the state-of-the-art approaches in both sound source localization and cross-modal retrieval.
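To illustrate how localization and cross-modal retrieval can share one similarity space, here is a hedged sketch in which a symmetric batch-wise contrastive loss aligns paired audio and images while the same features yield the localization map; the loss form and shapes are assumptions:

```python
# Sketch: one embedding space serves both tasks. The per-pixel similarity is
# the localization map; pooled similarities drive a contrastive retrieval loss.
import torch
import torch.nn.functional as F

def alignment_loss_and_map(audio, visual_map, tau=0.07):
    """audio: (B, D); visual_map: (B, D, H, W)."""
    a = F.normalize(audio, dim=-1)
    v = F.normalize(visual_map, dim=1)
    heat = torch.einsum('bd,bdhw->bhw', a, v)          # localization map per pair
    img = F.normalize(v.mean(dim=(2, 3)), dim=-1)      # global image embedding
    logits = a @ img.t() / tau                         # (B, B) cross-modal scores
    labels = torch.arange(a.size(0))
    # Symmetric contrastive loss: audio->image and image->audio retrieval.
    loss = (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2
    return loss, heat

loss, heat = alignment_loss_and_map(torch.randn(4, 128), torch.randn(4, 128, 14, 14))
```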
arXiv Detail & Related papers (2023-09-19T16:04:50Z)
- Audio-Visual Grouping Network for Sound Localization from Mixtures [30.756247389435803]
Previous single-source methods mainly used the audio-visual association as clues to localize sounding objects in each image.
We propose a novel audio-visual grouping network, namely AVGN, that can directly learn category-wise semantic features for each source from the input audio mixture and image.
Compared to existing multi-source methods, our new framework can localize a flexible number of sources and disentangle category-aware audio-visual representations for individual sound sources.
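A hedged sketch of the grouping idea: learnable category tokens cross-attend to audio and visual features to pool category-wise representations for each source. Module names, head counts, and shapes are assumptions, not AVGN's exact design:

```python
# Learnable category tokens attend to audio frames and visual patches to pull
# out one representation per category. Illustrative sketch only.
import torch
import torch.nn as nn

class GroupingHead(nn.Module):
    def __init__(self, num_classes=10, dim=128):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_classes, dim))
        self.attn_a = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.attn_v = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, audio_seq, visual_seq):
        """audio_seq: (B, Ta, D) audio frames; visual_seq: (B, HW, D) patches."""
        q = self.tokens.unsqueeze(0).expand(audio_seq.size(0), -1, -1)
        a_cls, _ = self.attn_a(q, audio_seq, audio_seq)   # category-wise audio
        v_cls, _ = self.attn_v(q, visual_seq, visual_seq) # category-wise visual
        return a_cls, v_cls                               # (B, num_classes, D)

head = GroupingHead()
a_cls, v_cls = head(torch.randn(2, 16, 128), torch.randn(2, 196, 128))
```

Because the number of active tokens can vary per clip, this kind of design naturally handles a flexible number of sources.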
arXiv Detail & Related papers (2023-03-29T22:58:55Z)
- Hear The Flow: Optical Flow-Based Self-Supervised Visual Sound Source Localization [11.059590443280726]
Learning to localize the sound source in videos without explicit annotations is a novel area of audio-visual research.
In a video, oftentimes, the objects exhibiting movement are the ones generating the sound.
In this work, we capture this characteristic by modeling the optical flow in a video as a prior to better aid in localizing the sound source.
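The flow-as-prior idea can be sketched very simply: weight the audio-visual similarity map by normalized motion magnitude so that moving regions are favored. The specific weighting below is an illustrative assumption:

```python
# Regions with larger optical-flow magnitude get a higher prior of being the
# sounding object. Weighting scheme is illustrative, not the paper's exact one.
import torch
import torch.nn.functional as F

def flow_weighted_heatmap(audio, visual_map, flow):
    """audio: (B, D); visual_map: (B, D, H, W); flow: (B, 2, H, W)."""
    a = F.normalize(audio, dim=-1)
    v = F.normalize(visual_map, dim=1)
    sim = torch.einsum('bd,bdhw->bhw', a, v)            # audio-visual similarity
    prior = flow.norm(dim=1)                            # (B, H, W) motion magnitude
    prior = prior / (prior.amax(dim=(1, 2), keepdim=True) + 1e-6)
    return sim * prior                                  # motion-modulated heatmap

out = flow_weighted_heatmap(torch.randn(2, 128), torch.randn(2, 128, 14, 14),
                            torch.randn(2, 2, 14, 14))
```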
arXiv Detail & Related papers (2022-11-06T03:48:45Z)
- Self-Supervised Predictive Learning: A Negative-Free Method for Sound Source Localization in Visual Scenes [91.59435809457659]
Self-Supervised Predictive Learning (SSPL) is a negative-free method for sound localization via explicit positive mining.
SSPL achieves significant improvements of 8.6% cIoU and 3.4% AUC on SoundNet-Flickr compared to the previous best.
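As a rough illustration of predictive learning without negatives, here is a SimSiam-style audio-visual objective with a stop-gradient target; SSPL's actual architecture (including its positive mining) differs, and the module names are assumptions:

```python
# Negative-free objective: predict the visual embedding from the audio one and
# stop gradients on the target to prevent collapse. Illustrative sketch only.
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 128
predictor = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

def negative_free_loss(audio_emb, visual_emb):
    p = F.normalize(predictor(audio_emb), dim=-1)
    z = F.normalize(visual_emb.detach(), dim=-1)   # stop-gradient target
    return -(p * z).sum(dim=-1).mean()             # negative cosine similarity

loss = negative_free_loss(torch.randn(4, dim), torch.randn(4, dim))
```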
arXiv Detail & Related papers (2022-03-25T01:42:42Z)
- Visual Sound Localization in the Wild by Cross-Modal Interference Erasing [90.21476231683008]
In real-world scenarios, audio is usually contaminated by off-screen sound and background noise.
We propose the Interference Eraser (IEr) framework, which tackles the problem of audio-visual sound source localization in the wild.
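As a generic illustration of interference suppression (a plain vector-projection trick, explicitly not IEr's actual mechanism), one could remove the component of the mixture embedding that lies along an estimated off-screen/noise direction before matching audio to the image:

```python
# Project out an estimated interference direction from the audio embedding.
# Generic technique for illustration; not the IEr framework's method.
import torch
import torch.nn.functional as F

def erase_interference(audio_emb, noise_emb):
    """audio_emb: (B, D) mixture embedding; noise_emb: (B, D) estimate of the
    off-screen/background component (hypothetical input)."""
    n = F.normalize(noise_emb, dim=-1)
    proj = (audio_emb * n).sum(dim=-1, keepdim=True) * n  # component along noise
    return audio_emb - proj                               # interference-reduced

clean = erase_interference(torch.randn(2, 128), torch.randn(2, 128))
```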
arXiv Detail & Related papers (2022-02-13T21:06:19Z)
- Joint Learning of Visual-Audio Saliency Prediction and Sound Source Localization on Multi-face Videos [101.83513408195692]
We propose a multitask learning method for visual-audio saliency prediction and sound source localization on multi-face videos.
The proposed method outperforms 12 state-of-the-art saliency prediction methods, and achieves competitive results in sound source localization.
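A minimal sketch of the multitask setup, assuming a weighted sum of a saliency-prediction term and a localization term over outputs of two heads on a shared backbone; the loss choices and weighting are hypothetical:

```python
# Weighted multitask loss combining saliency prediction and localization.
# Loss forms and the weight w are assumptions for illustration.
import torch
import torch.nn.functional as F

def multitask_loss(pred_saliency, gt_saliency, pred_loc, gt_loc, w=0.5):
    """All maps: (B, 1, H, W); w balances the two tasks."""
    l_sal = F.binary_cross_entropy_with_logits(pred_saliency, gt_saliency)
    l_loc = F.binary_cross_entropy_with_logits(pred_loc, gt_loc)
    return l_sal + w * l_loc

B, H, W = 2, 28, 28
loss = multitask_loss(torch.randn(B, 1, H, W), torch.rand(B, 1, H, W),
                      torch.randn(B, 1, H, W), torch.rand(B, 1, H, W))
```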
arXiv Detail & Related papers (2021-11-05T14:35:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.