Self-Supervised Predictive Learning: A Negative-Free Method for Sound
Source Localization in Visual Scenes
- URL: http://arxiv.org/abs/2203.13412v1
- Date: Fri, 25 Mar 2022 01:42:42 GMT
- Title: Self-Supervised Predictive Learning: A Negative-Free Method for Sound
Source Localization in Visual Scenes
- Authors: Zengjie Song, Yuxi Wang, Junsong Fan, Tieniu Tan, Zhaoxiang Zhang
- Abstract summary: Self-Supervised Predictive Learning (SSPL) is a negative-free method for sound localization via explicit positive mining.
SSPL achieves significant improvements of 8.6% cIoU and 3.4% AUC on SoundNet-Flickr compared to the previous best.
- Score: 91.59435809457659
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sound source localization in visual scenes aims to localize objects emitting
the sound in a given image. Recent works showing impressive localization
performance typically rely on the contrastive learning framework. However, the
random sampling of negatives, as commonly adopted in these methods, can result
in misalignment between audio and visual features, thus inducing ambiguity in
localization. In this paper, instead of following previous literature, we
propose Self-Supervised Predictive Learning (SSPL), a negative-free method for
sound localization via explicit positive mining. Specifically, we first devise
a three-stream network to elegantly associate the sound source with two augmented
views of one corresponding video frame, leading to semantically coherent
similarities between audio and visual features. Second, we introduce a novel
predictive coding module for audio-visual feature alignment. Such a module
helps SSPL focus on target objects in a progressive manner and effectively
lowers the difficulty of positive-pair learning. Experiments show the surprising
result that SSPL outperforms the state-of-the-art approach on two standard
sound localization benchmarks. In particular, SSPL achieves significant
improvements of 8.6% cIoU and 3.4% AUC on SoundNet-Flickr compared to the
previous best. Code is available at: https://github.com/zjsong/SSPL.
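To make the high-level recipe above concrete, the following is a minimal, illustrative PyTorch sketch of a negative-free, three-stream setup in the spirit of SSPL: two augmented views of one frame and the paired audio are encoded, and a small predictor aligns the audio feature to each visual feature with a stop-gradient, cosine-similarity (SimSiam-style) objective. The toy encoders, feature dimensions, predictor, and loss are assumptions for illustration only and are not the authors' actual architecture, predictive coding module, or training objective (see the linked repository for those):

# Hedged sketch, not the authors' code: a minimal negative-free audio-visual
# alignment setup. All architectures and hyperparameters are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVisualEncoder(nn.Module):
    # Toy CNN standing in for the visual backbone.
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim),
        )

    def forward(self, x):
        return self.net(x)

class TinyAudioEncoder(nn.Module):
    # Toy CNN over log-mel spectrograms standing in for the audio backbone.
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim),
        )

    def forward(self, x):
        return self.net(x)

def negative_free_alignment_loss(pred, target):
    # SimSiam-style negative cosine similarity; gradients are stopped on the target.
    pred = F.normalize(pred, dim=-1)
    target = F.normalize(target.detach(), dim=-1)
    return -(pred * target).sum(dim=-1).mean()

class ThreeStreamSketch(nn.Module):
    # Two augmented frame views plus one audio clip; a predictor aligns the
    # audio feature to each visual stream without any negative pairs.
    def __init__(self, dim=128):
        super().__init__()
        self.visual = TinyVisualEncoder(dim)    # shared across both views
        self.audio = TinyAudioEncoder(dim)
        self.predictor = nn.Sequential(          # crude stand-in for the
            nn.Linear(dim, dim), nn.ReLU(),      # paper's predictive coding module
            nn.Linear(dim, dim),
        )

    def forward(self, view1, view2, spec):
        v1, v2 = self.visual(view1), self.visual(view2)
        a = self.predictor(self.audio(spec))
        # Symmetric negative-free loss over the two audio-visual positive pairs.
        return 0.5 * (negative_free_alignment_loss(a, v1)
                      + negative_free_alignment_loss(a, v2))

if __name__ == "__main__":
    model = ThreeStreamSketch()
    frames1 = torch.randn(4, 3, 96, 96)   # augmented view 1 of the frame
    frames2 = torch.randn(4, 3, 96, 96)   # augmented view 2 of the same frame
    specs = torch.randn(4, 1, 64, 64)     # log-mel spectrogram of the paired audio
    loss = model(frames1, frames2, specs)
    loss.backward()
    print(loss.item())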
Related papers
- Enhancing Sound Source Localization via False Negative Elimination [58.87973081084927]
Sound source localization aims to localize objects emitting the sound in visual scenes.
Recent works obtaining impressive results typically rely on contrastive learning.
We propose a novel audio-visual learning framework which is instantiated with two individual learning schemes.
arXiv Detail & Related papers (2024-08-29T11:24:51Z)
- Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language [77.33458847943528]
We present DenseAV, a novel dual encoder grounding architecture that learns high-resolution, semantically meaningful, and audio-visually aligned features solely through watching videos.
We show that DenseAV can discover the "meaning" of words and the "location" of sounds without explicit localization supervision.
arXiv Detail & Related papers (2024-06-09T03:38:21Z)
- Learning Audio-Visual Source Localization via False Negative Aware Contrastive Learning [39.890616126301204]
We propose a new learning strategy named False Negative Aware Contrastive (FNAC) to mitigate the problem of misleading the training with false negative samples.
FNAC achieves state-of-the-art performances on Flickr-SoundNet, VGG-Sound, and AVSBench.
arXiv Detail & Related papers (2023-03-20T17:41:11Z)
- A Closer Look at Weakly-Supervised Audio-Visual Source Localization [26.828874753756523]
Audio-visual source localization is a challenging task that aims to predict the location of visual sound sources in a video.
We extend the test set of popular benchmarks, Flickr SoundNet and VGG-Sound Sources, in order to include negative samples.
We also propose a new approach for visual sound source localization that addresses both these problems.
arXiv Detail & Related papers (2022-08-30T14:17:46Z)
- Localizing Visual Sounds the Easy Way [26.828874753756523]
Unsupervised audio-visual source localization aims at localizing visible sound sources in a video without relying on ground-truth localization for training.
We propose EZ-VSL, which does not rely on the construction of positive and/or negative regions during training.
Our framework achieves state-of-the-art performance on two popular benchmarks, Flickr SoundNet and VGG-Sound Source.
arXiv Detail & Related papers (2022-03-17T13:52:58Z)
- Learning Sound Localization Better From Semantically Similar Samples [79.47083330766002]
Existing audio-visual works employ contrastive learning by assigning corresponding audio-visual pairs from the same source as positives and randomly mismatched pairs as negatives (a minimal sketch of this standard contrastive setup is given after this related-papers list).
Our key contribution is showing that hard positives can give similar response maps to the corresponding pairs.
We demonstrate the effectiveness of our approach on VGG-SS and SoundNet-Flickr test sets.
arXiv Detail & Related papers (2022-02-07T08:53:55Z)
- Unsupervised Sound Localization via Iterative Contrastive Learning [106.56167882750792]
We propose an iterative contrastive learning framework that requires no data annotations.
We then use the pseudo-labels to learn the correlation between the visual and audio signals sampled from the same video.
Our iterative strategy gradually encourages the localization of the sounding objects and reduces the correlation between the non-sounding regions and the reference audio.
arXiv Detail & Related papers (2021-04-01T07:48:29Z)
- Seeing wake words: Audio-visual Keyword Spotting [103.12655603634337]
KWS-Net is a novel convolutional architecture that uses a similarity map intermediate representation to separate the task into sequence matching and pattern detection.
We show that our method generalises to other languages, specifically French and German, and achieves a comparable performance to English with less language specific data.
arXiv Detail & Related papers (2020-09-02T17:57:38Z)
- Multiple Sound Sources Localization from Coarse to Fine [41.56420350529494]
How to visually localize multiple sound sources in unconstrained videos is a formidable problem.
We develop a two-stage audiovisual learning framework that disentangles audio and visual representations of different categories from complex scenes.
Our model achieves state-of-the-art results on a public sound localization dataset.
arXiv Detail & Related papers (2020-07-13T12:59:40Z)
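For contrast with the negative-free objective sketched earlier, the following is a minimal, illustrative sketch of the standard audio-visual contrastive (InfoNCE-style) loss that several of the works listed above build on: the matching audio-visual pair in a batch serves as the positive, and all other pairings within the batch act as randomly sampled negatives. The function name and temperature value are illustrative assumptions, not any specific paper's implementation.

# Hedged sketch, illustrative only: batch-wise audio-visual InfoNCE loss.
import torch
import torch.nn.functional as F

def audio_visual_infonce(audio_feats, visual_feats, temperature=0.07):
    # audio_feats, visual_feats: (B, D) embeddings of paired audio clips and frames.
    a = F.normalize(audio_feats, dim=-1)
    v = F.normalize(visual_feats, dim=-1)
    logits = a @ v.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    # Diagonal entries are the true (positive) pairs; off-diagonals act as negatives.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

if __name__ == "__main__":
    a = torch.randn(8, 128)
    v = torch.randn(8, 128)
    print(audio_visual_infonce(a, v).item())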
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.