A Closer Look at Weakly-Supervised Audio-Visual Source Localization
- URL: http://arxiv.org/abs/2209.09634v1
- Date: Tue, 30 Aug 2022 14:17:46 GMT
- Title: A Closer Look at Weakly-Supervised Audio-Visual Source Localization
- Authors: Shentong Mo, Pedro Morgado
- Abstract summary: Audio-visual source localization is a challenging task that aims to predict the location of visual sound sources in a video.
We extend the test set of popular benchmarks, Flickr SoundNet and VGG-Sound Sources, in order to include negative samples.
We also propose a new approach for visual sound source localization that addresses both these problems.
- Score: 26.828874753756523
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio-visual source localization is a challenging task that aims to predict
the location of visual sound sources in a video. Since collecting ground-truth
annotations of sounding objects can be costly, a plethora of weakly-supervised
localization methods that can learn from datasets with no bounding-box
annotations have been proposed in recent years, by leveraging the natural
co-occurrence of audio and visual signals. Despite significant interest,
popular evaluation protocols have two major flaws. First, they allow for the
use of a fully annotated dataset to perform early stopping, thus significantly
increasing the annotation effort required for training. Second, current
evaluation metrics assume the presence of sound sources at all times. This is
of course an unrealistic assumption, and thus better metrics are necessary to
capture the model's performance on (negative) samples with no visible sound
sources. To accomplish this, we extend the test set of popular benchmarks,
Flickr SoundNet and VGG-Sound Sources, in order to include negative samples,
and measure performance using metrics that balance localization accuracy and
recall. Using the new protocol, we conducted an extensive evaluation of prior
methods, and found that most prior works are not capable of identifying
negatives and suffer from significant overfitting (they rely heavily on
early stopping for their best results). We also propose a new approach for
visual sound source localization that addresses both of these problems. In
particular, we found that, through extreme visual dropout and the use of
momentum encoders, the proposed approach combats overfitting effectively and
establishes new state-of-the-art performance on both Flickr SoundNet and
VGG-Sound Sources. Code
and pre-trained models are available at https://github.com/stoneMo/SLAVC.
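The abstract names the two ingredients of the revised protocol (negative samples, plus metrics that balance localization accuracy and recall) without spelling out the computation. The sketch below is a rough illustration of how such a protocol could score a model; the function name, the peak-value confidence score, the IoU-style overlap check, and the F1 threshold sweep are all assumptions made for illustration, not the paper's exact metric.

```python
import numpy as np

def evaluate_with_negatives(loc_maps, gt_masks, is_positive, iou_thresh=0.5):
    """Hypothetical sketch: score a localizer on a test set that mixes
    positive samples (visible sound source) and negative samples (none).

    loc_maps    -- list of HxW arrays, higher values = more likely source
    gt_masks    -- list of HxW binary arrays (None for negative samples)
    is_positive -- list of bools, True if a sound source is visible
    """
    # Confidence that *some* source is visible; here simply the map's peak.
    conf = np.array([m.max() for m in loc_maps])

    best_f1 = 0.0
    # Sweep a detection threshold; a real protocol might decouple this
    # threshold from the one used to binarize the localization map.
    for tau in np.linspace(conf.min(), conf.max(), num=50):
        tp = fp = fn = 0
        for m, gt, pos, c in zip(loc_maps, gt_masks, is_positive, conf):
            predicted = c >= tau
            if pos and predicted:
                pred_mask = m >= tau
                inter = np.logical_and(pred_mask, gt).sum()
                union = np.logical_or(pred_mask, gt).sum()
                if union > 0 and inter / union >= iou_thresh:
                    tp += 1          # localized well enough to count
                else:
                    fp += 1          # fired, but in the wrong place
            elif pos and not predicted:
                fn += 1              # missed a real source
            elif not pos and predicted:
                fp += 1              # hallucinated a source on a negative
        precision = tp / max(tp + fp, 1)
        recall = tp / max(tp + fn, 1)
        if precision + recall > 0:
            best_f1 = max(best_f1, 2 * precision * recall / (precision + recall))
    return best_f1
```

Under a metric like this, a model that always predicts a source somewhere is penalized on every negative sample, which is exactly the failure mode the extended test sets are designed to expose.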
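The abstract likewise attributes the proposed approach's robustness to two ingredients, extreme visual dropout and momentum encoders, without giving details. The following is a minimal sketch of how those two pieces could fit together, assuming generic encoder modules; the class name, feature shapes, and hyperparameters are illustrative and are not taken from the SLAVC codebase.

```python
import copy
import torch
import torch.nn.functional as F

class MomentumAVLocalizer(torch.nn.Module):
    """Hypothetical sketch: audio-visual localizer with heavy dropout on
    visual features and EMA ("momentum") copies of both encoders."""

    def __init__(self, visual_encoder, audio_encoder, momentum=0.999, drop_p=0.9):
        super().__init__()
        self.visual_encoder = visual_encoder   # frames -> (B, C, H, W)
        self.audio_encoder = audio_encoder     # audio  -> (B, C)
        # Momentum encoders: frozen EMA copies, updated without gradients.
        self.visual_encoder_m = copy.deepcopy(visual_encoder)
        self.audio_encoder_m = copy.deepcopy(audio_encoder)
        for p in list(self.visual_encoder_m.parameters()) + \
                 list(self.audio_encoder_m.parameters()):
            p.requires_grad = False
        self.momentum = momentum
        self.drop_p = drop_p                   # the "extreme" dropout rate

    @torch.no_grad()
    def momentum_update(self):
        """Call once per training step to track the online encoders."""
        for online, ema in [(self.visual_encoder, self.visual_encoder_m),
                            (self.audio_encoder, self.audio_encoder_m)]:
            for po, pe in zip(online.parameters(), ema.parameters()):
                pe.data = self.momentum * pe.data + (1 - self.momentum) * po.data

    def forward(self, frames, audio):
        v = self.visual_encoder(frames)                      # (B, C, H, W)
        v = F.dropout(v, p=self.drop_p, training=self.training)
        a = self.audio_encoder(audio)                        # (B, C)
        # Cosine-similarity localization map between audio and each pixel.
        sim = torch.einsum('bchw,bc->bhw',
                           F.normalize(v, dim=1), F.normalize(a, dim=1))
        with torch.no_grad():                                # stable targets
            v_m = self.visual_encoder_m(frames)
            a_m = self.audio_encoder_m(audio)
            sim_m = torch.einsum('bchw,bc->bhw',
                                 F.normalize(v_m, dim=1), F.normalize(a_m, dim=1))
        return sim, sim_m
```

The intuition, following the abstract, is that aggressive dropout keeps the visual pathway from memorizing training pairs, while the slowly-moving momentum maps provide a smoother learning signal, so performance no longer hinges on early stopping against an annotated validation set.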
Related papers
- A Critical Assessment of Visual Sound Source Localization Models Including Negative Audio [5.728456310555323]
This paper introduces a novel test set and metrics designed to complement the current standard evaluation of visual sound source localization models.
We consider three types of negative audio: silence, noise and offscreen.
Our analysis reveals that numerous state-of-the-art models fail to appropriately adjust their predictions based on the audio input.
arXiv Detail & Related papers (2024-10-01T19:28:45Z)
- Bayesian Detector Combination for Object Detection with Crowdsourced Annotations [49.43709660948812]
Acquiring fine-grained object detection annotations in unconstrained images is time-consuming, expensive, and prone to noise.
We propose a novel Bayesian Detector Combination (BDC) framework to more effectively train object detectors with noisy crowdsourced annotations.
BDC is model-agnostic, requires no prior knowledge of the annotators' skill level, and seamlessly integrates with existing object detection models.
arXiv Detail & Related papers (2024-07-10T18:00:54Z)
- Learning with Noisy Foundation Models [95.50968225050012]
This paper is the first work to comprehensively understand and analyze the nature of noise in pre-training datasets.
We propose a tuning method (NMTune) that applies an affine transformation to the feature space to mitigate the malignant effect of noise and improve generalization.
arXiv Detail & Related papers (2024-03-11T16:22:41Z)
- Do You Remember? Overcoming Catastrophic Forgetting for Fake Audio Detection [54.20974251478516]
We propose a continual learning algorithm for fake audio detection to overcome catastrophic forgetting.
When fine-tuning a detection network, our approach adaptively computes the direction of weight modification according to the ratio of genuine utterances and fake utterances.
Our method can easily be generalized to related fields, like speech emotion recognition.
arXiv Detail & Related papers (2023-08-07T05:05:49Z)
- Self-Supervised Predictive Learning: A Negative-Free Method for Sound Source Localization in Visual Scenes [91.59435809457659]
Self-Supervised Predictive Learning (SSPL) is a negative-free method for sound localization via explicit positive mining.
SSPL achieves significant improvements of 8.6% cIoU and 3.4% AUC on SoundNet-Flickr compared to the previous best.
arXiv Detail & Related papers (2022-03-25T01:42:42Z)
- A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach [53.727460222955266]
Temporal Sentence Grounding in Videos (TSGV) aims to ground a natural language sentence in an untrimmed video.
Recent studies have found that current benchmark datasets may have obvious moment annotation biases.
We introduce a new evaluation metric "dR@n,IoU@m" that discounts the basic recall scores to alleviate the inflated evaluation caused by biased datasets.
arXiv Detail & Related papers (2022-03-10T08:58:18Z)
- Dual Normalization Multitasking for Audio-Visual Sounding Object Localization [0.0]
We propose a new concept, Sounding Object, to reduce the ambiguity of the visual location of sound.
To tackle this new audio-visual sounding object localization (AVSOL) problem, we propose a novel multitask training strategy and architecture called Dual Normalization Multitasking.
arXiv Detail & Related papers (2021-06-01T02:02:52Z)
- Continual Learning for Fake Audio Detection [62.54860236190694]
This paper proposes Detecting Fake Without Forgetting, a continual-learning-based method that lets the model learn new spoofing attacks incrementally.
Experiments are conducted on the ASVspoof 2019 dataset.
arXiv Detail & Related papers (2021-04-15T07:57:05Z)
- Active Learning for Sound Event Detection [18.750572243562576]
This paper proposes an active learning system for sound event detection (SED).
It aims at maximizing the accuracy of a learned SED model with limited annotation effort.
Remarkably, the required annotation effort can be greatly reduced on the dataset where target sound events are rare.
arXiv Detail & Related papers (2020-02-12T14:46:55Z)