A Closer Look at Weakly-Supervised Audio-Visual Source Localization
- URL: http://arxiv.org/abs/2209.09634v1
- Date: Tue, 30 Aug 2022 14:17:46 GMT
- Title: A Closer Look at Weakly-Supervised Audio-Visual Source Localization
- Authors: Shentong Mo, Pedro Morgado
- Abstract summary: Audio-visual source localization is a challenging task that aims to predict the location of visual sound sources in a video.
We extend the test set of popular benchmarks, Flickr SoundNet and VGG-Sound Sources, in order to include negative samples.
We also propose a new approach for visual sound source localization that addresses two problems of prior methods: overfitting and the inability to identify negatives.
- Score: 26.828874753756523
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio-visual source localization is a challenging task that aims to predict
the location of visual sound sources in a video. Since collecting ground-truth
annotations of sounding objects can be costly, a plethora of weakly-supervised
localization methods that can learn from datasets with no bounding-box
annotations have been proposed in recent years, by leveraging the natural
co-occurrence of audio and visual signals. Despite significant interest,
popular evaluation protocols have two major flaws. First, they allow for the
use of a fully annotated dataset to perform early stopping, thus significantly
increasing the annotation effort required for training. Second, current
evaluation metrics assume the presence of sound sources at all times. This is
of course an unrealistic assumption, and thus better metrics are necessary to
capture the model's performance on (negative) samples with no visible sound
sources. To accomplish this, we extend the test set of popular benchmarks,
Flickr SoundNet and VGG-Sound Sources, in order to include negative samples,
and measure performance using metrics that balance localization accuracy and
recall. Using the new protocol, we conducted an extensive evaluation of prior
methods, and found that most prior works are not capable of identifying
negatives and suffer from significant overfitting problems (i.e., they rely heavily on
early stopping for best results). We also propose a new approach for visual
sound source localization that addresses both these problems. In particular, we
found that, through extreme visual dropout and the use of momentum encoders,
the proposed approach combats overfitting effectively and establishes new
state-of-the-art performance on both Flickr SoundNet and VGG-Sound Source. Code
and pre-trained models are available at https://github.com/stoneMo/SLAVC.
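
The abstract names momentum encoders without describing how they work. As a rough sketch only, and not the authors' implementation (the linked repository is the reference for that), a momentum encoder is commonly kept as an exponential moving average of the trained encoder's weights. The function name `momentum_update`, the coefficient `m`, and the use of plain lists of floats as parameters are all illustrative assumptions.

```python
def momentum_update(target, online, m=0.999):
    """Return EMA-updated target weights: m * target + (1 - m) * online.

    Hypothetical helper for illustration; weights are plain lists of floats.
    """
    if len(target) != len(online):
        raise ValueError("parameter lists must have the same length")
    return [m * t + (1.0 - m) * o for t, o in zip(target, online)]


# Toy usage: the target weights drift slowly toward the online weights.
online = [1.0, 2.0]
target = [0.0, 0.0]
for _ in range(3):
    target = momentum_update(target, online, m=0.5)
```

A coefficient close to 1 makes the target encoder change slowly, which is the usual motivation for using it as a stable reference during training. The value 0.5 in the toy loop is chosen only to make the drift visible in a few steps.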