Contrastive Positive Sample Propagation along the Audio-Visual Event Line
- URL: http://arxiv.org/abs/2211.09980v1
- Date: Fri, 18 Nov 2022 01:55:45 GMT
- Title: Contrastive Positive Sample Propagation along the Audio-Visual Event Line
- Authors: Jinxing Zhou, Dan Guo, Meng Wang
- Abstract summary: Visual and audio signals often coexist in natural environments, forming audio-visual events (AVEs).
It is pivotal to learn discriminative features for each video segment.
We propose a new contrastive positive sample propagation (CPSP) method for better deep feature representation learning.
- Score: 24.007548531642716
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual and audio signals often coexist in natural environments, forming
audio-visual events (AVEs). Given a video, we aim to localize video segments
containing an AVE and identify its category. It is pivotal to learn
discriminative features for each video segment. Unlike existing work focusing
on audio-visual feature fusion, in this paper, we propose a new contrastive
positive sample propagation (CPSP) method for better deep feature
representation learning. The contribution of CPSP is to introduce the available
full or weak label as a prior that constructs the exact positive-negative
samples for contrastive learning. Specifically, CPSP involves comprehensive
contrastive constraints: pair-level positive sample propagation (PSP),
segment-level and video-level positive sample activation (PSA$_S$ and PSA$_V$).
Three new contrastive objectives (i.e., $\mathcal{L}_{\text{avpsp}}$,
$\mathcal{L}_\text{spsa}$, and $\mathcal{L}_\text{vpsa}$) are proposed and
introduced into both the fully and weakly supervised AVE localization settings.
To draw a complete picture of contrastive learning in AVE localization, we also
study self-supervised positive sample propagation (SSPSP). As a result, CPSP
yields refined audio-visual features that are distinguishable from the
negatives, thus benefiting the classifier prediction. Extensive experiments on the AVE and the
newly collected VGGSound-AVEL100k datasets verify the effectiveness and
generalization ability of our method.
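As a rough illustration of the core idea above (using the available label as a prior to decide which samples are positives and which are negatives for contrastive learning), here is a minimal label-aware contrastive loss sketch in PyTorch. It is not the paper's exact objective; the function name, temperature, and feature shapes are assumptions made for illustration only.

```python
import torch
import torch.nn.functional as F

def label_aware_contrastive_loss(features, labels, temperature=0.1):
    # Illustrative sketch, NOT the paper's exact objective.
    # features: (N, D) segment-level audio-visual features.
    # labels:   (N,)   event labels; segments sharing a label are treated
    #                  as positives for each other, all others as negatives.
    features = F.normalize(features, dim=1)      # compare in cosine space
    sim = features @ features.t() / temperature  # (N, N) similarity logits

    n = features.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=features.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask

    # Log-softmax over all other samples (denominator excludes self).
    sim = sim.masked_fill(self_mask, float("-inf"))
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)

    # Maximize the average log-likelihood of the label-defined positives.
    pos_log_prob = log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1)
    loss = -pos_log_prob / pos_mask.sum(dim=1).clamp(min=1)
    return loss[pos_mask.any(dim=1)].mean()  # skip anchors with no positives
```

In the weakly supervised setting only a video-level label is available, so such a mask would be built per video rather than per segment, matching the video-level granularity at which the paper's PSA$_V$ objective operates.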
Related papers
- Enhancing Sound Source Localization via False Negative Elimination [58.87973081084927]
Sound source localization aims to localize objects emitting sound in visual scenes.
Recent works that obtain impressive results typically rely on contrastive learning.
We propose a novel audio-visual learning framework which is instantiated with two individual learning schemes.
arXiv Detail & Related papers (2024-08-29T11:24:51Z)
- Hyperbolic Audio-visual Zero-shot Learning [47.66672509746274]
An analysis of the audio-visual data reveals a large degree of hyperbolicity, indicating the potential benefit of using a hyperbolic transformation to achieve curvature-aware geometric learning.
The proposed approach employs a novel loss function that incorporates cross-modality alignment between video and audio features in the hyperbolic space.
arXiv Detail & Related papers (2023-08-24T04:52:32Z)
- Learning Audio-Visual Source Localization via False Negative Aware Contrastive Learning [39.890616126301204]
We propose a new learning strategy named False Negative Aware Contrastive (FNAC) to mitigate the problem of false negative samples misleading the training.
FNAC achieves state-of-the-art performances on Flickr-SoundNet, VGG-Sound, and AVSBench.
arXiv Detail & Related papers (2023-03-20T17:41:11Z)
- SLICER: Learning universal audio representations using low-resource self-supervised pre-training [53.06337011259031]
We present a new Self-Supervised Learning approach to pre-train encoders on unlabeled audio data.
Our primary aim is to learn audio representations that can generalize across a large variety of speech and non-speech tasks.
arXiv Detail & Related papers (2022-11-02T23:45:33Z)
- Self-Supervised Predictive Learning: A Negative-Free Method for Sound Source Localization in Visual Scenes [91.59435809457659]
Self-Supervised Predictive Learning (SSPL) is a negative-free method for sound localization via explicit positive mining.
SSPL achieves significant improvements of 8.6% cIoU and 3.4% AUC on SoundNet-Flickr compared to the previous best.
arXiv Detail & Related papers (2022-03-25T01:42:42Z)
- Learning Sound Localization Better From Semantically Similar Samples [79.47083330766002]
Existing audio-visual works employ contrastive learning that treats corresponding audio-visual pairs from the same source as positives and randomly mismatched pairs as negatives (this baseline is sketched in code after this list).
Our key contribution is showing that hard positives can yield response maps similar to those of the corresponding pairs.
We demonstrate the effectiveness of our approach on VGG-SS and SoundNet-Flickr test sets.
arXiv Detail & Related papers (2022-02-07T08:53:55Z)
- Positive Sample Propagation along the Audio-Visual Event Line [29.25572713908162]
Visual and audio signals often coexist in natural environments, forming audio-visual events (AVEs).
We propose a new positive sample propagation (PSP) module to discover and exploit closely related audio-visual pairs.
We perform extensive experiments on the public AVE dataset and achieve new state-of-the-art accuracy in both fully and weakly supervised settings.
arXiv Detail & Related papers (2021-04-01T03:53:57Z)
- Learning Representations from Audio-Visual Spatial Alignment [76.29670751012198]
We introduce a novel self-supervised pretext task for learning representations from audio-visual content.
The advantages of the proposed pretext task are demonstrated on a variety of audio and visual downstream tasks.
arXiv Detail & Related papers (2020-11-03T16:20:04Z)
- Audio-Visual Instance Discrimination with Cross-Modal Agreement [90.95132499006498]
We present a self-supervised learning approach to learn audio-visual representations from video and audio.
We show that optimizing for cross-modal discrimination, rather than within-modal discrimination, is important to learn good representations from video and audio.
arXiv Detail & Related papers (2020-04-27T16:59:49Z)
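Several of the localization papers above start from the standard audio-visual contrastive setup described in the "Learning Sound Localization Better From Semantically Similar Samples" summary: matched audio-visual pairs are positives, and randomly mismatched pairs within the batch serve as negatives. A minimal sketch of that baseline (a symmetric InfoNCE loss; the function name and temperature value are assumptions) is:

```python
import torch
import torch.nn.functional as F

def audio_visual_infonce(audio_emb, visual_emb, temperature=0.07):
    # Baseline setup: the i-th audio and i-th visual embedding come from
    # the same clip (positive); every mismatched (i, j) pairing in the
    # batch acts as a negative.
    a = F.normalize(audio_emb, dim=1)   # (B, D)
    v = F.normalize(visual_emb, dim=1)  # (B, D)
    logits = a @ v.t() / temperature    # (B, B) pairwise similarities

    targets = torch.arange(a.size(0), device=a.device)  # diagonal = positives
    # Symmetric loss: audio-to-visual and visual-to-audio retrieval.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

Broadly, the false-negative-aware methods above (FNAC, false negative elimination) modify this recipe by identifying mismatched pairs that are nonetheless semantically similar and suppressing their contribution as negatives.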