Positive Sample Propagation along the Audio-Visual Event Line
- URL: http://arxiv.org/abs/2104.00239v2
- Date: Mon, 5 Apr 2021 07:28:13 GMT
- Title: Positive Sample Propagation along the Audio-Visual Event Line
- Authors: Jinxing Zhou, Liang Zheng, Yiran Zhong, Shijie Hao, Meng Wang
- Abstract summary: Visual and audio signals often coexist in natural environments, forming audio-visual events (AVEs)
We propose a new positive sample propagation (PSP) module to discover and exploit closely related audio-visual pairs.
We perform extensive experiments on the public AVE dataset and achieve new state-of-the-art accuracy in both fully and weakly supervised settings.
- Score: 29.25572713908162
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual and audio signals often coexist in natural environments, forming
audio-visual events (AVEs). Given a video, we aim to localize video segments
containing an AVE and identify its category. In order to learn discriminative
features for a classifier, it is pivotal to identify the helpful (or positive)
audio-visual segment pairs while filtering out the irrelevant ones, regardless
whether they are synchronized or not. To this end, we propose a new positive
sample propagation (PSP) module to discover and exploit the closely related
audio-visual pairs by evaluating the relationship within every possible pair.
It can be done by constructing an all-pair similarity map between each audio
and visual segment, and only aggregating the features from the pairs with high
similarity scores. To encourage the network to extract high correlated features
for positive samples, a new audio-visual pair similarity loss is proposed. We
also propose a new weighting branch to better exploit the temporal correlations
in weakly supervised setting. We perform extensive experiments on the public
AVE dataset and achieve new state-of-the-art accuracy in both fully and weakly
supervised settings, thus verifying the effectiveness of our method.
Related papers
- Towards Open-Vocabulary Audio-Visual Event Localization [59.23161248808759]
We introduce the Open-Vocabulary Audio-Visual Event localization problem.
This problem requires localizing audio-visual events and predicting explicit categories for both seen and unseen data at inference.
We propose the OV-AVEBench dataset, comprising 24,800 videos across 67 real-life audio-visual scenes.
arXiv Detail & Related papers (2024-11-18T04:35:20Z) - Label-anticipated Event Disentanglement for Audio-Visual Video Parsing [61.08434062821899]
We introduce a new decoding paradigm, underlinelabel sunderlineemunderlineantic-based underlineprojection (LEAP)
LEAP works by iteratively projecting encoded latent features of audio/visual segments onto semantically independent label embeddings.
To facilitate the LEAP paradigm, we propose a semantic-aware optimization strategy, which includes a novel audio-visual semantic similarity loss function.
arXiv Detail & Related papers (2024-07-11T01:57:08Z) - Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language [77.33458847943528]
We present DenseAV, a novel dual encoder grounding architecture that learns high-resolution, semantically meaningful, and audio-visually aligned features solely through watching videos.
We show that DenseAV can discover the meaning'' of words and the location'' of sounds without explicit localization supervision.
arXiv Detail & Related papers (2024-06-09T03:38:21Z) - Progressive Confident Masking Attention Network for Audio-Visual Segmentation [8.591836399688052]
A challenging problem known as Audio-Visual has emerged, intending to produce segmentation maps for sounding objects within a scene.
We introduce a novel Progressive Confident Masking Attention Network (PMCANet)
It leverages attention mechanisms to uncover the intrinsic correlations between audio signals and visual frames.
arXiv Detail & Related papers (2024-06-04T14:21:41Z) - CM-PIE: Cross-modal perception for interactive-enhanced audio-visual
video parsing [23.85763377992709]
We propose a novel interactive-enhanced cross-modal perception method(CM-PIE), which can learn fine-grained features by applying a segment-based attention module.
We show that our model offers improved parsing performance on the Look, Listen, and Parse dataset.
arXiv Detail & Related papers (2023-10-11T14:15:25Z) - Visually-Guided Sound Source Separation with Audio-Visual Predictive
Coding [57.08832099075793]
Visually-guided sound source separation consists of three parts: visual feature extraction, multimodal feature fusion, and sound signal processing.
This paper presents audio-visual predictive coding (AVPC) to tackle this task in parameter harmonizing and more effective manner.
In addition, we develop a valid self-supervised learning strategy for AVPC via co-predicting two audio-visual representations of the same sound source.
arXiv Detail & Related papers (2023-06-19T03:10:57Z) - Unraveling Instance Associations: A Closer Look for Audio-Visual Segmentation [18.001730255429347]
Audio-visual segmentation (AVS) is a challenging task that involves accurately segmenting sounding objects based on audio-visual cues.
We propose a new cost-effective strategy to build challenging and relatively unbiased high-quality audio-visual segmentation benchmarks.
Experiments conducted on existing AVS datasets and on our new benchmark show that our method achieves state-of-the-art (SOTA) segmentation accuracy.
arXiv Detail & Related papers (2023-04-06T09:54:06Z) - Contrastive Positive Sample Propagation along the Audio-Visual Event
Line [24.007548531642716]
Visual and audio signals often coexist in natural environments, forming audio-visual events (AVEs)
It is pivotal to learn the discriminative features for each video segment.
We propose a new contrastive positive sample propagation (CPSP) method for better deep feature representation learning.
arXiv Detail & Related papers (2022-11-18T01:55:45Z) - Learning Sound Localization Better From Semantically Similar Samples [79.47083330766002]
Existing audio-visual works employ contrastive learning by assigning corresponding audio-visual pairs from the same source as positives while randomly mismatched pairs as negatives.
Our key contribution is showing that hard positives can give similar response maps to the corresponding pairs.
We demonstrate the effectiveness of our approach on VGG-SS and SoundNet-Flickr test sets.
arXiv Detail & Related papers (2022-02-07T08:53:55Z) - Unsupervised Sound Localization via Iterative Contrastive Learning [106.56167882750792]
We propose an iterative contrastive learning framework that requires no data annotations.
We then use the pseudo-labels to learn the correlation between the visual and audio signals sampled from the same video.
Our iterative strategy gradually encourages the localization of the sounding objects and reduces the correlation between the non-sounding regions and the reference audio.
arXiv Detail & Related papers (2021-04-01T07:48:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.