Modality-Independent Teachers Meet Weakly-Supervised Audio-Visual Event
Parser
- URL: http://arxiv.org/abs/2305.17343v2
- Date: Mon, 2 Oct 2023 08:34:54 GMT
- Title: Modality-Independent Teachers Meet Weakly-Supervised Audio-Visual Event
Parser
- Authors: Yung-Hsuan Lai, Yen-Chun Chen, Yu-Chiang Frank Wang
- Abstract summary: We investigate the under-explored unaligned setting, where the goal is to recognize audio and visual events in a video with only weak labels observed.
To enhance learning in this challenging setting, we incorporate large-scale contrastively pre-trained models as the modality teachers.
A simple, effective, and generic method, termed Visual-Audio Label Elaboration (VALOR), is introduced to harvest modality labels for the training events.
- Score: 34.19935635508947
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio-visual learning has been a major pillar of multi-modal machine
learning, where the community mostly focused on its modality-aligned setting,
i.e., the audio and visual modalities are both assumed to signal the prediction
target. With the Look, Listen, and Parse dataset (LLP), we investigate the
under-explored unaligned setting, where the goal is to recognize audio and
visual events in a video with only weak labels observed. Such weak video-level
labels only tell what events happen without knowing the modality they are
perceived (audio, visual, or both). To enhance learning in this challenging
setting, we incorporate large-scale contrastively pre-trained models as the
modality teachers. A simple, effective, and generic method, termed Visual-Audio
Label Elaboration (VALOR), is introduced to harvest modality labels for the
training events. Empirical studies show that the harvested labels significantly
improve an attentional baseline by 8.0 in average F-score (Type@AV).
Surprisingly, we found that modality-independent teachers outperform their
modality-fused counterparts, as they are robust to noise from the other,
potentially unaligned modality. Moreover, our best model achieves the new
state-of-the-art on all metrics of LLP by a substantial margin (+5.4 F-score
for Type@AV). VALOR is further generalized to Audio-Visual Event Localization
and achieves the new state-of-the-art as well. Code is available at:
https://github.com/Franklin905/VALOR.
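
The abstract describes VALOR only at a high level. As a rough illustration of the core idea, harvesting segment-level modality labels from weak video-level labels with contrastive teachers, the sketch below scores each segment embedding against text embeddings of the event classes and keeps only classes named in the weak label. The embeddings, the threshold value, and the helper name harvest_modality_labels are illustrative assumptions, not the released implementation.

# A minimal, hypothetical sketch of harvesting segment-level modality labels
# from weak video-level labels with a contrastive modality teacher
# (e.g., a CLIP-style visual teacher or a CLAP-style audio teacher).
# Teacher embeddings are assumed to be precomputed; the threshold is
# illustrative, not a value taken from the paper.
import numpy as np

def harvest_modality_labels(segment_embeds, class_text_embeds, weak_label_ids, threshold=0.25):
    """
    segment_embeds:    (T, D) array, one teacher embedding per video segment.
    class_text_embeds: (C, D) array, one teacher text embedding per event class.
    weak_label_ids:    class indices present in the video-level weak label.
    Returns a (T, C) binary array of segment-level pseudo labels for this modality.
    """
    # Cosine similarity between every segment and every class prompt.
    seg = segment_embeds / np.linalg.norm(segment_embeds, axis=1, keepdims=True)
    txt = class_text_embeds / np.linalg.norm(class_text_embeds, axis=1, keepdims=True)
    sim = seg @ txt.T                                  # (T, C)

    # Only classes named in the weak video-level label may be activated;
    # everything else stays zero, mirroring the weakly supervised setting.
    pseudo = np.zeros_like(sim, dtype=np.int64)
    for c in weak_label_ids:
        pseudo[:, c] = (sim[:, c] >= threshold).astype(np.int64)
    return pseudo

# Toy usage: 10 one-second segments, 25 event classes, weak label {3, 7}.
rng = np.random.default_rng(0)
visual_pseudo = harvest_modality_labels(
    segment_embeds=rng.normal(size=(10, 512)),
    class_text_embeds=rng.normal(size=(25, 512)),
    weak_label_ids=[3, 7],
)
print(visual_pseudo.shape)  # (10, 25)

Running the same procedure separately with a visual teacher and an audio teacher would yield independent visual and audio pseudo labels, which is the modality-independent aspect the abstract contrasts with modality-fused teachers.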
Related papers
- Towards Open-Vocabulary Audio-Visual Event Localization [59.23161248808759]
We introduce the Open-Vocabulary Audio-Visual Event localization problem.
This problem requires localizing audio-visual events and predicting explicit categories for both seen and unseen data at inference.
We propose the OV-AVEBench dataset, comprising 24,800 videos across 67 real-life audio-visual scenes.
arXiv Detail & Related papers (2024-11-18T04:35:20Z) - Label-anticipated Event Disentanglement for Audio-Visual Video Parsing [61.08434062821899]
We introduce a new decoding paradigm, label semantic-based projection (LEAP).
LEAP works by iteratively projecting encoded latent features of audio/visual segments onto semantically independent label embeddings.
To facilitate the LEAP paradigm, we propose a semantic-aware optimization strategy, which includes a novel audio-visual semantic similarity loss function.
arXiv Detail & Related papers (2024-07-11T01:57:08Z) - CoLeaF: A Contrastive-Collaborative Learning Framework for Weakly Supervised Audio-Visual Video Parsing [23.100602876056165]
Weakly supervised audio-visual video parsing methods aim to detect audible-only, visible-only, and audible-visible events using only video-level labels.
We propose CoLeaF, a novel learning framework that optimizes the integration of cross-modal context in the embedding space.
Our experiments demonstrate that CoLeaF significantly improves the state-of-the-art results by an average of 1.9% and 2.4% F-score on the datasets.
arXiv Detail & Related papers (2024-05-17T10:51:15Z) - EquiAV: Leveraging Equivariance for Audio-Visual Contrastive Learning [36.012107899738524]
We introduce EquiAV, a novel framework that leverages equivariance for audio-visual contrastive learning.
Our approach begins with extending equivariance to audio-visual learning, facilitated by a shared attention-based transformation predictor.
It enables the aggregation of features from diverse augmentations into a representative embedding, providing robust supervision.
arXiv Detail & Related papers (2024-03-14T15:44:19Z) - Text-to-feature diffusion for audio-visual few-shot learning [59.45164042078649]
Few-shot learning from video data is a challenging and underexplored, yet much cheaper, setup.
We introduce a unified audio-visual few-shot video classification benchmark on three datasets.
We show that AV-DIFF obtains state-of-the-art performance on our proposed benchmark for audio-visual few-shot learning.
arXiv Detail & Related papers (2023-09-07T17:30:36Z) - Unraveling Instance Associations: A Closer Look for Audio-Visual Segmentation [18.001730255429347]
Audio-visual segmentation (AVS) is a challenging task that involves accurately segmenting sounding objects based on audio-visual cues.
We propose a new cost-effective strategy to build challenging and relatively unbiased high-quality audio-visual segmentation benchmarks.
Experiments conducted on existing AVS datasets and on our new benchmark show that our method achieves state-of-the-art (SOTA) segmentation accuracy.
arXiv Detail & Related papers (2023-04-06T09:54:06Z) - SLICER: Learning universal audio representations using low-resource
self-supervised pre-training [53.06337011259031]
We present a new Self-Supervised Learning approach to pre-train encoders on unlabeled audio data.
Our primary aim is to learn audio representations that can generalize across a large variety of speech and non-speech tasks.
arXiv Detail & Related papers (2022-11-02T23:45:33Z) - Audio-visual Generalised Zero-shot Learning with Cross-modal Attention
and Language [38.02396786726476]
We propose to learn multi-modal representations from audio-visual data using cross-modal attention.
In our generalised audio-visual zero-shot learning setting, we include all the training classes in the test-time search space.
Due to the lack of a unified benchmark in this domain, we introduce a (generalised) zero-shot learning benchmark on three audio-visual datasets.
arXiv Detail & Related papers (2022-03-07T18:52:13Z) - MERLOT Reserve: Neural Script Knowledge through Vision and Language and
Sound [90.1857707251566]
We introduce MERLOT Reserve, a model that represents videos jointly over time.
We replace snippets of text and audio with a MASK token; the model learns by choosing the correct masked-out snippet.
Our objective learns faster than alternatives, and performs well at scale.
arXiv Detail & Related papers (2022-01-07T19:00:21Z) - LiRA: Learning Visual Speech Representations from Audio through
Self-supervision [53.18768477520411]
We propose Learning visual speech Representations from Audio via self-supervision (LiRA).
Specifically, we train a ResNet+Conformer model to predict acoustic features from unlabelled visual speech.
We show that our approach significantly outperforms other self-supervised methods on the Lip Reading in the Wild dataset.
arXiv Detail & Related papers (2021-06-16T23:20:06Z) - Positive Sample Propagation along the Audio-Visual Event Line [29.25572713908162]
Visual and audio signals often coexist in natural environments, forming audio-visual events (AVEs).
We propose a new positive sample propagation (PSP) module to discover and exploit closely related audio-visual pairs.
We perform extensive experiments on the public AVE dataset and achieve new state-of-the-art accuracy in both fully and weakly supervised settings.
arXiv Detail & Related papers (2021-04-01T03:53:57Z)