Related papers: CoLeaF: A Contrastive-Collaborative Learning Framework for Weakly Supervised Audio-Visual Video Parsing

CoLeaF: A Contrastive-Collaborative Learning Framework for Weakly Supervised Audio-Visual Video Parsing

URL: http://arxiv.org/abs/2405.10690v4
Date: Mon, 15 Jul 2024 09:31:54 GMT
Title: CoLeaF: A Contrastive-Collaborative Learning Framework for Weakly Supervised Audio-Visual Video Parsing
Authors: Faegheh Sardari, Armin Mustafa, Philip J. B. Jackson, Adrian Hilton,
Abstract summary: Weakly supervised audio-visual video parsing methods aim to detect audible-only, visible-only, and audible-visible events using only video-level labels. We propose CoLeaF, a novel learning framework that optimize the integration of cross-modal context in the embedding space. Our experiments demonstrate that CoLeaF significantly improves the state-of-the-art results by an average of 1.9% and 2.4% F-score on the datasets.
Score: 23.100602876056165
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Weakly supervised audio-visual video parsing (AVVP) methods aim to detect audible-only, visible-only, and audible-visible events using only video-level labels. Existing approaches tackle this by leveraging unimodal and cross-modal contexts. However, we argue that while cross-modal learning is beneficial for detecting audible-visible events, in the weakly supervised scenario, it negatively impacts unaligned audible or visible events by introducing irrelevant modality information. In this paper, we propose CoLeaF, a novel learning framework that optimizes the integration of cross-modal context in the embedding space such that the network explicitly learns to combine cross-modal information for audible-visible events while filtering them out for unaligned events. Additionally, as videos often involve complex class relationships, modelling them improves performance. However, this introduces extra computational costs into the network. Our framework is designed to leverage cross-class relationships during training without incurring additional computations at inference. Furthermore, we propose new metrics to better evaluate a method's capabilities in performing AVVP. Our extensive experiments demonstrate that CoLeaF significantly improves the state-of-the-art results by an average of 1.9% and 2.4% F-score on the LLP and UnAV-100 datasets, respectively.

Related papers

Implicit Counterfactual Learning for Audio-Visual Segmentation [50.69377287012591]
We propose the implicit counterfactual framework (ICF) to achieve unbiased cross-modal understanding.<n>Due to the lack of semantics, heterogeneous representations may lead to erroneous matches.<n>We introduce the multi-granularity implicit text (MIT) involving video-, segment- and frame-level as the bridge to establish the modality-shared space.
arXiv Detail & Related papers (2025-07-28T11:46:35Z)
Adapting to the Unknown: Training-Free Audio-Visual Event Perception with Dynamic Thresholds [72.83227312675174]
We present a model-agnostic approach to the domain of audio-visual event perception. Our approach includes a score-level fusion technique to retain richer multimodal interactions. We also present the first training-free, open-vocabulary baseline for audio-visual event perception.
arXiv Detail & Related papers (2025-03-17T20:06:48Z)
Towards Open-Vocabulary Audio-Visual Event Localization [59.23161248808759]
We introduce the Open-Vocabulary Audio-Visual Event localization problem. This problem requires localizing audio-visual events and predicting explicit categories for both seen and unseen data at inference. We propose the OV-AVEBench dataset, comprising 24,800 videos across 67 real-life audio-visual scenes.
arXiv Detail & Related papers (2024-11-18T04:35:20Z)
CACE-Net: Co-guidance Attention and Contrastive Enhancement for Effective Audio-Visual Event Localization [11.525177542345215]
We introduce CACE-Net, which differs from most existing methods that solely use audio signals to guide visual information. We propose an audio-visual co-guidance attention mechanism that allows for adaptive bi-directional cross-modal attentional guidance. Experiments on the AVE dataset demonstrate that CACE-Net sets a new benchmark in the audio-visual event localization task.
arXiv Detail & Related papers (2024-08-04T07:48:12Z)
Rethink Cross-Modal Fusion in Weakly-Supervised Audio-Visual Video Parsing [58.9467115916639]
We propose a messenger-guided mid-fusion transformer to reduce the uncorrelated cross-modal context in the fusion. The messengers condense the full cross-modal context into a compact representation to only preserve useful cross-modal information. We thus propose cross-audio prediction consistency to suppress the impact of irrelevant audio information on visual event prediction.
arXiv Detail & Related papers (2023-11-14T13:27:03Z)
Leveraging Foundation models for Unsupervised Audio-Visual Segmentation [49.94366155560371]
Audio-Visual (AVS) aims to precisely outline audible objects in a visual scene at the pixel level. Existing AVS methods require fine-grained annotations of audio-mask pairs in supervised learning fashion. We introduce unsupervised audio-visual segmentation with no need for task-specific data annotations and model training.
arXiv Detail & Related papers (2023-09-13T05:05:47Z)
Text-to-feature diffusion for audio-visual few-shot learning [59.45164042078649]
Few-shot learning from video data is a challenging and underexplored, yet much cheaper, setup. We introduce a unified audio-visual few-shot video classification benchmark on three datasets. We show that AV-DIFF obtains state-of-the-art performance on our proposed benchmark for audio-visual few-shot learning.
arXiv Detail & Related papers (2023-09-07T17:30:36Z)
Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework. First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes. Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
Modality-Independent Teachers Meet Weakly-Supervised Audio-Visual Event Parser [34.19935635508947]
We investigate the under-explored unaligned setting, where the goal is to recognize audio and visual events in a video with only weak labels observed. To enhance learning in this challenging setting, we incorporate large-scale contrastively pre-trained models as the modality teachers. A simple, effective, and generic method, termed Visual-Audio Label Elaboration (VALOR), is innovated to harvest modality labels for the training events.
arXiv Detail & Related papers (2023-05-27T02:57:39Z)
Unraveling Instance Associations: A Closer Look for Audio-Visual Segmentation [18.001730255429347]
Audio-visual segmentation (AVS) is a challenging task that involves accurately segmenting sounding objects based on audio-visual cues. We propose a new cost-effective strategy to build challenging and relatively unbiased high-quality audio-visual segmentation benchmarks. Experiments conducted on existing AVS datasets and on our new benchmark show that our method achieves state-of-the-art (SOTA) segmentation accuracy.
arXiv Detail & Related papers (2023-04-06T09:54:06Z)
Audio-visual Generalised Zero-shot Learning with Cross-modal Attention and Language [38.02396786726476]
We propose to learn multi-modal representations from audio-visual data using cross-modal attention. In our generalised audio-visual zero-shot learning setting, we include all the training classes in the test-time search space. Due to the lack of a unified benchmark in this domain, we introduce a (generalised) zero-shot learning benchmark on three audio-visual datasets.
arXiv Detail & Related papers (2022-03-07T18:52:13Z)
LiRA: Learning Visual Speech Representations from Audio through Self-supervision [53.18768477520411]
We propose Learning visual speech Representations from Audio via self-supervision (LiRA) Specifically, we train a ResNet+Conformer model to predict acoustic features from unlabelled visual speech. We show that our approach significantly outperforms other self-supervised methods on the Lip Reading in the Wild dataset.
arXiv Detail & Related papers (2021-06-16T23:20:06Z)
Cross-Modal learning for Audio-Visual Video Parsing [30.331280948237428]
We present a novel approach to the audio-visual video parsing (AVVP) task that demarcates events from a video separately for audio and visual modalities. We show how AVVP can benefit from the following techniques geared towards effective cross-modal learning.
arXiv Detail & Related papers (2021-04-03T07:07:21Z)

This list is automatically generated from the titles and abstracts of the papers in this site.