UWAV: Uncertainty-weighted Weakly-supervised Audio-Visual Video Parsing
- URL: http://arxiv.org/abs/2505.09615v1
- Date: Wed, 14 May 2025 17:59:55 GMT
- Title: UWAV: Uncertainty-weighted Weakly-supervised Audio-Visual Video Parsing
- Authors: Yung-Hsuan Lai, Janek Ebbers, Yu-Chiang Frank Wang, François Germain, Michael Jeffrey Jones, Moitreya Chatterjee
- Abstract summary: This work proposes a novel approach towards overcoming these weaknesses called Uncertainty-weighted Weakly-supervised Audio-visual Video Parsing (UWAV). Our innovative approach factors in the uncertainty associated with these estimated pseudo-labels and incorporates a feature mixup based training regularization for improved training. Empirical results show that UWAV outperforms state-of-the-art methods for the AVVP task on multiple metrics, across two different datasets, attesting to its effectiveness and generalizability.
- Score: 27.60266755835337
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio-Visual Video Parsing (AVVP) entails the challenging task of localizing both uni-modal events (i.e., those occurring exclusively in either the visual or acoustic modality of a video) and multi-modal events (i.e., those occurring in both modalities concurrently). Moreover, the prohibitive cost of annotating training data with the class labels of all these events, along with their start and end times, imposes constraints on the scalability of AVVP techniques unless they can be trained in a weakly-supervised setting, where only modality-agnostic, video-level labels are available in the training data. To this end, recently proposed approaches seek to generate segment-level pseudo-labels to better guide model training. However, the absence of inter-segment dependencies when generating these pseudo-labels and the general bias towards predicting labels that are absent in a segment limit their performance. This work proposes a novel approach towards overcoming these weaknesses called Uncertainty-weighted Weakly-supervised Audio-visual Video Parsing (UWAV). Additionally, our innovative approach factors in the uncertainty associated with these estimated pseudo-labels and incorporates a feature mixup based training regularization for improved training. Empirical results show that UWAV outperforms state-of-the-art methods for the AVVP task on multiple metrics, across two different datasets, attesting to its effectiveness and generalizability.
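The abstract names two training ingredients, uncertainty-weighted pseudo-labels and feature-mixup regularization, without spelling out their formulation here. The PyTorch sketch below is only a generic illustration of those two ideas under assumed conventions (tensor shapes, an `uncertainty` score in [0, 1] where higher means less reliable, a Beta-sampled mixing coefficient); it is not UWAV's actual implementation.

```python
import torch
import torch.nn.functional as F

def uncertainty_weighted_bce(logits, pseudo_labels, uncertainty):
    """Per-segment multi-label BCE, down-weighted where pseudo-labels are uncertain.

    logits, pseudo_labels, uncertainty: (batch, segments, classes) tensors.
    `uncertainty` in [0, 1]; higher = less reliable pseudo-label (assumed convention).
    """
    per_elem = F.binary_cross_entropy_with_logits(logits, pseudo_labels, reduction="none")
    weights = 1.0 - uncertainty  # confident pseudo-labels contribute more to the loss
    return (weights * per_elem).sum() / weights.sum().clamp_min(1e-8)

def feature_mixup(features, targets, alpha=0.5):
    """Mix segment features and their (pseudo-)targets across the batch, as in standard mixup.

    features: (batch, segments, dim), targets: (batch, segments, classes).
    Returns convex combinations with a Beta(alpha, alpha)-sampled coefficient.
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(features.size(0))
    mixed_feats = lam * features + (1.0 - lam) * features[perm]
    mixed_targets = lam * targets + (1.0 - lam) * targets[perm]
    return mixed_feats, mixed_targets
```

In such a setup, the mixed features would be scored by the same parsing head against the mixed targets using the same weighted loss, giving the regularization effect the abstract alludes to.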
Related papers
- Adapting to the Unknown: Training-Free Audio-Visual Event Perception with Dynamic Thresholds [72.83227312675174]
We present a model-agnostic approach to the domain of audio-visual event perception. Our approach includes a score-level fusion technique to retain richer multimodal interactions. We also present the first training-free, open-vocabulary baseline for audio-visual event perception.
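Score-level (late) fusion, in its generic form, combines per-class event scores from the audio and visual branches after each branch has produced its own prediction. A minimal sketch under assumed names and a placeholder 0.5 weight (not this paper's actual scheme):

```python
import torch

def score_level_fusion(audio_logits, visual_logits, w_audio=0.5):
    """Late fusion of per-class event scores from two modality branches.

    audio_logits, visual_logits: (batch, classes). Returns a weighted average of
    per-class probabilities; the weight is illustrative only.
    """
    p_audio = torch.sigmoid(audio_logits)
    p_visual = torch.sigmoid(visual_logits)
    return w_audio * p_audio + (1.0 - w_audio) * p_visual
```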
arXiv Detail & Related papers (2025-03-17T20:06:48Z)
- LINK: Adaptive Modality Interaction for Audio-Visual Video Parsing [26.2873961811614]
We introduce a Learning Interaction method for Non-aligned Knowledge (LINK). LINK equilibrates the contributions of distinct modalities by dynamically adjusting their input during event prediction. We leverage the semantic information of pseudo-labels as a priori knowledge to mitigate noise from other modalities.
arXiv Detail & Related papers (2024-12-30T11:23:15Z)
- Reinforced Label Denoising for Weakly-Supervised Audio-Visual Video Parsing [2.918198001105141]
We present a novel joint reinforcement learning-based label denoising approach (RLLD). This approach enables simultaneous training of both label denoising and video parsing models. We introduce a novel AVVP-validation and soft inter-reward feedback mechanism that directly guides the learning of the label denoising policy.
arXiv Detail & Related papers (2024-12-27T10:05:56Z)
- Towards Open-Vocabulary Audio-Visual Event Localization [59.23161248808759]
We introduce the Open-Vocabulary Audio-Visual Event Localization problem. This problem requires localizing audio-visual events and predicting explicit categories for both seen and unseen data at inference. We propose the OV-AVEBench dataset, comprising 24,800 videos across 67 real-life audio-visual scenes.
arXiv Detail & Related papers (2024-11-18T04:35:20Z)
- Weakly Supervised Video Individual Counting [126.75545291243142]
Video Individual Counting aims to predict the number of unique individuals in a single video.
We introduce a weakly supervised VIC task, wherein trajectory labels are not provided.
In doing so, we devise an end-to-end trainable soft contrastive loss to drive the network to distinguish inflow, outflow, and the remaining individuals.
arXiv Detail & Related papers (2023-12-10T16:12:13Z)
- Leveraging Foundation models for Unsupervised Audio-Visual Segmentation [49.94366155560371]
Audio-Visual Segmentation (AVS) aims to precisely outline audible objects in a visual scene at the pixel level.
Existing AVS methods require fine-grained annotations of audio-mask pairs in a supervised learning fashion.
We introduce unsupervised audio-visual segmentation with no need for task-specific data annotations and model training.
arXiv Detail & Related papers (2023-09-13T05:05:47Z)
- Modality-Independent Teachers Meet Weakly-Supervised Audio-Visual Event Parser [34.19935635508947]
We investigate the under-explored unaligned setting, where the goal is to recognize audio and visual events in a video with only weak labels observed.
To enhance learning in this challenging setting, we incorporate large-scale contrastively pre-trained models as the modality teachers.
A simple, effective, and generic method, termed Visual-Audio Label Elaboration (VALOR), is introduced to harvest modality labels for the training events.
arXiv Detail & Related papers (2023-05-27T02:57:39Z)
- Joint-Modal Label Denoising for Weakly-Supervised Audio-Visual Video Parsing [52.2231419645482]
This paper focuses on the weakly-supervised audio-visual video parsing task.
It aims to recognize all events belonging to each modality and localize their temporal boundaries.
arXiv Detail & Related papers (2022-04-25T11:41:17Z)
- Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
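Cross-modality region contrastive learning is, generically, an InfoNCE-style objective that pulls matched region and text embeddings together and pushes mismatched pairs apart. A minimal sketch assuming row-aligned region/text pairs and an illustrative temperature (not this paper's exact loss):

```python
import torch
import torch.nn.functional as F

def cross_modal_info_nce(region_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE over matched region/text embedding pairs.

    region_feats, text_feats: (N, dim); row i of each is an assumed matched pair.
    """
    r = F.normalize(region_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    logits = r @ t.T / temperature  # (N, N) cosine-similarity matrix
    targets = torch.arange(r.size(0), device=r.device)  # diagonal entries are positives
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))
```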
arXiv Detail & Related papers (2021-09-24T07:20:13Z)