Reinforced Label Denoising for Weakly-Supervised Audio-Visual Video Parsing
- URL: http://arxiv.org/abs/2412.19563v1
- Date: Fri, 27 Dec 2024 10:05:56 GMT
- Title: Reinforced Label Denoising for Weakly-Supervised Audio-Visual Video Parsing
- Authors: Yongbiao Gao, Xiangcheng Sun, Guohua Lv, Deng Yu, Sijiu Niu
- Abstract summary: We present a novel joint reinforcement learning-based label denoising approach (RLLD). This approach enables simultaneous training of both label denoising and video parsing models. We introduce a novel AVVP-validation and soft inter-reward feedback mechanism that directly guides the learning of the label denoising policy.
- Score: 2.918198001105141
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Audio-visual video parsing (AVVP) aims to recognize audio and visual event labels with precise temporal boundaries, which is challenging because only overall video-level labels are available for training, while a single modality may contain only some of those events. Existing label denoising models often treat denoising as a separate preprocessing step, creating a disconnect between label denoising and the AVVP task. To bridge this gap, we present a novel joint reinforcement learning-based label denoising approach (RLLD), which enables simultaneous training of the label denoising and video parsing models through a joint optimization strategy. We introduce a novel AVVP-validation and soft inter-reward feedback mechanism that directly guides the learning of the label denoising policy. Extensive experiments on AVVP tasks demonstrate the superior performance of our method compared to previous label denoising techniques. Furthermore, incorporating our label denoising method into other AVVP models further enhances their parsing results.
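The abstract describes a label denoising policy whose keep/drop decisions are trained by a soft reward from AVVP validation rather than by a separate preprocessing objective. A minimal REINFORCE-style sketch of such a reward-guided update (the function names, the Bernoulli keep/drop parameterization, and the toy reward are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def reinforce_denoise_step(logits, validate, lr=0.1):
    """One REINFORCE update for a keep/drop label-denoising policy.

    logits   : per-label scores; a sigmoid gives each label's keep-probability.
    validate : callback returning a soft reward in [0, 1] for the kept-label
               mask (stands in for an AVVP validation score).
    """
    probs = 1.0 / (1.0 + np.exp(-logits))        # keep-probabilities
    actions = rng.random(probs.shape) < probs    # sample a keep/drop mask
    reward = validate(actions)                   # soft (not 0/1) feedback
    # gradient of the log-probability of the sampled mask w.r.t. the logits
    grad = np.where(actions, 1.0 - probs, -probs)
    return logits + lr * reward * grad, reward

# toy check: reward is the fraction of the first two labels kept,
# so their keep-logits should drift upward over repeated updates
logits = np.zeros(4)
for _ in range(200):
    logits, _ = reinforce_denoise_step(logits, validate=lambda a: a[:2].mean())
```

Because the reward is a scalar score rather than a hard correctness signal, the policy can be optimized jointly with the parser, which is the disconnect the paper says preprocessing-style denoising leaves open.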
Related papers
- UWAV: Uncertainty-weighted Weakly-supervised Audio-Visual Video Parsing [27.60266755835337]
This work proposes a novel approach towards overcoming these weaknesses, called Uncertainty-weighted Weakly-supervised Audio-visual Video Parsing (UWAV). Our innovative approach factors in the uncertainty associated with these estimated pseudo-labels and incorporates a feature-mixup-based training regularization for improved training. Empirical results show that UWAV outperforms state-of-the-art methods for the AVVP task on multiple metrics, across two different datasets, attesting to its effectiveness and generalizability.
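UWAV's two stated ingredients, down-weighting uncertain pseudo-labels and mixup regularization, can be combined in a single weighted loss. A minimal sketch under assumed shapes (the stand-in linear classifier and all names are hypothetical, not UWAV's architecture):

```python
import numpy as np

def uw_mixup_loss(feats, pseudo, uncert, rng, alpha=0.2):
    """Uncertainty-weighted BCE on mixed-up sample pairs (illustrative sketch).

    feats  : (N, D) segment features
    pseudo : (N,)  pseudo-labels in {0, 1}
    uncert : (N,)  per-pseudo-label uncertainty in [0, 1]
    """
    lam = rng.beta(alpha, alpha)                 # mixup coefficient
    perm = rng.permutation(len(feats))
    x = lam * feats + (1 - lam) * feats[perm]    # feature mixup
    y = lam * pseudo + (1 - lam) * pseudo[perm]  # mixed soft targets
    # confidence weight: uncertain pseudo-labels contribute less
    w = lam * (1 - uncert) + (1 - lam) * (1 - uncert[perm])
    p = 1 / (1 + np.exp(-x.mean(axis=1)))        # stand-in classifier output
    bce = -(y * np.log(p + 1e-8) + (1 - y) * np.log(1 - p + 1e-8))
    return float((w * bce).mean())

rng = np.random.default_rng(1)
loss = uw_mixup_loss(rng.standard_normal((8, 4)),
                     np.tile([0.0, 1.0], 4), np.full(8, 0.2), rng)
```

A pseudo-label with uncertainty 1 contributes nothing to the loss, so the model never overfits to labels the estimator itself distrusts.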
arXiv Detail & Related papers (2025-05-14T17:59:55Z)
- LINK: Adaptive Modality Interaction for Audio-Visual Video Parsing [26.2873961811614]
We introduce a Learning Interaction method for Non-aligned Knowledge (LINK).
LINK equilibrates the contributions of distinct modalities by dynamically adjusting their input during event prediction.
We leverage the semantic information of pseudo-labels as a priori knowledge to mitigate noise from other modalities.
arXiv Detail & Related papers (2024-12-30T11:23:15Z)
- Video Summarization using Denoising Diffusion Probabilistic Model [21.4190413531697]
We introduce a generative framework for video summarization that learns how to generate summaries from a probability distribution perspective. Specifically, we propose a novel diffusion summarization method based on the Denoising Diffusion Probabilistic Model (DDPM), which learns the probability distribution of training data through noise prediction. Our method is more resistant to subjective annotation noise, and is less prone to overfitting the training data than discriminative methods, with strong generalization ability.
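The noise-prediction objective the DDPM summary refers to is standard: noise the clean target with a known schedule, then train a model to recover the injected noise. A minimal sketch (the `eps_model` callback and schedule values are assumptions for illustration, not the paper's configuration):

```python
import numpy as np

def ddpm_loss(x0, eps_model, t, alphas_cum, rng):
    """Simplified DDPM noise-prediction objective.

    x0         : clean sample (e.g. a frame-importance score vector)
    eps_model  : callable predicting the noise from (noisy sample, timestep)
    alphas_cum : cumulative product of (1 - beta_t) over the noise schedule
    """
    eps = rng.standard_normal(x0.shape)           # target noise
    a = alphas_cum[t]
    x_t = np.sqrt(a) * x0 + np.sqrt(1 - a) * eps  # forward noising of x0
    return float(np.mean((eps_model(x_t, t) - eps) ** 2))

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 100)              # assumed linear schedule
alphas_cum = np.cumprod(1 - betas)
loss = ddpm_loss(rng.standard_normal(64),
                 lambda x_t, t: np.zeros_like(x_t), 50, alphas_cum, rng)
```

Because the target is the sampled Gaussian noise rather than a human summary label, per-annotator noise averages out, which is the robustness the abstract claims over discriminative training.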
arXiv Detail & Related papers (2024-12-11T13:02:09Z)
- Temporal As a Plugin: Unsupervised Video Denoising with Pre-Trained Image Denoisers [30.965705043127144]
In this paper, we propose a novel unsupervised video denoising framework, named 'Temporal As a Plugin' (TAP).
By incorporating temporal modules, our method can harness temporal information across noisy frames, complementing its power of spatial denoising.
Compared to other unsupervised video denoising methods, our framework demonstrates superior performance on both sRGB and raw video denoising datasets.
arXiv Detail & Related papers (2024-09-17T15:05:33Z)
- DiffSED: Sound Event Detection with Denoising Diffusion [70.18051526555512]
We reformulate the SED problem by taking a generative learning perspective.
Specifically, we aim to generate sound temporal boundaries from noisy proposals in a denoising diffusion process.
During training, our model learns to reverse the noising process by converting noisy latent queries to their ground-truth versions.
arXiv Detail & Related papers (2023-08-14T17:29:41Z)
- Neighborhood Collective Estimation for Noisy Label Identification and Correction [92.20697827784426]
Learning with noisy labels (LNL) aims at designing strategies to improve model performance and generalization by mitigating the effects of model overfitting to noisy labels.
Recent advances employ the predicted label distributions of individual samples to perform noise verification and noisy label correction, easily giving rise to confirmation bias.
We propose Neighborhood Collective Estimation, in which the predictive reliability of a candidate sample is re-estimated by contrasting it against its feature-space nearest neighbors.
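The neighborhood re-estimation idea above, trusting a candidate label to the extent its feature-space neighbors agree with it, can be sketched directly. A minimal version with brute-force distances (function name and the agreement score are illustrative assumptions, not the paper's exact estimator):

```python
import numpy as np

def neighborhood_reliability(feats, labels, k=5):
    """Re-estimate label reliability from feature-space neighbors.

    Instead of trusting the model's own (possibly confirmation-biased)
    prediction for a sample, score each candidate label by how often its
    k nearest neighbors carry the same label.
    """
    d = np.linalg.norm(feats[:, None] - feats[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)               # exclude each sample itself
    nn = np.argsort(d, axis=1)[:, :k]         # indices of k nearest neighbors
    agree = (labels[nn] == labels[:, None]).mean(axis=1)
    return agree                              # in [0, 1]; low => likely noisy

# two clusters; the third point sits in the label-0 cluster but carries label 1
feats = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
labels = np.array([0, 0, 1, 1, 1, 1])
rel = neighborhood_reliability(feats, labels, k=2)
```

The contrast with per-sample prediction-based verification is exactly the confirmation-bias point in the summary: the neighbors, not the possibly-wrong sample itself, cast the vote.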
arXiv Detail & Related papers (2022-08-05T14:47:22Z)
- Joint-Modal Label Denoising for Weakly-Supervised Audio-Visual Video Parsing [52.2231419645482]
This paper focuses on the weakly-supervised audio-visual video parsing task.
It aims to recognize all events belonging to each modality and localize their temporal boundaries.
arXiv Detail & Related papers (2022-04-25T11:41:17Z)
- IDR: Self-Supervised Image Denoising via Iterative Data Refinement [66.5510583957863]
We present a practical unsupervised image denoising method to achieve state-of-the-art denoising performance.
Our method only requires single noisy images and a noise model, which is easily accessible in practical raw image denoising.
To evaluate raw image denoising performance in real-world applications, we build a high-quality raw image dataset SenseNoise-500 that contains 500 real-life scenes.
arXiv Detail & Related papers (2021-11-29T07:22:53Z)
- Diffusion-Based Representation Learning [65.55681678004038]
We augment the denoising score matching framework to enable representation learning without any supervised signal.
The introduced diffusion-based representation learning instead relies on a new formulation of the denoising score matching objective.
Using the same approach, we propose to learn an infinite-dimensional latent code that achieves improvements of state-of-the-art models on semi-supervised image classification.
arXiv Detail & Related papers (2021-05-29T09:26:02Z)
- Unsupervised Sound Localization via Iterative Contrastive Learning [106.56167882750792]
We propose an iterative contrastive learning framework that requires no data annotations.
We then use the pseudo-labels to learn the correlation between the visual and audio signals sampled from the same video.
Our iterative strategy gradually encourages the localization of the sounding objects and reduces the correlation between the non-sounding regions and the reference audio.
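The loop described above, threshold a pseudo-label mask, then re-learn so that non-sounding regions lose correlation with the audio, can be caricatured on a single similarity map. A toy sketch (the damping factor and thresholding stand in for the paper's actual contrastive re-training; all names are assumptions):

```python
import numpy as np

def iterative_pseudo_labels(av_sim, rounds=3, thresh=0.5):
    """Iteratively refine sounding-region pseudo-labels (toy sketch).

    av_sim : (H, W) audio-visual similarity map for one frame.
    Each round thresholds the renormalised map into a pseudo-mask, then
    suppresses similarity outside the mask, mimicking the loop of
    pseudo-labelling followed by re-learning the audio-visual correlation.
    """
    sim = av_sim.astype(float).copy()
    mask = np.zeros_like(sim, dtype=bool)
    for _ in range(rounds):
        sim = (sim - sim.min()) / (np.ptp(sim) + 1e-8)  # renormalise to [0, 1]
        mask = sim > thresh                  # current pseudo sounding regions
        sim = np.where(mask, sim, 0.5 * sim) # damp non-sounding correlation
    return mask

# a single bright "sounding" pixel surrounded by silence
sim = np.zeros((3, 3))
sim[1, 1] = 1.0
mask = iterative_pseudo_labels(sim)
```

Each pass sharpens the separation: regions once excluded from the mask contribute less similarity in the next round, just as the summary says the strategy reduces correlation between non-sounding regions and the reference audio.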
arXiv Detail & Related papers (2021-04-01T07:48:29Z)
- Learning Model-Blind Temporal Denoisers without Ground Truths [46.778450578529814]
Denoisers trained with synthetic data often fail to cope with the diversity of unknown noises.
Previous image-based methods lead to noise overfitting if directly applied to video denoisers.
We propose a general framework for video denoising networks that successfully addresses these challenges.
arXiv Detail & Related papers (2020-07-07T07:19:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.