Improving Audio-Visual Video Parsing with Pseudo Visual Labels
- URL: http://arxiv.org/abs/2303.02344v1
- Date: Sat, 4 Mar 2023 07:21:37 GMT
- Title: Improving Audio-Visual Video Parsing with Pseudo Visual Labels
- Authors: Jinxing Zhou, Dan Guo, Yiran Zhong, Meng Wang
- Abstract summary: We propose a new strategy to generate segment-level pseudo labels for audio-visual video parsing.
A new loss function is proposed to regularize these labels by taking into account their category-richness and segment-richness.
A label denoising strategy is adopted to improve the pseudo labels by flipping them whenever a high forward binary cross-entropy loss occurs.
- Score: 33.25271156393651
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio-Visual Video Parsing is a task to predict the events that occur in
video segments for each modality. It is often performed in a weakly supervised
manner, where only video-level event labels are provided, i.e., the modalities and
the timestamps of the labels are unknown. Due to the lack of densely annotated
labels, recent work attempts to leverage pseudo labels to enrich the
supervision. A commonly used strategy is to generate pseudo labels by
categorizing the known event labels for each modality. However, the labels are
still limited to the video level, and the temporal boundaries of event
timestamps remain unlabeled. In this paper, we propose a new pseudo label
generation strategy that can explicitly assign labels to each video segment by
utilizing prior knowledge learned from the open world. Specifically, we exploit
the CLIP model to estimate the events in each video segment based on visual
modality to generate segment-level pseudo labels. A new loss function is
proposed to regularize these labels by taking into account their
category-richness and segment-richness. A label denoising strategy is adopted to
improve the pseudo labels by flipping them whenever a high forward binary
cross-entropy loss occurs. We perform extensive experiments on the LLP dataset and
demonstrate that our method can generate high-quality segment-level pseudo
labels with the help of our newly proposed loss and the label denoising
strategy. Our method achieves state-of-the-art audio-visual video parsing
performance.
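The pipeline described in the abstract can be sketched in two steps: (1) score each video segment against event-category text prompts with CLIP and threshold the similarities into segment-level pseudo labels, restricted to the categories known at the video level; (2) denoise by flipping any pseudo label whose forward binary cross-entropy loss under the current model is high. The sketch below is a minimal, hypothetical illustration of that idea, not the authors' implementation; the function names, the thresholds, and the use of precomputed similarity scores are assumptions.

```python
import numpy as np

def clip_pseudo_labels(similarity, video_labels, threshold=0.5):
    """Assign segment-level pseudo labels from CLIP similarity scores.

    similarity: (T, C) array of similarities between each of T segments
    and the text embedding of each of C event categories (assumed precomputed).
    video_labels: (C,) binary video-level labels; pseudo labels are
    restricted to the categories known to occur in the video.
    """
    labels = (similarity >= threshold).astype(float)
    # Mask out categories absent from the video-level annotation.
    return labels * video_labels

def denoise_by_flipping(pseudo_labels, probs, loss_threshold=2.0, eps=1e-7):
    """Flip pseudo labels whose forward BCE loss is high (likely noisy).

    probs: (T, C) predicted event probabilities from the current model.
    """
    p = np.clip(probs, eps, 1.0 - eps)
    bce = -(pseudo_labels * np.log(p) + (1 - pseudo_labels) * np.log(1 - p))
    flipped = pseudo_labels.copy()
    noisy = bce > loss_threshold  # high loss suggests the label disagrees with the model
    flipped[noisy] = 1.0 - flipped[noisy]
    return flipped
```

In practice the loss threshold would be tuned (or scheduled over epochs), and the category-richness and segment-richness regularization from the paper would shape the predictions before denoising; both are omitted here for brevity.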
Related papers
- Label-anticipated Event Disentanglement for Audio-Visual Video Parsing [61.08434062821899]
We introduce a new decoding paradigm, label semantic-based projection (LEAP).
LEAP works by iteratively projecting encoded latent features of audio/visual segments onto semantically independent label embeddings.
To facilitate the LEAP paradigm, we propose a semantic-aware optimization strategy, which includes a novel audio-visual semantic similarity loss function.
arXiv Detail & Related papers (2024-07-11T01:57:08Z) - Advancing Weakly-Supervised Audio-Visual Video Parsing via Segment-wise Pseudo Labeling [31.197074786874943]
The Audio-Visual Video Parsing task aims to identify and temporally localize the events that occur in either or both the audio and visual streams of audible videos.
Due to the lack of densely annotated labels, recent work attempts to leverage pseudo labels to enrich the supervision.
We propose a new pseudo label generation strategy that can explicitly assign labels to each video segment.
arXiv Detail & Related papers (2024-06-03T01:09:15Z) - Pseudo-labelling meets Label Smoothing for Noisy Partial Label Learning [8.387189407144403]
Partial label learning (PLL) is a weakly-supervised learning paradigm where each training instance is paired with a set of candidate labels (partial label)
NPLL relaxes this constraint by allowing some partial labels to not contain the true label, enhancing the practicality of the problem.
We present a minimalistic framework that initially assigns pseudo-labels to images by exploiting the noisy partial labels through a weighted nearest neighbour algorithm.
arXiv Detail & Related papers (2024-02-07T13:32:47Z) - Temporal Label-Refinement for Weakly-Supervised Audio-Visual Event
Localization [0.0]
AVEL is the task of temporally localizing and classifying audio-visual events, i.e., events simultaneously visible and audible in a video.
In this paper, we solve AVEL in a weakly-supervised setting, where only video-level event labels are available as supervision for training.
Our idea is to use a base model to estimate labels on the training data at a finer temporal resolution than at the video level and re-train the model with these labels.
arXiv Detail & Related papers (2023-07-12T18:13:58Z) - Revisit Weakly-Supervised Audio-Visual Video Parsing from the Language
Perspective [41.07880755312204]
We focus on the weakly-supervised audio-visual video parsing task (AVVP), which aims to identify and locate all the events in audio/visual modalities.
We consider tackling AVVP from the language perspective, since language could freely describe how various events appear in each segment beyond fixed labels.
Our simple yet effective approach outperforms state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2023-06-01T12:12:22Z) - BadLabel: A Robust Perspective on Evaluating and Enhancing Label-noise
Learning [113.8799653759137]
We introduce a novel label noise type called BadLabel, which can significantly degrade the performance of existing LNL algorithms by a large margin.
BadLabel is crafted based on the label-flipping attack against standard classification.
We propose a robust LNL method that perturbs the labels in an adversarial manner at each epoch to make the loss values of clean and noisy labels again distinguishable.
arXiv Detail & Related papers (2023-05-28T06:26:23Z) - Imprecise Label Learning: A Unified Framework for Learning with Various Imprecise Label Configurations [91.67511167969934]
Imprecise label learning (ILL) is a framework for the unification of learning with various imprecise label configurations.
We demonstrate that ILL can seamlessly adapt to partial label learning, semi-supervised learning, noisy label learning, and, more importantly, a mixture of these settings.
arXiv Detail & Related papers (2023-05-22T04:50:28Z) - Exploiting Completeness and Uncertainty of Pseudo Labels for Weakly
Supervised Video Anomaly Detection [149.23913018423022]
Weakly supervised video anomaly detection aims to identify abnormal events in videos using only video-level labels.
Two-stage self-training methods have achieved significant improvements by self-generating pseudo labels.
We propose an enhancement framework by exploiting completeness and uncertainty properties for effective self-training.
arXiv Detail & Related papers (2022-12-08T05:53:53Z) - Learning from Pixel-Level Label Noise: A New Perspective for
Semi-Supervised Semantic Segmentation [12.937770890847819]
We propose a graph based label noise detection and correction framework to deal with pixel-level noisy labels.
In particular, for the generated pixel-level noisy labels from weak supervisions by Class Activation Map (CAM), we train a clean segmentation model with strong supervisions.
Finally, we adopt a superpixel-based graph to represent the relations of spatial adjacency and semantic similarity between pixels in one image.
arXiv Detail & Related papers (2021-03-26T03:23:21Z) - A Study on the Autoregressive and non-Autoregressive Multi-label
Learning [77.11075863067131]
We propose a self-attention based variational encoder-model to extract the label-label and label-feature dependencies jointly.
Our model can therefore be used to predict all labels in parallel while still including both label-label and label-feature dependencies.
arXiv Detail & Related papers (2020-12-03T05:41:44Z) - Labelling unlabelled videos from scratch with multi-modal
self-supervision [82.60652426371936]
Unsupervised labelling of a video dataset does not come for free from strong feature encoders.
We propose a novel clustering method that allows pseudo-labelling of a video dataset without any human annotations.
An extensive analysis shows that the resulting clusters have high semantic overlap to ground truth human labels.
arXiv Detail & Related papers (2020-06-24T12:28:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.