Advancing Weakly-Supervised Audio-Visual Video Parsing via Segment-wise Pseudo Labeling
- URL: http://arxiv.org/abs/2406.00919v1
- Date: Mon, 3 Jun 2024 01:09:15 GMT
- Title: Advancing Weakly-Supervised Audio-Visual Video Parsing via Segment-wise Pseudo Labeling
- Authors: Jinxing Zhou, Dan Guo, Yiran Zhong, Meng Wang
- Abstract summary: The Audio-Visual Video Parsing task aims to identify and temporally localize the events that occur in either or both the audio and visual streams of audible videos.
Due to the lack of densely annotated labels, recent work attempts to leverage pseudo labels to enrich the supervision.
We propose a new pseudo label generation strategy that can explicitly assign labels to each video segment.
- Score: 31.197074786874943
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Audio-Visual Video Parsing task aims to identify and temporally localize the events that occur in either or both the audio and visual streams of audible videos. It is often performed in a weakly-supervised manner, where only video event labels are provided, i.e., the modalities and the timestamps of the labels are unknown. Due to the lack of densely annotated labels, recent work attempts to leverage pseudo labels to enrich the supervision. A commonly used strategy is to generate pseudo labels by categorizing the known video event labels for each modality. However, the labels are still confined to the video level, and the temporal boundaries of events remain unlabeled. In this paper, we propose a new pseudo label generation strategy that can explicitly assign labels to each video segment by utilizing prior knowledge learned from the open world. Specifically, we exploit large-scale pretrained models, namely CLIP and CLAP, to estimate the events in each video segment and generate segment-level visual and audio pseudo labels, respectively. We then propose a new loss function to exploit these pseudo labels by taking into account their category-richness and segment-richness. A label denoising strategy is also adopted to further improve the visual pseudo labels by flipping them whenever abnormally large forward losses occur. We perform extensive experiments on the LLP dataset and demonstrate the effectiveness of each proposed design, achieving state-of-the-art video parsing performance on all types of event parsing, i.e., audio event, visual event, and audio-visual event. We also examine the proposed pseudo label generation strategy on a related weakly-supervised audio-visual event localization task, and the experimental results again verify the benefits and generalization of our method.
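The two mechanisms named in the abstract, segment-wise pseudo labeling and flip-based denoising, can be illustrated with a minimal sketch. This is not the authors' implementation: the toy 2-D vectors stand in for CLIP/CLAP segment and class-prompt embeddings, and the cosine-similarity threshold, the masking by video-level labels, the fixed loss threshold, and all function names are assumptions made for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def segment_pseudo_labels(segment_feats, class_feats, video_labels, threshold=0.5):
    """Build a T x C binary matrix of segment-level pseudo labels.

    segment_feats: T embeddings, one per segment (stand-ins for CLIP/CLAP features).
    class_feats:   C embeddings, one per event class prompt.
    video_labels:  C binary weak labels known at the video level.
    """
    labels = []
    for seg in segment_feats:
        row = []
        for cls, present in zip(class_feats, video_labels):
            # Assumption: a class can only be active in a segment if it is
            # already known to occur somewhere in the video.
            active = present == 1 and cosine(seg, cls) >= threshold
            row.append(1 if active else 0)
        labels.append(row)
    return labels


def bce(p, y, eps=1e-7):
    """Binary cross-entropy of predicted probability p against label y."""
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def denoise_by_flipping(pseudo, probs, loss_threshold):
    """Flip a pseudo label when its forward loss exceeds loss_threshold.

    The fixed threshold is a hypothetical stand-in for whatever criterion
    marks a loss as "abnormally large".
    """
    return [
        [1 - y if bce(p, y) > loss_threshold else y for y, p in zip(y_row, p_row)]
        for y_row, p_row in zip(pseudo, probs)
    ]


# Toy data: 3 segments, 3 classes; the third class is absent from the video.
segment_feats = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
class_feats = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
video_labels = [1, 1, 0]

pseudo = segment_pseudo_labels(segment_feats, class_feats, video_labels)

# Hypothetical model predictions for the same segments and classes.
probs = [[0.9, 0.1, 0.1], [0.05, 0.8, 0.1], [0.9, 0.02, 0.1]]
cleaned = denoise_by_flipping(pseudo, probs, loss_threshold=2.0)
```

In this toy run the third segment is initially labeled with both present classes, and the second of those labels is flipped because the model assigns it a very low probability, which produces a large forward loss.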
Related papers
- Label-anticipated Event Disentanglement for Audio-Visual Video Parsing [61.08434062821899]
We introduce a new decoding paradigm, label semantic-based projection (LEAP).
LEAP works by iteratively projecting encoded latent features of audio/visual segments onto semantically independent label embeddings.
To facilitate the LEAP paradigm, we propose a semantic-aware optimization strategy, which includes a novel audio-visual semantic similarity loss function.
arXiv Detail & Related papers (2024-07-11T01:57:08Z)
- Temporal Label-Refinement for Weakly-Supervised Audio-Visual Event Localization [0.0]
AVEL is the task of temporally localizing and classifying audio-visual events, i.e., events simultaneously visible and audible in a video.
In this paper, we solve AVEL in a weakly-supervised setting, where only video-level event labels are available as supervision for training.
Our idea is to use a base model to estimate labels on the training data at a finer temporal resolution than at the video level and re-train the model with these labels.
arXiv Detail & Related papers (2023-07-12T18:13:58Z)
- Revisit Weakly-Supervised Audio-Visual Video Parsing from the Language Perspective [41.07880755312204]
We focus on the weakly-supervised audio-visual video parsing task (AVVP), which aims to identify and locate all the events in audio/visual modalities.
We consider tackling AVVP from the language perspective, since language could freely describe how various events appear in each segment beyond fixed labels.
Our simple yet effective approach outperforms state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2023-06-01T12:12:22Z)
- Imprecise Label Learning: A Unified Framework for Learning with Various Imprecise Label Configurations [91.67511167969934]
Imprecise label learning (ILL) is a framework that unifies learning with various imprecise label configurations.
We demonstrate that ILL can seamlessly adapt to partial label learning, semi-supervised learning, noisy label learning, and, more importantly, a mixture of these settings.
arXiv Detail & Related papers (2023-05-22T04:50:28Z)
- Improving Audio-Visual Video Parsing with Pseudo Visual Labels [33.25271156393651]
We propose a new strategy to generate segment-level pseudo labels for audio-visual video parsing.
A new loss function is proposed to regularize these labels by taking into account their category-richness and segment-richness.
A label denoising strategy is adopted to improve the pseudo labels by flipping them whenever high forward binary cross entropy loss occurs.
arXiv Detail & Related papers (2023-03-04T07:21:37Z)
- Exploiting Completeness and Uncertainty of Pseudo Labels for Weakly Supervised Video Anomaly Detection [149.23913018423022]
Weakly supervised video anomaly detection aims to identify abnormal events in videos using only video-level labels.
Two-stage self-training methods have achieved significant improvements by self-generating pseudo labels.
We propose an enhancement framework by exploiting completeness and uncertainty properties for effective self-training.
arXiv Detail & Related papers (2022-12-08T05:53:53Z)
- Joint-Modal Label Denoising for Weakly-Supervised Audio-Visual Video Parsing [52.2231419645482]
This paper focuses on the weakly-supervised audio-visual video parsing task, which aims to recognize all events belonging to each modality and localize their temporal boundaries.
arXiv Detail & Related papers (2022-04-25T11:41:17Z)
- Unified Multisensory Perception: Weakly-Supervised Audio-Visual Video Parsing [48.87278703876147]
A new problem, named audio-visual video parsing, aims to parse a video into temporal event segments and label them as audible, visible, or both.
We propose a novel hybrid attention network to explore unimodal and cross-modal temporal contexts simultaneously.
Experimental results show that the challenging audio-visual video parsing can be achieved even with only video-level weak labels.
arXiv Detail & Related papers (2020-07-21T01:53:31Z)
- Labelling unlabelled videos from scratch with multi-modal self-supervision [82.60652426371936]
Unsupervised labelling of a video dataset does not come for free from strong feature encoders.
We propose a novel clustering method that allows pseudo-labelling of a video dataset without any human annotations.
An extensive analysis shows that the resulting clusters have high semantic overlap with ground-truth human labels.
arXiv Detail & Related papers (2020-06-24T12:28:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.