Revisit Weakly-Supervised Audio-Visual Video Parsing from the Language Perspective
- URL: http://arxiv.org/abs/2306.00595v6
- Date: Sat, 28 Oct 2023 03:07:09 GMT
- Title: Revisit Weakly-Supervised Audio-Visual Video Parsing from the Language Perspective
- Authors: Yingying Fan and Yu Wu and Bo Du and Yutian Lin
- Abstract summary: We focus on the weakly-supervised audio-visual video parsing task (AVVP), which aims to identify and locate all the events in audio/visual modalities.
We consider tackling AVVP from the language perspective, since language could freely describe how various events appear in each segment beyond fixed labels.
Our simple yet effective approach outperforms state-of-the-art methods by a large margin.
- Score: 41.07880755312204
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We focus on the weakly-supervised audio-visual video parsing task (AVVP),
which aims to identify and locate all the events in audio/visual modalities.
Previous works only concentrate on video-level overall label denoising across
modalities, but overlook the segment-level label noise, where adjacent video
segments (i.e., 1-second video clips) may contain different events. However,
recognizing events in the segment is challenging because its label could be any
combination of events that occur in the video. To address this issue, we
consider tackling AVVP from the language perspective, since language could
freely describe how various events appear in each segment beyond fixed labels.
Specifically, we design language prompts to describe all cases of event
appearance for each video. Then, the similarity between language prompts and
segments is calculated, where the event of the most similar prompt is regarded
as the segment-level label. In addition, to deal with the mislabeled segments,
we propose to perform dynamic re-weighting on the unreliable segments to adjust
their labels. Experiments show that our simple yet effective approach
outperforms state-of-the-art methods by a large margin.
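To make the two labeling steps concrete, here is a minimal sketch that pairs prompt-to-segment matching with a simple confidence-based re-weighting. It assumes a CLIP-style encoder has already produced comparable segment and prompt features; the function names, the cosine-similarity matching, and the sigmoid weighting rule are illustrative assumptions rather than the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def segment_pseudo_labels(segment_feats, prompt_feats, prompt_events):
    """Assign each segment the event set of its most similar language prompt.

    segment_feats: (T, D) features of T one-second segments.
    prompt_feats:  (P, D) features of P prompts, each describing one possible
                   combination of the video-level events.
    prompt_events: (P, C) multi-hot event labels described by each prompt.
    """
    seg = F.normalize(segment_feats, dim=-1)
    prm = F.normalize(prompt_feats, dim=-1)
    sim = seg @ prm.t()                        # (T, P) cosine similarities
    best = sim.argmax(dim=-1)                  # most similar prompt per segment
    labels = prompt_events[best]               # (T, C) segment-level pseudo labels
    confidence = sim.max(dim=-1).values        # reused for re-weighting below
    return labels, confidence

def reweight(confidence, tau=0.5, temp=0.1):
    """Down-weight unreliable segments. A simple thresholded scheme standing in
    for the paper's dynamic re-weighting, which is more involved."""
    return torch.sigmoid((confidence - tau) / temp)
```

The returned weights would then scale each segment's loss term, so that segments whose best-matching prompt is only weakly similar contribute less to training.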
Related papers
- Label-anticipated Event Disentanglement for Audio-Visual Video Parsing [61.08434062821899]
We introduce a new decoding paradigm, label semantic-based projection (LEAP).
LEAP works by iteratively projecting encoded latent features of audio/visual segments onto semantically independent label embeddings.
To facilitate the LEAP paradigm, we propose a semantic-aware optimization strategy, which includes a novel audio-visual semantic similarity loss function.
arXiv Detail & Related papers (2024-07-11T01:57:08Z)
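A rough sketch of the LEAP projection idea follows, assuming learned per-class label embeddings that repeatedly attend to segment features; the module layout and the number of iterations are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class LabelProjection(nn.Module):
    """Iteratively project audio/visual segment features onto label embeddings
    (a sketch of the LEAP decoding idea, not the authors' exact module)."""
    def __init__(self, num_classes, dim, iters=3):
        super().__init__()
        self.label_emb = nn.Parameter(torch.randn(num_classes, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.iters = iters

    def forward(self, segment_feats):                        # (B, T, D)
        B = segment_feats.size(0)
        q = self.label_emb.unsqueeze(0).expand(B, -1, -1)    # (B, C, D)
        for _ in range(self.iters):
            upd, _ = self.attn(q, segment_feats, segment_feats)
            q = q + upd                  # label-wise features refined per pass
        return q                         # per-class features for event decoding
```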
- Advancing Weakly-Supervised Audio-Visual Video Parsing via Segment-wise Pseudo Labeling [31.197074786874943]
The Audio-Visual Video Parsing task aims to identify and temporally localize the events that occur in either or both the audio and visual streams of audible videos.
Due to the lack of densely annotated labels, recent work attempts to leverage pseudo labels to enrich the supervision.
We propose a new pseudo label generation strategy that can explicitly assign labels to each video segment.
arXiv Detail & Related papers (2024-06-03T01:09:15Z)
- Segment Everything Everywhere All at Once [124.90835636901096]
We present SEEM, a promptable and interactive model for segmenting everything everywhere all at once in an image.
We propose a novel decoding mechanism that enables diverse prompting for all types of segmentation tasks.
We conduct a comprehensive empirical study to validate the effectiveness of SEEM across diverse segmentation tasks.
arXiv Detail & Related papers (2023-04-13T17:59:40Z)
- Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline [53.07236039168652]
We focus on the task of dense-localizing audio-visual events, which aims to jointly localize and recognize all audio-visual events occurring in an untrimmed video.
We introduce the first Untrimmed Audio-Visual dataset, which contains 10K untrimmed videos with over 30K audio-visual events.
Next, we formulate the task using a new learning-based framework, which is capable of fully integrating audio and visual modalities to localize audio-visual events with various lengths and capture dependencies between them in a single pass.
arXiv Detail & Related papers (2023-03-22T22:00:17Z)
- Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos [57.830865926459914]
We propose a vision-language learning framework for untrimmed videos, which automatically detects informative events.
Instead of coarse-level video-language alignments, we present two dual pretext tasks to encourage fine-grained segment-level alignments.
Our framework is easily extensible to tasks covering visually-grounded language understanding and generation.
arXiv Detail & Related papers (2023-03-11T11:00:16Z)
- Improving Audio-Visual Video Parsing with Pseudo Visual Labels [33.25271156393651]
We propose a new strategy to generate segment-level pseudo labels for audio-visual video parsing.
A new loss function is proposed to regularize these labels by taking into account their category-richness and segment-richness.
A label denoising strategy is adopted to improve the pseudo labels by flipping them whenever high forward binary cross entropy loss occurs.
arXiv Detail & Related papers (2023-03-04T07:21:37Z)
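The flipping rule above lends itself to a short sketch; the per-element criterion and the threshold value below are assumptions inferred from the summary, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def denoise_pseudo_labels(logits, pseudo_labels, threshold=2.0):
    """Flip pseudo-label entries whose forward BCE loss is suspiciously high,
    treating a large loss as evidence that the label, not the model, is wrong."""
    with torch.no_grad():
        targets = pseudo_labels.float()
        loss = F.binary_cross_entropy_with_logits(
            logits, targets, reduction="none")   # per-element forward loss
        flip = loss > threshold                  # unreliable label entries
        return torch.where(flip, 1.0 - targets, targets)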
- Joint-Modal Label Denoising for Weakly-Supervised Audio-Visual Video Parsing [52.2231419645482]
This paper focuses on the weakly-supervised audio-visual video parsing task.
It aims to recognize all events belonging to each modality and localize their temporal boundaries.
arXiv Detail & Related papers (2022-04-25T11:41:17Z)
- Investigating Modality Bias in Audio Visual Video Parsing [31.83076679253096]
We focus on the audio-visual video parsing (AVVP) problem that involves detecting audio and visual event labels with temporal boundaries.
An existing state-of-the-art model for AVVP uses a hybrid attention network (HAN) to generate cross-modal features for both audio and visual modalities.
We propose a variant of feature aggregation in HAN that leads to an absolute gain in F-scores of about 2% and 1.6% for visual and audio-visual events at both segment-level and event-level.
arXiv Detail & Related papers (2022-03-31T07:43:01Z)
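For context, HAN lets each modality attend both to itself and to the other modality before aggregating the results. The sketch below follows that published design at a high level; the summary does not spell out the aggregation variant this paper proposes, so the combination step here is only a plausible placeholder.

```python
import torch
import torch.nn as nn

class HybridAttentionLayer(nn.Module):
    """One HAN-style layer: self-attention within each modality plus
    cross-modal attention from the other (a simplified sketch)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.self_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_v = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio, visual):                 # both (B, T, D)
        sa, _ = self.self_a(audio, audio, audio)      # audio attends to itself
        ca, _ = self.cross_a(audio, visual, visual)   # audio attends to visual
        sv, _ = self.self_v(visual, visual, visual)
        cv, _ = self.cross_v(visual, audio, audio)
        # Equal-weight aggregation of self- and cross-modal terms; the paper's
        # proposed variant changes how these features are combined.
        return audio + 0.5 * (sa + ca), visual + 0.5 * (sv + cv)
```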
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.