Related papers: Temporal Label-Refinement for Weakly-Supervised Audio-Visual Event Localization

URL: http://arxiv.org/abs/2307.06385v2
Date: Wed, 19 Jul 2023 14:51:37 GMT
Title: Temporal Label-Refinement for Weakly-Supervised Audio-Visual Event Localization
Authors: Kalyan Ramakrishnan
Abstract summary: Audio-Visual Event Localization (AVEL) is the task of temporally localizing and classifying audio-visual events, i.e., events simultaneously visible and audible in a video. In this paper, we solve AVEL in a weakly-supervised setting, where only video-level event labels are available as supervision for training. Our idea is to use a base model to estimate labels on the training data at a finer temporal resolution than at the video level and re-train the model with these labels.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Audio-Visual Event Localization (AVEL) is the task of temporally localizing and classifying audio-visual events, i.e., events simultaneously visible and audible in a video. In this paper, we solve AVEL in a weakly-supervised setting, where only video-level event labels (their presence/absence, but not their locations in time) are available as supervision for training. Our idea is to use a base model to estimate labels on the training data at a finer temporal resolution than at the video level and re-train the model with these labels. That is, we determine the subset of labels for each slice of frames in a training video by (i) replacing the frames outside the slice with those from a second video having no overlap in video-level labels, and (ii) feeding this synthetic video into the base model to extract labels for just the slice in question. To handle the out-of-distribution nature of our synthetic videos, we propose an auxiliary objective for the base model that induces more reliable predictions of the localized event labels as desired. Our three-stage pipeline outperforms several existing AVEL methods with no architectural changes and improves performance on a related weakly-supervised task as well.
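
The slice-level label estimation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a video is a NumPy array of per-frame features with shape (T, D), that labels are multi-hot vectors, and that the base model is any callable returning video-level class scores. The names make_synthetic_video, refine_slice_labels, base_model, and threshold are hypothetical, and restricting the result to the video's own labels is an assumption drawn from the phrase "subset of labels".

```python
import numpy as np


def make_synthetic_video(video, filler, start, end):
    """Keep frames [start, end) of `video`; take every other frame from `filler`."""
    if video.shape != filler.shape:
        raise ValueError("video and filler must have the same shape")
    synthetic = filler.copy()
    synthetic[start:end] = video[start:end]
    return synthetic


def refine_slice_labels(video, video_labels, filler, filler_labels,
                        start, end, base_model, threshold=0.5):
    """Estimate which video-level labels are present in frames [start, end).

    `base_model` is a hypothetical callable mapping a (T, D) array to
    video-level class scores of shape (num_classes,).
    """
    video_labels = np.asarray(video_labels, dtype=bool)
    filler_labels = np.asarray(filler_labels, dtype=bool)
    # The second video must share no video-level label with the first one.
    if np.any(video_labels & filler_labels):
        raise ValueError("filler video must have no overlapping labels")
    synthetic = make_synthetic_video(video, filler, start, end)
    scores = np.asarray(base_model(synthetic))
    # Assumption: only labels of the original video can belong to the slice,
    # so any class contributed by the filler frames is masked out.
    return (scores >= threshold) & video_labels


# Toy data: 4 frames, 3 classes, per-frame features are one-hot class indicators.
video = np.array([[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0]], dtype=float)
filler = np.tile(np.array([0.0, 0.0, 1.0]), (4, 1))


def toy_model(frames):
    """Stand-in base model: a class scores high if any frame shows it."""
    return frames.max(axis=0)


first_half = refine_slice_labels(video, [1, 1, 0], filler, [0, 0, 1], 0, 2, toy_model)
second_half = refine_slice_labels(video, [1, 1, 0], filler, [0, 0, 1], 2, 4, toy_model)
```

In this toy setup, each half of the video receives only the label of the event it actually contains. The refined slice-level labels would then serve as targets when re-training the model, as the abstract describes; the auxiliary objective for out-of-distribution inputs is not modeled here.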

Related papers

Towards Open-Vocabulary Audio-Visual Event Localization [59.23161248808759]
We introduce the Open-Vocabulary Audio-Visual Event localization problem. This problem requires localizing audio-visual events and predicting explicit categories for both seen and unseen data at inference. We propose the OV-AVEBench dataset, comprising 24,800 videos across 67 real-life audio-visual scenes.
arXiv Detail & Related papers (2024-11-18T04:35:20Z)
Advancing Weakly-Supervised Audio-Visual Video Parsing via Segment-wise Pseudo Labeling [31.197074786874943]
The Audio-Visual Video Parsing task aims to identify and temporally localize the events that occur in either or both the audio and visual streams of audible videos. Due to the lack of densely annotated labels, recent work attempts to leverage pseudo labels to enrich the supervision. We propose a new pseudo label generation strategy that can explicitly assign labels to each video segment.
arXiv Detail & Related papers (2024-06-03T01:09:15Z)
Learning text-to-video retrieval from image captioning [59.81537951811595]
We describe a protocol to study text-to-video retrieval training with unlabeled videos. We assume (i) no access to labels for any videos, and (ii) access to labeled images in the form of text. We show that automatically labeling video frames with image captioning allows text-to-video retrieval training.
arXiv Detail & Related papers (2024-04-26T15:56:08Z)
Text Prompt with Normality Guidance for Weakly Supervised Video Anomaly Detection [10.269746485037935]
We propose a novel pseudo-label generation and self-training framework based on Text Prompt with Normality Guidance for WSVAD. Our method achieves state-of-the-art performance on two benchmark datasets, UCF-Crime and XD-Viole.
arXiv Detail & Related papers (2024-04-12T15:18:25Z)
Multi-View Video-Based Learning: Leveraging Weak Labels for Frame-Level Perception [1.5741307755393597]
We propose a novel learning framework to train a video-based action recognition model with weak labels for frame-level perception. For training the model using the weak labels, we propose a novel latent loss function. We also propose a model that uses the view-specific latent embeddings for downstream frame-level action recognition and detection tasks.
arXiv Detail & Related papers (2024-03-18T09:47:41Z)
Helping Hands: An Object-Aware Ego-Centric Video Recognition Model [60.350851196619296]
We introduce an object-aware decoder for improving the performance of ego-centric representations on ego-centric videos. We show that the model can act as a drop-in replacement for an ego-awareness video model to improve performance through visual-text grounding.
arXiv Detail & Related papers (2023-08-15T17:58:11Z)
Two-shot Video Object Segmentation [35.48207692959968]
We train a video object segmentation model on sparsely annotated videos. We generate pseudo labels for unlabeled frames and optimize the model on the combination of labeled and pseudo-labeled data. For the first time, we present a general way to train VOS models on two-shot VOS datasets.
arXiv Detail & Related papers (2023-03-21T17:59:56Z)
Improving Audio-Visual Video Parsing with Pseudo Visual Labels [33.25271156393651]
We propose a new strategy to generate segment-level pseudo labels for audio-visual video parsing. A new loss function is proposed to regularize these labels by taking into account their category-richness and segment-richness. A label denoising strategy is adopted to improve the pseudo labels by flipping them whenever high forward binary cross entropy loss occurs.
arXiv Detail & Related papers (2023-03-04T07:21:37Z)
Tag-Based Attention Guided Bottom-Up Approach for Video Instance Segmentation [83.13610762450703]
Video instance segmentation is a fundamental computer vision task that deals with segmenting and tracking object instances across a video sequence. We introduce a simple end-to-end trainable bottom-up approach to achieve instance mask predictions at the pixel-level granularity, instead of the typical region-proposals-based approach. Our method provides competitive results on YouTube-VIS and DAVIS-19 datasets, and has minimum run-time compared to other contemporary state-of-the-art methods.
arXiv Detail & Related papers (2022-04-22T15:32:46Z)
Reducing the Annotation Effort for Video Object Segmentation Datasets [50.893073670389164]
Densely labeling every frame with pixel masks does not scale to large datasets. We use a deep convolutional network to automatically create pseudo-labels on a pixel level from much cheaper bounding box annotations. We obtain the new TAO-VOS benchmark, which we make publicly available at www.vision.rwth-aachen.de/page/taovos.
arXiv Detail & Related papers (2020-11-02T17:34:45Z)
Labelling unlabelled videos from scratch with multi-modal self-supervision [82.60652426371936]
Unsupervised labelling of a video dataset does not come for free from strong feature encoders. We propose a novel clustering method that allows pseudo-labelling of a video dataset without any human annotations. An extensive analysis shows that the resulting clusters have high semantic overlap with ground truth human labels.
arXiv Detail & Related papers (2020-06-24T12:28:17Z)

This list is automatically generated from the titles and abstracts of the papers on this site.