Multi-View Video-Based Learning: Leveraging Weak Labels for Frame-Level Perception
- URL: http://arxiv.org/abs/2403.11616v2
- Date: Tue, 19 Mar 2024 05:49:31 GMT
- Title: Multi-View Video-Based Learning: Leveraging Weak Labels for Frame-Level Perception
- Authors: Vijay John, Yasutomo Kawanishi
- Abstract summary: We propose a novel learning framework to train a video-based action recognition model with weak labels for frame-level perception.
For training the model using the weak labels, we propose a novel latent loss function.
We also propose a model that uses the view-specific latent embeddings for downstream frame-level action recognition and detection tasks.
- Score: 1.5741307755393597
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: For training a video-based action recognition model that accepts multi-view video, annotating frame-level labels is tedious and difficult, whereas annotating sequence-level labels is relatively easy. Such coarse annotations are called weak labels. However, training a multi-view video-based action recognition model with weak labels for frame-level perception is challenging. In this paper, we propose a novel learning framework in which the weak labels are first used to train a multi-view video-based base model, which is subsequently used for downstream frame-level perception tasks. The base model is trained to obtain individual latent embeddings for each view in the multi-view input. For training the model using the weak labels, we propose a novel latent loss function. We also propose a model that uses the view-specific latent embeddings for downstream frame-level action recognition and detection tasks. The proposed framework is evaluated on the MM Office dataset against several baseline algorithms. The results show that the proposed base model is effectively trained using weak labels and that the latent embeddings help the downstream models improve accuracy.
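The abstract describes per-view latent embeddings trained from sequence-level weak labels. Since the paper's exact architecture and latent loss are not reproduced here, the following is only a minimal PyTorch-style sketch of the general idea; the per-view encoder layout, the max-pooling aggregation over time, and the BCE objective are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MultiViewBaseModel(nn.Module):
    """Toy base model: one encoder per view -> per-view latent embeddings.

    A sequence-level (weak) prediction is pooled from the frame-level
    logits, so the model can be supervised with sequence labels only.
    """

    def __init__(self, num_views: int, feat_dim: int, latent_dim: int, num_classes: int):
        super().__init__()
        # One lightweight encoder per camera view (assumed design).
        self.encoders = nn.ModuleList(
            [nn.Linear(feat_dim, latent_dim) for _ in range(num_views)]
        )
        self.classifier = nn.Linear(latent_dim * num_views, num_classes)

    def forward(self, views):
        # views[v]: (batch, time, feat_dim) frame features for view v
        latents = [enc(x) for enc, x in zip(self.encoders, views)]
        fused = torch.cat(latents, dim=-1)           # (B, T, V*latent_dim)
        frame_logits = self.classifier(fused)        # (B, T, C)
        # Weak supervision: max-pool frame logits over time so only a
        # sequence-level label is needed (one plausible choice, not the
        # paper's latent loss).
        seq_logits = frame_logits.max(dim=1).values  # (B, C)
        return latents, frame_logits, seq_logits

# Usage with sequence-level multi-hot weak labels:
model = MultiViewBaseModel(num_views=2, feat_dim=512, latent_dim=128, num_classes=12)
views = [torch.randn(4, 30, 512) for _ in range(2)]
weak_labels = torch.randint(0, 2, (4, 12)).float()
_, _, seq_logits = model(views)
loss = nn.BCEWithLogitsLoss()(seq_logits, weak_labels)
loss.backward()
```

Here the weak supervision is reduced to a plain sequence-level BCE; the paper's actual latent loss presumably constrains the view-specific embeddings themselves, which this sketch does not attempt to reproduce.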
Related papers
- Helping Hands: An Object-Aware Ego-Centric Video Recognition Model [60.350851196619296]
We introduce an object-aware decoder for improving the performance of ego-centric representations on ego-centric videos.
We show that the model can act as a drop-in replacement for an ego-centric video model to improve performance through visual-text grounding.
arXiv Detail & Related papers (2023-08-15T17:58:11Z)
- Learning Referring Video Object Segmentation from Weak Annotation [78.45828085350936]
Referring video object segmentation (RVOS) is a task that aims to segment the target object in all video frames based on a sentence describing the object.
We propose a new annotation scheme that reduces the annotation effort by 8 times, while providing sufficient supervision for RVOS.
Our scheme only requires a mask for the frame where the object first appears and bounding boxes for the rest of the frames.
arXiv Detail & Related papers (2023-08-04T06:50:52Z)
- Temporal Label-Refinement for Weakly-Supervised Audio-Visual Event Localization [0.0]
AVEL is the task of temporally localizing and classifying audio-visual events, i.e., events simultaneously visible and audible in a video.
In this paper, we solve AVEL in a weakly-supervised setting, where only video-level event labels are available as supervision for training.
Our idea is to use a base model to estimate labels on the training data at a finer temporal resolution than the video level and re-train the model with these labels; a rough sketch of this refinement step follows below.
arXiv Detail & Related papers (2023-07-12T18:13:58Z)
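As a hedged illustration of the label-refinement idea above: a video-level model scores short temporal segments, and thresholded scores become finer-grained pseudo-labels for re-training. The segment length, threshold, and the model's output convention below are hypothetical choices, not taken from the paper.

```python
import torch

def refine_labels(model, frames, video_labels, segment_len=16, threshold=0.5):
    """Estimate segment-level pseudo-labels from a video-level model.

    frames: (T, feat_dim) features for one video.
    video_labels: (C,) multi-hot video-level weak labels.
    Assumes model(segment) returns (1, C) class logits.
    Returns a (num_segments, C) pseudo-label matrix restricted to the
    classes present at the video level.
    """
    pseudo = []
    for start in range(0, frames.shape[0], segment_len):
        segment = frames[start:start + segment_len].unsqueeze(0)  # (1, L, D)
        with torch.no_grad():
            probs = torch.sigmoid(model(segment)).squeeze(0)      # (C,)
        # Keep only classes the weak label says occur somewhere in the video.
        pseudo.append((probs > threshold).float() * video_labels)
    return torch.stack(pseudo)  # re-train the model on these finer labels
```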
- Active Learning for Video Classification with Frame Level Queries [13.135234328352885]
We propose a novel active learning framework for video classification.
Our framework identifies a batch of exemplar videos, together with a set of informative frames for each video.
This involves much less manual work than watching the complete video to come up with a label.
arXiv Detail & Related papers (2023-07-10T15:47:13Z)
- Revealing Single Frame Bias for Video-and-Language Learning [115.01000652123882]
We show that a single-frame trained model can achieve better performance than existing methods that use multiple frames for training.
This result reveals the existence of a strong "static appearance bias" in popular video-and-language datasets.
We propose two new retrieval tasks based on existing fine-grained action recognition datasets that encourage temporal modeling.
arXiv Detail & Related papers (2022-06-07T16:28:30Z)
- Multi-Scale Self-Contrastive Learning with Hard Negative Mining for Weakly-Supervised Query-based Video Grounding [27.05117092371221]
We propose a self-contrastive learning framework to address the query-based video grounding task under a weakly-supervised setting.
Firstly, we propose a new grounding scheme that learns frame-wise matching scores referring to the query semantic to predict the possible foreground frames.
Secondly, since some predicted frames are relatively coarse and exhibit similar appearance to their adjacent frames, we propose a coarse-to-fine contrastive learning paradigm.
arXiv Detail & Related papers (2022-03-08T04:01:08Z)
- ActionCLIP: A New Paradigm for Video Action Recognition [14.961103794667341]
We provide a new perspective on action recognition by attaching importance to the semantic information of label texts.
We propose a new paradigm based on this multimodal learning framework for action recognition, which we dub "pre-train, prompt and fine-tune"; the prompt-based matching idea is sketched below.
arXiv Detail & Related papers (2021-09-17T11:21:34Z)
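ActionCLIP's core idea, matching video features against embedded label text, can be condensed into a short sketch. The prompt template, the `text_encoder` API, and the similarity scoring below are illustrative assumptions; the paper's actual pre-training and fine-tuning recipe is more involved.

```python
import torch
import torch.nn.functional as F

def classify_by_label_text(video_emb, label_names, text_encoder, temperature=0.07):
    """Score action classes by video/text embedding similarity (CLIP-style).

    video_emb: (B, D) pooled video embeddings from some video encoder.
    text_encoder: maps a list of strings to a (C, D) tensor (assumed API).
    """
    # Prompt engineering: wrap each label name in a natural-language template.
    prompts = [f"a video of a person {name}" for name in label_names]
    text_emb = text_encoder(prompts)               # (C, D)
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.T / temperature  # (B, C)
    return logits.softmax(dim=-1)
```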
- Multi-Label Image Classification with Contrastive Learning [57.47567461616912]
We show that a direct application of contrastive learning can hardly improve performance in multi-label cases.
We propose a novel framework for multi-label classification with contrastive learning in a fully supervised setting; a generic supervised variant is sketched below.
arXiv Detail & Related papers (2021-07-24T15:00:47Z)
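The paper's specific loss is not reproduced here; the following is only a hedged sketch of a generic supervised contrastive loss in which samples sharing at least one label are treated as positives, one simple way to adapt contrastive learning to the multi-label setting.

```python
import torch
import torch.nn.functional as F

def multilabel_supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss with label-overlap positives.

    embeddings: (N, D) features to be L2-normalized.
    labels: (N, C) multi-hot labels; pairs sharing any label are positives.
    """
    z = F.normalize(embeddings, dim=-1)
    sim = z @ z.T / temperature                    # (N, N) similarities
    lab = labels.float()
    # Positive mask: samples that share at least one class, excluding self.
    pos = (lab @ lab.T > 0).float()
    pos.fill_diagonal_(0)
    # Mask self-similarity, then log-softmax over all other samples.
    logits = sim - torch.eye(len(z), device=z.device) * 1e9
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    denom = pos.sum(dim=1).clamp(min=1)            # avoid divide-by-zero
    return -(pos * log_prob).sum(dim=1).div(denom).mean()
```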
- Uncertainty-Aware Weakly Supervised Action Detection from Untrimmed Videos [82.02074241700728]
In this paper, we present an action recognition model that is trained with only video-level labels.
Our method uses per-person detectors trained on large image datasets within a Multiple Instance Learning framework.
We show how our method can be applied even when the standard Multiple Instance Learning assumption, that each bag contains at least one instance with the specified label, is invalid; a generic MIL pooling sketch follows below.
arXiv Detail & Related papers (2020-07-21T10:45:05Z)
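The Multiple Instance Learning setup referenced above can be illustrated with a generic sketch: each video is a bag of frame (instance) scores, and the bag score is an aggregate such as a max over time. The max-pooling aggregation and the binary objective are illustrative assumptions, not this paper's exact formulation.

```python
import torch
import torch.nn as nn

def mil_bag_loss(instance_logits, bag_labels):
    """Standard max-pooling MIL: a bag is positive if any instance is.

    instance_logits: (B, T, C) per-frame class logits (the instances).
    bag_labels: (B, C) multi-hot video-level labels (the bags).
    """
    bag_logits = instance_logits.max(dim=1).values  # (B, C) bag scores
    return nn.functional.binary_cross_entropy_with_logits(bag_logits, bag_labels)

# Usage on random data:
logits = torch.randn(4, 30, 10, requires_grad=True)
labels = torch.randint(0, 2, (4, 10)).float()
mil_bag_loss(logits, labels).backward()
```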
- Generalized Few-Shot Video Classification with Video Retrieval and Feature Generation [132.82884193921535]
We argue that previous methods underestimate the importance of video feature learning and propose a two-stage approach.
We show that this simple baseline approach outperforms prior few-shot video classification methods by over 20 points on existing benchmarks.
We present two novel approaches that yield further improvement.
arXiv Detail & Related papers (2020-07-09T13:05:32Z)