2by2: Weakly-Supervised Learning for Global Action Segmentation
- URL: http://arxiv.org/abs/2412.12829v1
- Date: Tue, 17 Dec 2024 11:49:36 GMT
- Title: 2by2: Weakly-Supervised Learning for Global Action Segmentation
- Authors: Elena Bueno-Benito, Mariella Dimiccoli
- Abstract summary: This paper presents a simple yet effective approach for the poorly investigated task of global action segmentation.
We propose to use activity labels to learn, in a weakly-supervised fashion, action representations suitable for global action segmentation.
For the backbone architecture, we use a Siamese network based on sparse transformers that takes video pairs as input and determines whether they belong to the same activity.
- Score: 4.880243880711163
- Abstract: This paper presents a simple yet effective approach for the poorly investigated task of global action segmentation, which aims at grouping frames capturing the same action across videos of different activities. Unlike the case of videos that all depict the same activity, the temporal order of actions is not roughly shared among all videos, making the task even more challenging. We propose to use activity labels to learn, in a weakly-supervised fashion, action representations suitable for global action segmentation. For this purpose, we introduce a triadic learning approach for video pairs to ensure intra-video action discrimination, as well as inter-video and inter-activity action association. For the backbone architecture, we use a Siamese network based on sparse transformers that takes video pairs as input and determines whether they belong to the same activity. The proposed approach is validated on two challenging benchmark datasets, Breakfast and YouTube Instructions, outperforming state-of-the-art methods.
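To make the backbone concrete, here is a minimal PyTorch sketch of a Siamese pair classifier in the spirit of the abstract. It is an illustration, not the authors' code: a dense transformer encoder stands in for their sparse transformer, and all module names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class SiamesePairClassifier(nn.Module):
    """Shared encoder applied to both videos of a pair; a binary head
    predicts whether the two videos belong to the same activity.
    (Illustrative sketch; dense encoder stands in for a sparse transformer.)"""

    def __init__(self, feat_dim=2048, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(2 * d_model, 1)  # same-activity logit

    def embed(self, frames):                   # frames: (B, T, feat_dim)
        x = self.encoder(self.proj(frames))    # per-frame action embeddings
        return x, x.mean(dim=1)                # frame-level and video-level

    def forward(self, video_a, video_b):
        _, za = self.embed(video_a)            # shared weights -> Siamese
        _, zb = self.embed(video_b)
        return self.head(torch.cat([za, zb], dim=-1)).squeeze(-1)
```

Under this weak supervision, the only labels used are activity names, which define positive and negative video pairs; the per-frame embeddings are what would later be grouped into global action segments.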
Related papers
- Video-Specific Query-Key Attention Modeling for Weakly-Supervised Temporal Action Localization [14.43055117008746]
Weakly-supervised temporal action localization aims to identify and localize the action instances in untrimmed videos with only video-level action labels.
We propose a network named VQK-Net with video-specific query-key attention modeling that learns a unique query for each action category of each input video.
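A toy PyTorch sketch of the video-specific query idea described above (this is not the VQK-Net architecture; the refinement step, module names, and dimensions are all assumptions):

```python
import torch
import torch.nn as nn

class VideoSpecificQueryAttention(nn.Module):
    """Toy version: one learned query per action class, refined with a
    video-level summary so each video gets its own queries, then matched
    against per-frame keys to score class presence over time."""

    def __init__(self, n_classes, feat_dim=1024, d=256):
        super().__init__()
        self.class_queries = nn.Parameter(torch.randn(n_classes, d))
        self.to_key = nn.Linear(feat_dim, d)
        self.refine = nn.Linear(2 * d, d)  # mixes class query with video context

    def forward(self, frames):             # frames: (B, T, feat_dim)
        keys = self.to_key(frames)         # (B, T, d)
        context = keys.mean(dim=1)         # (B, d) video-level summary
        q = self.class_queries.unsqueeze(0).expand(frames.size(0), -1, -1)
        q = self.refine(torch.cat([q, context.unsqueeze(1).expand_as(q)], -1))
        return torch.einsum("bcd,btd->bct", q, keys)  # class-vs-time scores
```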
arXiv Detail & Related papers (2023-05-07T04:18:22Z)
- Video alignment using unsupervised learning of local and global features [0.0]
We introduce an unsupervised method for alignment that uses global and local features of the frames.
In particular, we introduce effective features for each video frame by means of three machine vision tools: person detection, pose estimation, and a VGG network.
The main advantage of our approach is that no training is required, which makes it applicable for any new type of action without any need to collect training samples for it.
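Since no training is required, the approach boils down to extracting per-frame descriptors and aligning the two sequences. A minimal sketch using plain dynamic time warping (the use of DTW and the NumPy interface are assumptions for illustration, not the paper's exact method):

```python
import numpy as np

def align_videos(feats_a, feats_b):
    """Training-free alignment sketch. feats_a, feats_b: np.ndarray (T, D)
    per-frame descriptors (e.g. concatenated pose + CNN features)."""
    Ta, Tb = len(feats_a), len(feats_b)
    cost = np.linalg.norm(feats_a[:, None] - feats_b[None, :], axis=-1)
    D = np.full((Ta + 1, Tb + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Ta + 1):            # classic DTW recurrence
        for j in range(1, Tb + 1):
            D[i, j] = cost[i-1, j-1] + min(D[i-1, j], D[i, j-1], D[i-1, j-1])
    path, i, j = [], Ta, Tb               # backtrack frame correspondences
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        i, j = min([(i-1, j), (i, j-1), (i-1, j-1)], key=lambda p: D[p])
    return path[::-1], D[Ta, Tb]
```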
arXiv Detail & Related papers (2023-04-13T22:20:54Z)
- Part-level Action Parsing via a Pose-guided Coarse-to-Fine Framework [108.70949305791201]
Part-level Action Parsing (PAP) aims to not only predict the video-level action but also recognize the frame-level fine-grained actions or interactions of body parts for each person in the video.
In particular, our framework first predicts the video-level class of the input video, then localizes the body parts and predicts the part-level action.
Our framework achieves state-of-the-art performance and outperforms existing methods with a 31.10% ROC score.
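The coarse-to-fine flow can be pictured in a few lines of Python; the three callables are hypothetical placeholders, not the paper's actual components:

```python
def parse_part_actions(video, video_classifier, part_localizer, part_action_head):
    """Coarse-to-fine sketch: video-level class first, then part-level actions."""
    activity = video_classifier(video)             # 1) coarse video-level class
    parts = part_localizer(video)                  # 2) localize body parts
    # 3) fine-grained action per part, conditioned on the coarse prediction
    return {part: part_action_head(video, part, activity) for part in parts}
```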
arXiv Detail & Related papers (2022-03-09T01:30:57Z)
- Temporal Action Segmentation with High-level Complex Activity Labels [29.17792724210746]
We learn action segments taking only the high-level activity labels as input.
We propose a novel action discovery framework that automatically discovers constituent actions in videos.
arXiv Detail & Related papers (2021-08-15T09:50:42Z)
- Joint Inductive and Transductive Learning for Video Object Segmentation [107.32760625159301]
Semi-supervised video object segmentation is the task of segmenting the target object in a video sequence, given only its mask in the first frame.
Most previous best-performing methods adopt matching-based transductive reasoning or online inductive learning.
We propose to integrate transductive and inductive learning into a unified framework to exploit the complementarity between them for accurate and robust video object segmentation.
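A simplified sketch of how matching-based (transductive) and online-learned (inductive) predictions might be fused; the linear blending, shapes, and names are assumptions rather than the paper's design:

```python
import torch

def transductive_mask(feat, ref_feat, ref_mask, temperature=0.1):
    """Matching-based (transductive) prediction: label each pixel by its
    similarity to first-frame pixels. feat, ref_feat: (C, H*W); ref_mask: (H*W,)."""
    sim = (feat.t() @ ref_feat) / temperature          # (HW, HW) affinities
    attn = sim.softmax(dim=-1)
    return attn @ ref_mask                             # soft foreground score per pixel

def fused_prediction(feat, ref_feat, ref_mask, inductive_head, alpha=0.5):
    """Blend the matching branch with an online-trained (inductive) head,
    e.g. a small nn.Linear(C, 1) fitted on the first frame."""
    trans = transductive_mask(feat, ref_feat, ref_mask)
    induc = torch.sigmoid(inductive_head(feat.t())).squeeze(-1)  # (HW,)
    return alpha * trans + (1 - alpha) * induc
```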
arXiv Detail & Related papers (2021-08-08T16:25:48Z)
- ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use a contrastive loss with video clips as instances and learn visual representations by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learning robust video representations.
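In its simplest form, such a positive-sample consistency objective pulls two views of the same video together without explicit negatives. A minimal sketch (the exact loss form is an assumption chosen for illustration):

```python
import torch
import torch.nn.functional as F

def consistency_loss(z_anchor, z_positive):
    """Pull two views of the same video together, e.g. appearance- and
    speed-augmented clips; no negatives in this simplified form."""
    z_anchor = F.normalize(z_anchor, dim=-1)
    z_positive = F.normalize(z_positive, dim=-1)
    return (1 - (z_anchor * z_positive).sum(dim=-1)).mean()  # 1 - cosine sim
```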
arXiv Detail & Related papers (2021-06-04T08:44:50Z)
- Unsupervised Action Segmentation with Self-supervised Feature Learning and Co-occurrence Parsing [32.66011849112014]
Temporal action segmentation is the task of classifying each frame in a video with an action label.
In this work we explore a self-supervised method that operates on a corpus of unlabeled videos and predicts a likely set of temporal segments across the videos.
We develop CAP, a novel co-occurrence action parsing algorithm that can not only capture the correlation among sub-actions underlying the structure of activities, but also estimate the temporal trajectory of the sub-actions in an accurate and general way.
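As a toy illustration of the kind of statistics a co-occurrence parser can exploit, the sketch below counts which sub-actions co-occur in the same video and where they tend to fall temporally; it is not the CAP algorithm itself:

```python
import numpy as np

def cooccurrence_stats(video_labels, n_actions):
    """video_labels: list of per-frame label sequences, one per video.
    Returns a co-occurrence count matrix and mean relative positions."""
    C = np.zeros((n_actions, n_actions))
    pos, cnt = np.zeros(n_actions), np.zeros(n_actions)
    for labels in video_labels:
        present = np.unique(labels)
        for a in present:                    # which sub-actions co-occur
            for b in present:
                C[a, b] += 1
        for t, a in enumerate(labels):       # relative temporal position
            pos[a] += t / max(len(labels) - 1, 1)
            cnt[a] += 1
    return C, pos / np.maximum(cnt, 1)
```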
arXiv Detail & Related papers (2021-05-29T00:29:40Z)
- Semi-Supervised Action Recognition with Temporal Contrastive Learning [50.08957096801457]
We learn a two-pathway temporal contrastive model using unlabeled videos at two different speeds.
We considerably outperform video extensions of sophisticated state-of-the-art semi-supervised image recognition methods.
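A minimal sketch of a two-speed contrastive objective in this spirit, using in-batch negatives (InfoNCE); the encoder interface and clip layout are assumptions:

```python
import torch
import torch.nn.functional as F

def two_speed_contrastive(encoder, clips, tau=0.07):
    """Each unlabeled video yields a slow and a fast clip; a video's two
    speeds are positives, other videos in the batch are negatives.
    clips: (B, 2, ...) with clips[:, 0] slow and clips[:, 1] fast."""
    slow = F.normalize(encoder(clips[:, 0]), dim=-1)    # (B, D)
    fast = F.normalize(encoder(clips[:, 1]), dim=-1)
    logits = slow @ fast.t() / tau                      # (B, B) similarities
    targets = torch.arange(len(slow), device=logits.device)
    return F.cross_entropy(logits, targets)             # InfoNCE
```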
arXiv Detail & Related papers (2021-02-04T17:28:35Z)
- Improved Actor Relation Graph based Group Activity Recognition [0.0]
The detailed description of human actions and group activities is essential information, which can be used in real-time CCTV video surveillance, health care, sports video analysis, etc.
This study proposes a video understanding method that mainly focuses on group activity recognition by learning pairwise actor appearance similarity and actor positions.
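A toy version of such a pairwise actor graph, combining appearance similarity with a spatial discount; the Gaussian distance kernel is an assumption, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def actor_relation_graph(appearance, positions, sigma=1.0):
    """Toy pairwise graph: edge weight = appearance similarity, discounted
    by spatial distance. appearance: (N, D), positions: (N, 2)."""
    app = F.normalize(appearance, dim=-1)
    sim = app @ app.t()                                  # appearance similarity
    dist = torch.cdist(positions, positions)             # pairwise distances
    graph = sim * torch.exp(-dist**2 / (2 * sigma**2))   # distance-discounted edges
    return graph.softmax(dim=-1)                         # row-normalized graph
```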
arXiv Detail & Related papers (2020-10-24T19:46:49Z)
- Learning to Localize Actions from Moments [153.54638582696128]
We introduce a new type of transfer learning design to learn action localization for a large set of action categories.
We present Action Herald Networks (AherNet), which integrate this design into a one-stage action localization framework.
arXiv Detail & Related papers (2020-08-31T16:03:47Z)
- Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos [76.21297023629589]
We propose a novel method for learning pairwise modality interactions in order to better exploit complementary information for each pair of modalities in videos.
Our method achieves state-of-the-art performance on four standard benchmark datasets.
arXiv Detail & Related papers (2020-07-28T12:40:59Z)