Set-Constrained Viterbi for Set-Supervised Action Segmentation
- URL: http://arxiv.org/abs/2002.11925v2
- Date: Fri, 27 Mar 2020 23:00:12 GMT
- Title: Set-Constrained Viterbi for Set-Supervised Action Segmentation
- Authors: Jun Li, Sinisa Todorovic
- Abstract summary: This paper is about weakly supervised action segmentation.
The ground truth specifies only a set of actions present in a training video, but not their true temporal ordering.
We extend this framework by specifying an HMM, which accounts for co-occurrences of action classes and their temporal lengths.
- Score: 40.22433538226469
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper is about weakly supervised action segmentation, where the ground
truth specifies only a set of actions present in a training video, but not
their true temporal ordering. Prior work typically uses a classifier that
independently labels video frames for generating the pseudo ground truth, and
multiple instance learning for training the classifier. We extend this
framework by specifying an HMM, which accounts for co-occurrences of action
classes and their temporal lengths, and by explicitly training the HMM on a
Viterbi-based loss. Our first contribution is the formulation of a new
set-constrained Viterbi algorithm (SCV). Given a video, the SCV generates the
MAP action segmentation that satisfies the ground truth. This prediction is
used as a framewise pseudo ground truth in our HMM training. Our second
contribution in training is a new regularization of feature affinities between
training videos that share the same action classes. Evaluation on action
segmentation and alignment on the Breakfast, MPII Cooking2, and Hollywood
Extended datasets demonstrates significant performance improvements over prior
work on both tasks.
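
For intuition, the following is a minimal, hypothetical Python sketch of the core idea behind a set-constrained Viterbi decode: a dynamic program over framewise class scores whose state also tracks, as a bitmask, which actions from the ground-truth set have been used, so the best-scoring path is forced to cover the entire set. The function name is ours, each action is restricted to a single contiguous segment for simplicity, and the paper's HMM transition and length terms are omitted, so this illustrates the set constraint rather than the paper's exact algorithm.

import numpy as np

def set_constrained_viterbi(log_probs, action_set):
    """Hypothetical sketch: Viterbi decode constrained to cover a set.

    log_probs  : (T, C) array of framewise log class scores.
    action_set : list of class indices that must all appear.
    Returns a length-T label sequence using every action in the set,
    each as one contiguous segment (assumes T >= len(action_set)).
    """
    T = log_probs.shape[0]
    K = len(action_set)
    full = (1 << K) - 1
    # dp[mask, k]: best score of a path ending in action k whose visited
    # actions are exactly those flagged in the bitmask `mask`.
    dp = np.full((1 << K, K), -np.inf)
    back = {}  # (t, mask, k) -> predecessor state (mask, k) at t - 1
    for k, c in enumerate(action_set):
        dp[1 << k, k] = log_probs[0, c]
    for t in range(1, T):
        ndp = np.full_like(dp, -np.inf)
        for mask in range(1 << K):
            for k in range(K):
                if dp[mask, k] == -np.inf:
                    continue
                # Option 1: stay in the current action segment.
                s = dp[mask, k] + log_probs[t, action_set[k]]
                if s > ndp[mask, k]:
                    ndp[mask, k] = s
                    back[(t, mask, k)] = (mask, k)
                # Option 2: open a segment for a not-yet-used action.
                for j in range(K):
                    if not mask & (1 << j):
                        nm = mask | (1 << j)
                        s = dp[mask, k] + log_probs[t, action_set[j]]
                        if s > ndp[nm, j]:
                            ndp[nm, j] = s
                            back[(t, nm, j)] = (mask, k)
        dp = ndp
    # The best final state must have visited the full action set.
    mask, k = full, int(np.argmax(dp[full]))
    labels = [action_set[k]]
    for t in range(T - 1, 0, -1):
        mask, k = back[(t, mask, k)]
        labels.append(action_set[k])
    return labels[::-1]

The bitmask makes this decode exponential in the set size, O(T * 2^K * K^2), which is workable only because per-video action sets are small; the resulting framewise labels would then play the role of the pseudo ground truth described above.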
Related papers
- Transform-Equivariant Consistency Learning for Temporal Sentence
Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation is that the temporal boundary of the query-guided activity should be consistently predicted.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z)
- Weakly-Supervised Temporal Action Localization with Bidirectional
Semantic Consistency Constraint [83.36913240873236]
Weakly Supervised Temporal Action Localization (WTAL) aims to classify actions and localize their temporal boundaries in videos.
We propose a simple yet efficient method, named bidirectional semantic consistency constraint (Bi-SCC), to discriminate the positive actions from co-scene actions.
Experimental results show that our approach outperforms the state-of-the-art methods on THUMOS14 and ActivityNet.
arXiv Detail & Related papers (2023-04-25T07:20:33Z)
- Fine-grained Temporal Contrastive Learning for Weakly-supervised
Temporal Action Localization [87.47977407022492]
This paper argues that learning by contextually comparing sequence-to-sequence distinctions offers an essential inductive bias in weakly-supervised action localization.
Under a differentiable dynamic programming formulation, two complementary contrastive objectives are designed, including Fine-grained Sequence Distance (FSD) contrasting and Longest Common Subsequence (LCS) contrasting.
Our method achieves state-of-the-art performance on two popular benchmarks.
arXiv Detail & Related papers (2022-03-31T05:13:50Z)
- Unsupervised Pre-training for Temporal Action Localization Tasks [76.01985780118422]
We propose a self-supervised pretext task, coined as Pseudo Action localization (PAL), to Unsupervisedly Pre-train feature encoders for Temporal Action localization tasks (UP-TAL).
Specifically, we first randomly select temporal regions, each of which contains multiple clips, from one video as pseudo actions and then paste them onto different temporal positions of the other two videos.
The pretext task is to align the features of pasted pseudo action regions from two synthetic videos and maximize the agreement between them.
arXiv Detail & Related papers (2022-03-25T12:13:43Z)
- Boundary-aware Self-supervised Learning for Video Scene Segmentation [20.713635723315527]
Video scene segmentation is a task of temporally localizing scene boundaries in a video.
We introduce three novel boundary-aware pretext tasks: Shot-Scene Matching, Contextual Group Matching and Pseudo-boundary Prediction.
We achieve the new state-of-the-art on the MovieNet-SSeg benchmark.
arXiv Detail & Related papers (2022-01-14T02:14:07Z)
- Iterative Frame-Level Representation Learning And Classification For
Semi-Supervised Temporal Action Segmentation [25.08516972520265]
Temporal action segmentation classifies the action of each frame in (long) video sequences.
We propose the first semi-supervised method for temporal action segmentation.
arXiv Detail & Related papers (2021-12-02T16:47:24Z)
- Dense Unsupervised Learning for Video Segmentation [49.46930315961636]
We present a novel approach to unsupervised learning for video object segmentation (VOS).
Unlike previous work, our formulation allows learning dense feature representations directly in a fully convolutional regime.
Our approach exceeds the segmentation accuracy of previous work despite using significantly less training data and compute power.
arXiv Detail & Related papers (2021-11-11T15:15:11Z)
- Action Shuffle Alternating Learning for Unsupervised Action Segmentation [38.32743770719661]
We train an RNN to recognize positive and negative action sequences, and the RNN's hidden layer is taken as our new action-level feature embedding.
As supervision of actions is not available, we specify an HMM that explicitly models action lengths, and infer a MAP action segmentation with the Viterbi algorithm.
The resulting action segmentation is used as pseudo-ground truth for estimating our action-level feature embedding and updating the HMM.
arXiv Detail & Related papers (2021-04-05T18:58:57Z)
- Anchor-Constrained Viterbi for Set-Supervised Action Segmentation [38.32743770719661]
This paper is about action segmentation under weak supervision in training.
We use a Hidden Markov Model (HMM) grounded on a multilayer perceptron (MLP) to label video frames.
In testing, a Monte Carlo sampling of action sets seen in training is used to generate candidate temporal sequences of actions.
arXiv Detail & Related papers (2021-04-05T18:50:21Z)