Anchor-Constrained Viterbi for Set-Supervised Action Segmentation
- URL: http://arxiv.org/abs/2104.02113v1
- Date: Mon, 5 Apr 2021 18:50:21 GMT
- Title: Anchor-Constrained Viterbi for Set-Supervised Action Segmentation
- Authors: Jun Li, Sinisa Todorovic
- Abstract summary: This paper is about action segmentation under weak supervision in training.
We use a Hidden Markov Model (HMM) grounded on a multilayer perceptron (MLP) to label video frames.
In testing, a Monte Carlo sampling of action sets seen in training is used to generate candidate temporal sequences of actions.
- Score: 38.32743770719661
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper is about action segmentation under weak supervision in training,
where the ground truth provides only a set of actions present, but neither
their temporal ordering nor when they occur in a training video. We use a
Hidden Markov Model (HMM) grounded on a multilayer perceptron (MLP) to label
video frames, and thus generate a pseudo-ground truth for the subsequent
pseudo-supervised training. In testing, a Monte Carlo sampling of action sets
seen in training is used to generate candidate temporal sequences of actions,
and select the maximum posterior sequence. Our key contribution is a new
anchor-constrained Viterbi algorithm (ACV) for generating the pseudo-ground
truth, where anchors are salient action parts estimated for each action from a
given ground-truth set. Our evaluation on the tasks of action segmentation and
alignment on the Breakfast, MPII Cooking 2, and Hollywood Extended benchmark
datasets demonstrates our superior performance relative to that of prior work.
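The core contribution, anchor-constrained Viterbi (ACV), lends itself to a short dynamic-programming sketch. The code below is a minimal illustration under stated assumptions, not the authors' implementation: the function name, the prefix-sum segment scoring, and the hard anchor-coverage constraint are choices of this sketch, and `log_probs`, `actions`, and `anchors` are hypothetical inputs standing in for the MLP's frame-wise posteriors, one candidate ordering of the ground-truth action set, and the estimated salient-part frames.

```python
# A minimal sketch of anchor-constrained Viterbi decoding (not the authors'
# implementation). Assumptions, all illustrative: log_probs[t, a] holds the
# frame-wise log-posterior of action a at frame t (in the paper these come
# from the MLP grounding the HMM); actions is one candidate ordering of the
# ground-truth action set; anchors[i] is the frame index of the estimated
# salient part of actions[i], which the i-th segment must cover. Anchors must
# be increasing in the order of actions for a feasible segmentation.
import numpy as np

def anchor_constrained_viterbi(log_probs, actions, anchors):
    """Partition T frames into len(actions) ordered segments, maximizing the
    summed frame log-probabilities subject to segment i covering anchors[i].
    Returns -inf as the score if the anchors admit no feasible segmentation."""
    T = log_probs.shape[0]
    K = len(actions)
    # Prefix sums of per-frame scores, so any segment score is O(1) to read.
    prefix = np.vstack([np.zeros(log_probs.shape[1]),
                        np.cumsum(log_probs, axis=0)])
    # dp[i, t]: best score of assigning frames [0, t) to the first i actions.
    dp = np.full((K + 1, T + 1), -np.inf)
    back = np.zeros((K + 1, T + 1), dtype=int)
    dp[0, 0] = 0.0
    for i in range(1, K + 1):
        a = actions[i - 1]
        for t in range(i, T + 1):          # segment i covers frames [s, t)
            for s in range(i - 1, t):
                # Anchor constraint: the segment must contain its anchor.
                if not (s <= anchors[i - 1] < t):
                    continue
                score = dp[i - 1, s] + prefix[t, a] - prefix[s, a]
                if score > dp[i, t]:
                    dp[i, t] = score
                    back[i, t] = s
    # Backtrack the optimal segment boundaries.
    segments, t = [], T
    for i in range(K, 0, -1):
        s = back[i, t]
        segments.append((s, t, actions[i - 1]))
        t = s
    return dp[K, T], segments[::-1]
```

At test time, the paper draws Monte Carlo samples of action sets seen in training to propose candidate action sequences; under this sketch, that would amount to calling the decoder once per sampled sequence (with anchors consistent with its ordering) and keeping the maximum-posterior decoding.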
Related papers
- Open-Vocabulary Spatio-Temporal Action Detection [59.91046192096296]
Open-vocabulary spatio-temporal action detection (OV-STAD) is an important fine-grained video understanding task.
OV-STAD requires training a model on a limited set of base classes with box and label supervision.
To better adapt the holistic VLM for the fine-grained action detection task, we carefully fine-tune it on the localized video region-text pairs.
arXiv Detail & Related papers (2024-05-17T14:52:47Z) - Weakly-Supervised Temporal Action Localization with Bidirectional
Semantic Consistency Constraint [83.36913240873236]
Weakly-supervised temporal action localization (WTAL) aims to classify actions and localize their temporal boundaries in videos.
We propose a simple yet efficient method, named bidirectional semantic consistency constraint (Bi-SCC), to discriminate positive actions from co-scene actions.
Experimental results show that our approach outperforms the state-of-the-art methods on THUMOS14 and ActivityNet.
arXiv Detail & Related papers (2023-04-25T07:20:33Z) - Fine-grained Temporal Contrastive Learning for Weakly-supervised
Temporal Action Localization [87.47977407022492]
This paper argues that learning by contextually comparing sequence-to-sequence distinctions offers an essential inductive bias in weakly-supervised action localization.
Under a differentiable dynamic programming formulation, two complementary contrastive objectives are designed, including Fine-grained Sequence Distance (FSD) contrasting and Longest Common Subsequence (LCS) contrasting.
Our method achieves state-of-the-art performance on two popular benchmarks.
arXiv Detail & Related papers (2022-03-31T05:13:50Z) - Unsupervised Pre-training for Temporal Action Localization Tasks [76.01985780118422]
We propose a self-supervised pretext task, coined Pseudo Action Localization (PAL), to Unsupervisedly Pre-train feature encoders for Temporal Action Localization tasks (UP-TAL).
Specifically, we first randomly select temporal regions, each of which contains multiple clips, from one video as pseudo actions and then paste them onto different temporal positions of the other two videos.
The pretext task is to align the features of pasted pseudo action regions from two synthetic videos and maximize the agreement between them.
arXiv Detail & Related papers (2022-03-25T12:13:43Z) - Boundary-aware Self-supervised Learning for Video Scene Segmentation [20.713635723315527]
Video scene segmentation is the task of temporally localizing scene boundaries in a video.
We introduce three novel boundary-aware pretext tasks: Shot-Scene Matching, Contextual Group Matching and Pseudo-boundary Prediction.
We achieve the new state-of-the-art on the MovieNet-SSeg benchmark.
arXiv Detail & Related papers (2022-01-14T02:14:07Z) - Unsupervised Action Segmentation with Self-supervised Feature Learning
and Co-occurrence Parsing [32.66011849112014]
Temporal action segmentation is the task of classifying each frame of a video with an action label.
In this work we explore a self-supervised method that operates on a corpus of unlabeled videos and predicts a likely set of temporal segments across the videos.
We develop CAP, a novel co-occurrence action parsing algorithm that can not only capture the correlation among sub-actions underlying the structure of activities, but also estimate the temporal trajectory of the sub-actions in an accurate and general way.
arXiv Detail & Related papers (2021-05-29T00:29:40Z) - Action Shuffle Alternating Learning for Unsupervised Action Segmentation [38.32743770719661]
We train an RNN to recognize positive and negative action sequences, and the RNN's hidden layer is taken as our new action-level feature embedding.
As supervision of actions is not available, we specify an HMM that explicitly models action lengths, and infer a MAP action segmentation with the Viterbi algorithm.
The resulting action segmentation is used as pseudo-ground truth for estimating our action-level feature embedding and updating the HMM.
arXiv Detail & Related papers (2021-04-05T18:58:57Z) - Action Duration Prediction for Segment-Level Alignment of Weakly-Labeled
Videos [4.318555434063273]
This paper focuses on weakly-supervised action alignment, where only the ordered sequence of video-level actions is available for training.
We propose a novel Duration Network, which captures a short temporal window of the video and learns to predict the remaining duration of a given action at any point in time with a level of granularity based on the type of that action.
arXiv Detail & Related papers (2020-11-20T03:16:53Z) - Weakly Supervised Temporal Action Localization with Segment-Level Labels [140.68096218667162]
Temporal action localization presents a trade-off between test performance and annotation-time cost.
We introduce a new segment-level supervision setting: a segment is labeled when annotators observe an action happening within it.
We devise a partial segment loss, regarded as a form of loss sampling, to learn integral action parts from labeled segments.
arXiv Detail & Related papers (2020-07-03T10:32:19Z) - Set-Constrained Viterbi for Set-Supervised Action Segmentation [40.22433538226469]
This paper is about weakly supervised action segmentation.
The ground truth specifies only a set of actions present in a training video, but not their true temporal ordering.
We extend this framework by specifying an HMM, which accounts for co-occurrences of action classes and their temporal lengths.
arXiv Detail & Related papers (2020-02-27T05:32:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.