Activity Grammars for Temporal Action Segmentation
- URL: http://arxiv.org/abs/2312.04266v1
- Date: Thu, 7 Dec 2023 12:45:33 GMT
- Title: Activity Grammars for Temporal Action Segmentation
- Authors: Dayoung Gong, Joonseok Lee, Deunsol Jung, Suha Kwak, Minsu Cho
- Abstract summary: Temporal action segmentation aims at translating an untrimmed activity video into a sequence of action segments.
This paper introduces an effective activity grammar to guide neural predictions for temporal action segmentation.
Experimental results demonstrate that our method significantly improves temporal action segmentation in terms of both performance and interpretability.
- Score: 71.03141719666972
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sequence prediction on temporal data requires the ability to understand
compositional structures of multi-level semantics beyond individual and
contextual properties. The task of temporal action segmentation, which aims at
translating an untrimmed activity video into a sequence of action segments,
remains challenging for this reason. This paper addresses the problem by
introducing an effective activity grammar to guide neural predictions for
temporal action segmentation. We propose a novel grammar induction algorithm
that extracts a powerful context-free grammar from action sequence data. We
also develop an efficient generalized parser that transforms frame-level
probability distributions into a reliable sequence of actions according to the
induced grammar with recursive rules. Our approach can be combined with any
neural network for temporal action segmentation to enhance the sequence
prediction and discover its compositional structure. Experimental results
demonstrate that our method significantly improves temporal action segmentation
in terms of both performance and interpretability on two standard benchmarks,
Breakfast and 50 Salads.
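The abstract describes two components, a grammar induction algorithm and a generalized parser over frame-level probabilities, without code. The following is a minimal, self-contained sketch of the general idea of grammar-guided decoding only: candidate action sequences are enumerated from a toy context-free grammar (with one recursive rule) and each candidate is aligned to frame-level log-probabilities by dynamic programming. The grammar, action labels, and scoring are illustrative assumptions; this is not the paper's induction algorithm or its generalized parser.

```python
# Hedged sketch: grammar-constrained decoding of frame-level action probabilities.
# NOT the paper's method; it only illustrates restricting the predicted action
# sequence to strings generated by an activity grammar.
import numpy as np
from itertools import product

# A toy context-free grammar over action symbols.
# Non-terminals are uppercase strings; terminals are integer action labels.
GRAMMAR = {
    "S":     [["TAKE", "PREP", "SERVE"]],
    "TAKE":  [[0]],               # e.g. "take_bowl"
    "PREP":  [[1], [1, "PREP"]],  # one or more "cut_fruit" (recursive rule)
    "SERVE": [[2]],               # e.g. "serve_salad"
}

def expand(symbol, depth=6):
    """Enumerate terminal strings derivable from `symbol`, up to a depth limit."""
    if isinstance(symbol, int):
        return [[symbol]]
    if depth == 0:
        return []
    out = []
    for rhs in GRAMMAR[symbol]:
        parts = [expand(s, depth - 1) for s in rhs]
        if any(len(p) == 0 for p in parts):
            continue
        for combo in product(*parts):
            out.append([t for part in combo for t in part])
    return out

def best_segmentation(log_probs, actions):
    """DP: best score of splitting all frames, in order, into the given actions.
    log_probs: (T, C) per-frame log-probabilities; each action covers >= 1 frame."""
    T, _ = log_probs.shape
    K = len(actions)
    NEG = -1e18
    dp = np.full((K + 1, T + 1), NEG)
    dp[0, 0] = 0.0
    for k in range(1, K + 1):
        for t in range(1, T + 1):
            # frame t-1 either extends segment k or starts it after segment k-1
            dp[k, t] = max(dp[k, t - 1], dp[k - 1, t - 1]) + log_probs[t - 1, actions[k - 1]]
    return dp[K, T]

def grammar_guided_decode(log_probs):
    """Pick the grammar-licensed action sequence with the best frame alignment."""
    candidates = expand("S")
    scored = [(best_segmentation(log_probs, seq), seq) for seq in candidates]
    return max(scored)[1]

# Usage: 10 frames, 3 action classes, noisy frame-wise scores.
rng = np.random.default_rng(0)
logits = rng.normal(size=(10, 3))
log_probs = logits - np.log(np.exp(logits).sum(1, keepdims=True))
print(grammar_guided_decode(log_probs))
```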
Related papers
- Action parsing using context features [0.0]
We argue that context information, particularly the temporal information about other actions in the video sequence, is valuable for action segmentation.
The proposed parsing algorithm temporally segments the video sequence into action segments.
arXiv Detail & Related papers (2022-05-20T07:54:04Z) - FineDiving: A Fine-grained Dataset for Procedure-aware Action Quality
Assessment [93.09267863425492]
We argue that understanding both high-level semantics and internal temporal structures of actions in competitive sports videos is the key to making predictions accurate and interpretable.
We construct a new fine-grained dataset, called FineDiving, developed on diverse diving events with detailed annotations on action procedures.
arXiv Detail & Related papers (2022-04-07T17:59:32Z) - Fine-grained Temporal Contrastive Learning for Weakly-supervised
Temporal Action Localization [87.47977407022492]
This paper argues that learning by contextually comparing sequence-to-sequence distinctions offers an essential inductive bias in weakly-supervised action localization.
Under a differentiable dynamic programming formulation, two complementary contrastive objectives are designed, including Fine-grained Sequence Distance (FSD) contrasting and Longest Common Subsequence (LCS) contrasting.
Our method achieves state-of-the-art performance on two popular benchmarks.
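As a hedged illustration of the LCS-contrasting idea under differentiable dynamic programming, the sketch below computes a soft longest-common-subsequence score between two frame-wise probability sequences by replacing the hard max in the LCS recursion with a smooth maximum. The similarity measure and temperature are assumptions; this is not the paper's FSD or LCS objective.

```python
# Hedged sketch: a soft (differentiable) LCS score between two probability sequences.
import numpy as np

def soft_lcs(p, q, tau=0.1):
    """p: (T1, C), q: (T2, C) probability sequences; returns a soft LCS length."""
    sim = p @ q.T                       # (T1, T2) frame-to-frame similarity
    T1, T2 = sim.shape
    D = np.zeros((T1 + 1, T2 + 1))
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            cand = np.array([
                D[i - 1, j - 1] + sim[i - 1, j - 1],  # "match" these two frames
                D[i - 1, j],                          # skip a frame of p
                D[i, j - 1],                          # skip a frame of q
            ])
            # smooth maximum via a stable log-sum-exp; tau -> 0 recovers hard LCS
            m = cand.max()
            D[i, j] = m + tau * np.log(np.exp((cand - m) / tau).sum())
    return D[T1, T2]

# Usage: two short probability sequences over 4 action classes.
rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(4), size=6)
q = rng.dirichlet(np.ones(4), size=5)
print(soft_lcs(p, q))
```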
arXiv Detail & Related papers (2022-03-31T05:13:50Z) - Discontinuous Grammar as a Foreign Language [0.7412445894287709]
We extend the framework of sequence-to-sequence models for constituent parsing.
We design several novel linearizations that can fully produce discontinuities.
For the first time, we test a sequence-to-sequence model on the main discontinuous benchmarks.
arXiv Detail & Related papers (2021-10-20T08:58:02Z) - ASFormer: Transformer for Action Segmentation [9.509416095106493]
We present an efficient Transformer-based model for action segmentation task, named ASFormer.
ASFormer constrains the hypothesis space within a reliable scope, which helps the model learn a proper target function from small training sets.
We apply a pre-defined hierarchical representation pattern that efficiently handles long input sequences.
arXiv Detail & Related papers (2021-10-16T13:07:20Z) - Unsupervised Action Segmentation with Self-supervised Feature Learning
and Co-occurrence Parsing [32.66011849112014]
Temporal action segmentation is the task of classifying each frame of a video with an action label.
In this work we explore a self-supervised method that operates on a corpus of unlabeled videos and predicts a likely set of temporal segments across the videos.
We develop CAP, a novel co-occurrence action parsing algorithm that can not only capture the correlation among sub-actions underlying the structure of activities, but also estimate the temporal trajectory of the sub-actions in an accurate and general way.
arXiv Detail & Related papers (2021-05-29T00:29:40Z) - Learning to Abstract and Predict Human Actions [60.85905430007731]
We model the hierarchical structure of human activities in videos and demonstrate the power of such structure in action prediction.
We propose Hierarchical-Refresher-Anticipator, a multi-level neural machine that learns the structure of human activities by observing a partial hierarchy of events and rolls that structure out into future predictions at multiple levels of abstraction.
arXiv Detail & Related papers (2020-08-20T23:57:58Z) - Predicting Temporal Sets with Deep Neural Networks [50.53727580527024]
We propose an integrated solution based on the deep neural networks for temporal sets prediction.
A unique perspective is to learn element relationships by constructing a set-level co-occurrence graph.
We design an attention-based module to adaptively learn the temporal dependency of elements and sets.
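To make the co-occurrence-graph idea concrete, the sketch below builds a simple set-level co-occurrence matrix from a history of sets; the counting and normalization scheme is an assumption for illustration, not the paper's construction.

```python
# Hedged sketch: a set-level co-occurrence graph from a sequence of observed sets.
import numpy as np

def cooccurrence_graph(set_sequence, num_elements):
    """set_sequence: list of sets of element ids observed over time.
    Returns a row-normalized (num_elements, num_elements) co-occurrence matrix."""
    A = np.zeros((num_elements, num_elements))
    for s in set_sequence:
        items = sorted(s)
        for i, u in enumerate(items):
            for v in items[i + 1:]:
                A[u, v] += 1.0   # elements u and v co-occur in the same set
                A[v, u] += 1.0
    row_sum = A.sum(axis=1, keepdims=True)
    # rows with no co-occurrences stay zero
    return np.divide(A, row_sum, out=np.zeros_like(A), where=row_sum > 0)

# Usage: a user's purchase history as sets of item ids.
history = [{0, 1}, {1, 2, 3}, {0, 3}]
print(cooccurrence_graph(history, num_elements=4))
```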
arXiv Detail & Related papers (2020-06-20T03:29:02Z) - Inferring Temporal Compositions of Actions Using Probabilistic Automata [61.09176771931052]
We propose to express temporal compositions of actions as semantic regular expressions and derive an inference framework using probabilistic automata.
Our approach is different from existing works that either predict long-range complex activities as unordered sets of atomic actions, or retrieve videos using natural language sentences.
arXiv Detail & Related papers (2020-04-28T00:15:26Z)