ASFormer: Transformer for Action Segmentation
- URL: http://arxiv.org/abs/2110.08568v1
- Date: Sat, 16 Oct 2021 13:07:20 GMT
- Title: ASFormer: Transformer for Action Segmentation
- Authors: Fangqiu Yi and Hongyu Wen and Tingting Jiang
- Abstract summary: We present an efficient Transformer-based model for the action segmentation task, named ASFormer.
Local connectivity inductive priors constrain the hypothesis space to a reliable scope, helping the model learn a proper target function from small training sets.
We apply a pre-defined hierarchical representation pattern that efficiently handles long input sequences.
- Score: 9.509416095106493
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Algorithms for the action segmentation task typically use temporal models to
predict what action is occurring at each frame for a minute-long daily
activity. Recent studies have shown the potential of Transformer in modeling
the relations among elements in sequential data. However, there are several
major concerns when directly applying the Transformer to the action
segmentation task, such as the lack of inductive biases with small training
sets, the difficulty of processing long input sequences, and the limited ability of the
decoder architecture to exploit temporal relations among multiple action
segments when refining the initial predictions. To address these concerns, we
design an efficient Transformer-based model for the action segmentation task, named
ASFormer, with three distinctive characteristics: (i) We explicitly bring in
the local connectivity inductive priors because of the high locality of
features. This constrains the hypothesis space to a reliable scope and helps the
model learn a proper target function from small training sets. (ii) We apply a
pre-defined hierarchical
representation pattern that efficiently handles long input sequences. (iii) We
carefully design the decoder to refine the initial predictions from the
encoder. Extensive experiments on three public datasets demonstrate the
effectiveness of our method. Code is available at
\url{https://github.com/ChinaYi/ASFormer}.
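To make ideas (i) and (ii) concrete, here is a minimal PyTorch sketch, not the authors' implementation (see the linked repository for that): self-attention is masked to a local window, and the window doubles at each encoder layer so deeper layers cover longer temporal context. The dense attention matrix and all module and parameter names here are purely illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalAttention(nn.Module):
    """Single-head self-attention restricted to a local window (idea i).
    Illustrative only; the dense T x T score matrix is kept for clarity."""
    def __init__(self, dim, window):
        super().__init__()
        self.window = window
        self.qkv = nn.Linear(dim, 3 * dim)

    def forward(self, x):                        # x: (B, T, C)
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scores = q @ k.transpose(1, 2) / C ** 0.5            # (B, T, T)
        idx = torch.arange(T, device=x.device)
        local = (idx[None, :] - idx[:, None]).abs() < self.window
        scores = scores.masked_fill(~local, float('-inf'))   # keep local pairs only
        return F.softmax(scores, dim=-1) @ v

class WindowedEncoder(nn.Module):
    """Stack of layers whose attention window doubles per layer (idea ii)."""
    def __init__(self, dim=64, num_layers=6):
        super().__init__()
        self.layers = nn.ModuleList(
            [LocalAttention(dim, window=2 ** i) for i in range(num_layers)])

    def forward(self, x):
        for layer in self.layers:
            x = x + layer(x)                     # residual connection
        return x

feats = torch.randn(1, 1000, 64)                 # 1000 frames of clip features
print(WindowedEncoder()(feats).shape)            # torch.Size([1, 1000, 64])
```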
Related papers
- An Effective-Efficient Approach for Dense Multi-Label Action Detection [23.100602876056165]
Dense multi-label action detection requires simultaneously learning (i) temporal dependencies and (ii) co-occurrence action relationships.
Recent approaches model temporal information by extracting multi-scale features through hierarchical transformer-based networks.
We argue that combining this with multiple sub-sampling processes in hierarchical designs can lead to further loss of positional information.
arXiv Detail & Related papers (2024-06-10T11:33:34Z) - Activity Grammars for Temporal Action Segmentation [71.03141719666972]
Temporal action segmentation aims at translating an untrimmed activity video into a sequence of action segments.
This paper introduces an effective activity grammar to guide neural predictions for temporal action segmentation.
Experimental results demonstrate that our method significantly improves temporal action segmentation in terms of both performance and interpretability.
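The paper's induced grammar and parser are considerably richer, but the core idea, restricting predictions to grammar-consistent action sequences, can be sketched with a toy successor table and a Viterbi-style pass; everything below (the grammar, names, and scores) is invented for illustration.

```python
import numpy as np

# Toy "grammar": which action may follow which (invented for illustration).
ALLOWED = {"take": {"take", "cut"}, "cut": {"cut", "put"}, "put": {"put", "take"}}
ACTIONS = list(ALLOWED)

def grammar_viterbi(log_probs):
    """log_probs: (T, A) frame-wise scores; returns the best label
    sequence whose every transition is licensed by ALLOWED."""
    T, A = log_probs.shape
    dp = log_probs[0].copy()                     # best score ending in each action
    back = np.zeros((T, A), dtype=int)
    for t in range(1, T):
        new_dp = np.full(A, -np.inf)
        for j, act in enumerate(ACTIONS):
            for i, prev in enumerate(ACTIONS):
                if act in ALLOWED[prev] and dp[i] + log_probs[t, j] > new_dp[j]:
                    new_dp[j] = dp[i] + log_probs[t, j]
                    back[t, j] = i
        dp = new_dp
    path = [int(dp.argmax())]                    # backtrace from the best end state
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [ACTIONS[j] for j in reversed(path)]

print(grammar_viterbi(np.log(np.random.dirichlet(np.ones(3), size=8))))
```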
arXiv Detail & Related papers (2023-12-07T12:45:33Z) - BIT: Bi-Level Temporal Modeling for Efficient Supervised Action
Segmentation [34.88225099758585]
Supervised action segmentation aims to partition a video into non-overlapping segments, each representing a different action.
Recent works apply transformers for temporal modeling at the frame level, which incurs high computational cost.
We propose an efficient bi-level temporal modeling framework that learns explicit action tokens to represent action segments.
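A hedged sketch of the action-token idea (all names invented; the paper's bi-level design is more involved): a small set of K learned tokens cross-attends to the T frame features, so segment-level reasoning costs O(K*T) rather than the O(T^2) of frame-to-frame attention.

```python
import torch
import torch.nn as nn

class ActionTokenLayer(nn.Module):
    """K learned action tokens cross-attend to T frame features.
    Illustrative module, not the paper's architecture."""
    def __init__(self, dim=64, num_tokens=8):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, frames):                   # frames: (B, T, C)
        B = frames.size(0)
        q = self.tokens.unsqueeze(0).expand(B, -1, -1)   # (B, K, C) queries
        tokens, _ = self.attn(q, frames, frames)         # segment-level summaries
        return tokens

frames = torch.randn(2, 5000, 64)                # a long untrimmed video
print(ActionTokenLayer()(frames).shape)          # torch.Size([2, 8, 64])
```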
arXiv Detail & Related papers (2023-08-28T20:59:15Z) - Semi-Structured Object Sequence Encoders [9.257633944317735]
We focus on the problem of developing a structure-aware input representation for semi-structured object sequences.
This type of data is often represented as a sequence of sets of key-value pairs over time.
We propose a two-part approach, which first considers each key independently and encodes a representation of its values over time.
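A minimal sketch of that first part, under assumed details (hashed value ids and a shared GRU; the paper's encoder differs): each key's value sequence is encoded independently over time, yielding one temporal representation per key.

```python
import torch
import torch.nn as nn

class KeyWiseEncoder(nn.Module):
    """Encode each key's value sequence independently over time (sketch)."""
    def __init__(self, keys, vocab_size=1000, dim=32):
        super().__init__()
        self.keys = keys
        self.embed = nn.Embedding(vocab_size, dim)  # hashed value ids -> vectors
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, records):                  # records: list over time of dicts
        reps = {}
        for key in self.keys:
            # Gather this key's value id at every timestep (0 if absent).
            ids = torch.tensor([[r.get(key, 0) for r in records]])
            _, h = self.rnn(self.embed(ids))     # h: (1, 1, dim) final state
            reps[key] = h.squeeze()              # one vector per key
        return reps

records = [{"status": 3, "user": 7}, {"status": 4}, {"status": 4, "user": 9}]
reps = KeyWiseEncoder(keys=["status", "user"])(records)
print({k: v.shape for k, v in reps.items()})     # each torch.Size([32])
```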
arXiv Detail & Related papers (2023-01-03T09:19:41Z) - CloudAttention: Efficient Multi-Scale Attention Scheme For 3D Point
Cloud Learning [81.85951026033787]
We adopt transformers in this work and incorporate them into a hierarchical framework for shape classification and part and scene segmentation.
We also compute efficient and dynamic global cross attentions by leveraging sampling and grouping at each iteration.
The proposed hierarchical model achieves state-of-the-art shape classification in mean accuracy and yields results on par with previous segmentation methods.
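A hedged sketch of the sampling idea (random sampling stands in for whatever sampling/grouping scheme the paper uses; all names are invented): letting every point cross-attend to a small set of sampled anchors yields global context at O(N*M) rather than O(N^2) cost.

```python
import torch
import torch.nn as nn

class SampledCrossAttention(nn.Module):
    """Every point attends to M sampled anchor points (illustrative only)."""
    def __init__(self, dim=32, num_anchors=64):
        super().__init__()
        self.num_anchors = num_anchors
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, points):                   # points: (B, N, C) features
        B, N, _ = points.shape
        # Random sampling here; a real model might use farthest-point sampling.
        idx = torch.randperm(N)[: self.num_anchors]
        anchors = points[:, idx]                 # (B, M, C)
        out, _ = self.attn(points, anchors, anchors)  # global context per point
        return out

pts = torch.randn(2, 4096, 32)
print(SampledCrossAttention()(pts).shape)        # torch.Size([2, 4096, 32])
```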
arXiv Detail & Related papers (2022-07-31T21:39:15Z) - Generating Sparse Counterfactual Explanations For Multivariate Time
Series [0.5161531917413706]
We propose a generative adversarial network (GAN) architecture that generates SPARse Counterfactual Explanations for multivariate time series.
Our approach provides a custom sparsity layer and regularizes the counterfactual loss function in terms of similarity, sparsity, and smoothness of trajectories.
We evaluate our approach on real-world human motion datasets as well as a synthetic time series interpretability benchmark.
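Setting the GAN machinery aside, the three regularization terms are easy to illustrate; the function, names, and weights below are invented for illustration.

```python
import torch

def counterfactual_loss(x, x_cf, lam=(1.0, 0.5, 0.1)):
    """x, x_cf: (T, D) original and counterfactual multivariate series."""
    similarity = (x_cf - x).pow(2).mean()              # stay close to the original
    sparsity = (x_cf - x).abs().mean()                 # change few entries (L1)
    smoothness = (x_cf[1:] - x_cf[:-1]).pow(2).mean()  # no jagged trajectories
    return lam[0] * similarity + lam[1] * sparsity + lam[2] * smoothness

x = torch.randn(100, 6)                          # e.g. 6 joint angles over time
x_cf = x + 0.1 * torch.randn_like(x)
print(counterfactual_loss(x, x_cf))
```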
arXiv Detail & Related papers (2022-06-02T08:47:06Z) - LocATe: End-to-end Localization of Actions in 3D with Transformers [91.28982770522329]
LocATe is an end-to-end approach that jointly localizes and recognizes actions in a 3D sequence.
Unlike transformer-based object-detection and classification models which consider image or patch features as input, LocATe's transformer model is capable of capturing long-term correlations between actions in a sequence.
We introduce a new, challenging, and more realistic benchmark dataset, BABEL-TAL-20 (BT20), where the performance of state-of-the-art methods is significantly worse.
arXiv Detail & Related papers (2022-03-21T03:35:32Z) - TraSeTR: Track-to-Segment Transformer with Contrastive Query for
Instance-level Instrument Segmentation in Robotic Surgery [60.439434751619736]
We propose TraSeTR, a Track-to-Segment Transformer that exploits tracking cues to assist surgical instrument segmentation.
TraSeTR jointly reasons about the instrument type, location, and identity with instance-level predictions.
The effectiveness of our method is demonstrated with state-of-the-art instrument type segmentation results on three public datasets.
arXiv Detail & Related papers (2022-02-17T05:52:18Z) - Few-shot Sequence Learning with Transformers [79.87875859408955]
Few-shot algorithms aim at learning new tasks provided only a handful of training examples.
In this work we investigate few-shot learning in the setting where the data points are sequences of tokens.
We propose an efficient learning algorithm based on Transformers.
arXiv Detail & Related papers (2020-12-17T12:30:38Z) - MS-TCN++: Multi-Stage Temporal Convolutional Network for Action
Segmentation [87.16030562892537]
We propose a multi-stage architecture for the temporal action segmentation task.
The first stage generates an initial prediction that is refined by the next ones.
Our models achieve state-of-the-art results on three datasets.
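The stage-wise refinement pattern is simple to sketch. Below is a hedged, simplified rendition (not the released MS-TCN++ code): each stage stacks dilated temporal convolutions, the first stage predicts from frame features, and every later stage re-predicts from the previous stage's softmaxed output.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Stage(nn.Module):
    """One simplified stage: dilated temporal convs producing class scores."""
    def __init__(self, in_dim, num_classes, hidden=64, layers=4):
        super().__init__()
        self.inp = nn.Conv1d(in_dim, hidden, 1)
        self.convs = nn.ModuleList(
            [nn.Conv1d(hidden, hidden, 3, padding=2 ** i, dilation=2 ** i)
             for i in range(layers)])
        self.out = nn.Conv1d(hidden, num_classes, 1)

    def forward(self, x):                        # x: (B, C, T)
        h = self.inp(x)
        for conv in self.convs:
            h = h + F.relu(conv(h))              # residual dilated block
        return self.out(h)

class MultiStage(nn.Module):
    """First stage predicts from features; each later stage refines the
    previous stage's softmaxed predictions."""
    def __init__(self, feat_dim=2048, num_classes=10, stages=4):
        super().__init__()
        self.first = Stage(feat_dim, num_classes)
        self.rest = nn.ModuleList(
            [Stage(num_classes, num_classes) for _ in range(stages - 1)])

    def forward(self, feats):                    # feats: (B, feat_dim, T)
        outputs = [self.first(feats)]
        for stage in self.rest:
            outputs.append(stage(outputs[-1].softmax(dim=1)))
        return outputs                           # per-stage predictions for the loss

video = torch.randn(1, 2048, 3000)               # frame features of a long video
print([o.shape for o in MultiStage()(video)])    # four (1, 10, 3000) tensors
```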
arXiv Detail & Related papers (2020-06-16T14:50:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.