Action Quality Assessment with Temporal Parsing Transformer
- URL: http://arxiv.org/abs/2207.09270v1
- Date: Tue, 19 Jul 2022 13:29:05 GMT
- Title: Action Quality Assessment with Temporal Parsing Transformer
- Authors: Yang Bai, Desen Zhou, Songyang Zhang, Jian Wang, Errui Ding, Yu Guan,
Yang Long, Jingdong Wang
- Abstract summary: Action Quality Assessment (AQA) is important for action understanding, and resolving the task poses unique challenges due to subtle visual differences.
We propose a temporal parsing transformer to decompose the holistic feature into temporal part-level representations.
Our proposed method outperforms prior work on three public AQA benchmarks by a considerable margin.
- Score: 84.1272079121699
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Action Quality Assessment (AQA) is important for action understanding, and
resolving the task poses unique challenges due to subtle visual differences.
Existing state-of-the-art methods typically rely on holistic video
representations for score regression or ranking, which limits their ability to
capture fine-grained intra-class variation. To overcome the
above limitation, we propose a temporal parsing transformer to decompose the
holistic feature into temporal part-level representations. Specifically, we
utilize a set of learnable queries to represent the atomic temporal patterns
for a specific action. Our decoding process converts the frame representations
to a fixed number of temporally ordered part representations. To obtain the
quality score, we adopt the state-of-the-art contrastive regression based on
the part representations. Since existing AQA datasets do not provide temporal
part-level labels or partitions, we propose two novel loss functions on the
cross attention responses of the decoder: a ranking loss that constrains the
learnable queries to satisfy the temporal order in cross attention, and a
sparsity loss that encourages the part representations to be more discriminative.
Extensive experiments show that our proposed method outperforms prior work on
three public AQA benchmarks by a considerable margin.
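As a rough illustration of the decoding process described in the abstract, the sketch below shows how a fixed set of learnable part queries could cross-attend over frame features to produce a fixed number of part representations together with the cross-attention responses. This is a minimal sketch under assumed details, not the authors' code: the class name TemporalPartDecoder, the single decoder layer, and the default dimensions are illustrative choices.

```python
import torch
import torch.nn as nn


class TemporalPartDecoder(nn.Module):
    """Illustrative single-layer decoder with learnable part queries (assumed names)."""

    def __init__(self, dim: int = 256, num_parts: int = 8, num_heads: int = 8):
        super().__init__()
        # One learnable query per atomic temporal pattern ("part") of the action.
        self.part_queries = nn.Parameter(torch.randn(num_parts, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, frame_feats: torch.Tensor):
        """frame_feats: (B, T, dim) clip/frame features from a video backbone."""
        B = frame_feats.size(0)
        queries = self.part_queries.unsqueeze(0).expand(B, -1, -1)  # (B, K, dim)
        # Each part query aggregates the frames it attends to via cross-attention.
        parts, attn = self.cross_attn(queries, frame_feats, frame_feats,
                                      need_weights=True)
        parts = parts + self.ffn(parts)  # (B, K, dim) part representations
        return parts, attn               # attn: (B, K, T) cross-attention responses
```

The returned (B, K, T) attention map is what the two losses sketched below operate on; in the paper, the part representations then feed a contrastive regression head that scores the input video relative to an exemplar video, which is omitted here.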
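The two supervision signals on the cross-attention responses can likewise be sketched. The abstract does not give their exact formulations, so the temporal-center hinge used for the ranking loss and the entropy-style sparsity penalty below are plausible stand-ins for the described ideas rather than the paper's definitions.

```python
import torch
import torch.nn.functional as F


def ranking_loss(attn: torch.Tensor, margin: float = 1.0) -> torch.Tensor:
    """Encourage query k to attend, on average, earlier in time than query k+1."""
    T = attn.size(-1)
    t = torch.arange(T, dtype=attn.dtype, device=attn.device)
    probs = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-6)
    centers = (probs * t).sum(dim=-1)  # (B, K) expected temporal position per query
    # Hinge penalty whenever an earlier query's center is not ahead of the next one's.
    return F.relu(margin + centers[:, :-1] - centers[:, 1:]).mean()


def sparsity_loss(attn: torch.Tensor) -> torch.Tensor:
    """Push each query's attention toward a peaked, discriminative distribution."""
    probs = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-6)
    entropy = -(probs * probs.clamp_min(1e-6).log()).sum(dim=-1)  # (B, K)
    return entropy.mean()
```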
Related papers
- Object-Centric Temporal Consistency via Conditional Autoregressive Inductive Biases [69.46487306858789]
Conditional Autoregressive Slot Attention (CA-SA) is a framework that enhances the temporal consistency of extracted object-centric representations in video-centric vision tasks.
We present qualitative and quantitative results showing that our proposed method outperforms the considered baselines on downstream tasks.
arXiv Detail & Related papers (2024-10-21T07:44:44Z)
- Interpretable Long-term Action Quality Assessment [12.343701556374556]
Long-term Action Quality Assessment (AQA) evaluates the execution of activities in videos.
Current AQA methods produce a single score by averaging clip features.
Long-term videos pose additional difficulty due to the complexity and diversity of actions.
arXiv Detail & Related papers (2024-08-21T15:09:09Z)
- Deep Common Feature Mining for Efficient Video Semantic Segmentation [29.054945307605816]
We present Deep Common Feature Mining (DCFM) for video semantic segmentation.
DCFM explicitly decomposes features into two complementary components.
We show that our method has a superior balance between accuracy and efficiency.
arXiv Detail & Related papers (2024-03-05T06:17:59Z)
- Temporal-aware Hierarchical Mask Classification for Video Semantic Segmentation [62.275143240798236]
Video semantic segmentation datasets have limited categories per video.
Less than 10% of queries could be matched to receive meaningful gradient updates during VSS training.
Our method achieves state-of-the-art performance on the latest challenging VSS benchmark VSPW without bells and whistles.
arXiv Detail & Related papers (2023-09-14T20:31:06Z)
- Multi-modal Prompting for Low-Shot Temporal Action Localization [95.19505874963751]
We consider the problem of temporal action localization under the low-shot (zero-shot & few-shot) scenario.
We adopt a Transformer-based two-stage action localization architecture with class-agnostic action proposal, followed by open-vocabulary classification.
arXiv Detail & Related papers (2023-03-21T10:40:13Z)
- ReAct: Temporal Action Detection with Relational Queries [84.76646044604055]
This work aims at advancing temporal action detection (TAD) using an encoder-decoder framework with action queries.
We first propose a relational attention mechanism in the decoder, which guides the attention among queries based on their relations.
Lastly, we propose to predict the localization quality of each action query at inference in order to distinguish high-quality queries.
arXiv Detail & Related papers (2022-07-14T17:46:37Z)
- Temporal Transductive Inference for Few-Shot Video Object Segmentation [27.140141181513425]
Few-shot video object segmentation (FS-VOS) aims at segmenting video frames using a few labelled examples of classes not seen during initial training.
Key to our approach is the use of both global and local temporal constraints.
Empirically, our model outperforms state-of-the-art meta-learning approaches in terms of mean intersection over union on YouTube-VIS by 2.8%.
arXiv Detail & Related papers (2022-03-27T14:08:30Z)
- Spatio-temporal Relation Modeling for Few-shot Action Recognition [100.3999454780478]
We propose a few-shot action recognition framework, STRM, which enhances class-specific feature discriminability while simultaneously learning higher-order temporal representations.
Our approach achieves an absolute gain of 3.5% in classification accuracy, as compared to the best existing method in the literature.
arXiv Detail & Related papers (2021-12-09T18:59:14Z)