Language-Assisted Human Part Motion Learning for Skeleton-Based Temporal Action Segmentation
- URL: http://arxiv.org/abs/2410.06353v1
- Date: Tue, 8 Oct 2024 20:42:51 GMT
- Title: Language-Assisted Human Part Motion Learning for Skeleton-Based Temporal Action Segmentation
- Authors: Bowen Chen, Haoyu Ji, Zhiyong Wang, Benjamin Filtjens, Chunzhuo Wang, Weihong Ren, Bart Vanrumste, Honghai Liu,
- Abstract summary: Skeleton-based Temporal Action involves the dense action classification of variable-length skeleton sequences.
Current approaches apply graph-based networks to extract framewise, whole-body-level motion representations.
We propose a novel method named Language-assisted Human Part Motion Representation (LPL) to extract fine-grained part-level motion representations.
- Score: 11.759374280422113
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Skeleton-based Temporal Action Segmentation involves the dense action classification of variable-length skeleton sequences. Current approaches primarily apply graph-based networks to extract framewise, whole-body-level motion representations, and use one-hot encoded labels for model optimization. However, whole-body motion representations do not capture fine-grained part-level motion representations and the one-hot encoded labels neglect the intrinsic semantic relationships within the language-based action definitions. To address these limitations, we propose a novel method named Language-assisted Human Part Motion Representation Learning (LPL), which contains a Disentangled Part Motion Encoder (DPE) to extract dual-level (i.e., part and whole-body) motion representations and a Language-assisted Distribution Alignment (LDA) strategy for optimizing spatial relations within representations. Specifically, after part-aware skeleton encoding via DPE, LDA generates dual-level action descriptions to construct a textual embedding space with the help of a large-scale language model. Then, LDA motivates the alignment of the embedding space between text descriptions and motions. This alignment allows LDA not only to enhance intra-class compactness but also to transfer the language-encoded semantic correlations among actions to skeleton-based motion learning. Moreover, we propose a simple yet efficient Semantic Offset Adapter to smooth the cross-domain misalignment. Our experiments indicate that LPL achieves state-of-the-art performance across various datasets (e.g., +4.4\% Accuracy, +5.6\% F1 on the PKU-MMD dataset). Moreover, LDA is compatible with existing methods and improves their performance (e.g., +4.8\% Accuracy, +4.3\% F1 on the LARa dataset) without additional inference costs.
Related papers
- Zero-Shot Skeleton-based Action Recognition with Dual Visual-Text Alignment [11.72557768532557]
Key to zero-shot action recognition lies in aligning visual features with semantic vectors representing action categories.
Our approach achieves state-of-the-art performances on several popular zero-shot skeleton-based action recognition benchmarks.
arXiv Detail & Related papers (2024-09-22T06:44:58Z) - Part-aware Unified Representation of Language and Skeleton for Zero-shot Action Recognition [57.97930719585095]
We introduce Part-aware Unified Representation between Language and Skeleton (PURLS) to explore visual-semantic alignment at both local and global scales.
Our approach is evaluated on various skeleton/language backbones and three large-scale datasets.
The results showcase the universality and superior performance of PURLS, surpassing prior skeleton-based solutions and standard baselines from other domains.
arXiv Detail & Related papers (2024-06-19T08:22:32Z) - MASA: Motion-aware Masked Autoencoder with Semantic Alignment for Sign Language Recognition [94.56755080185732]
We propose a Motion-Aware masked autoencoder with Semantic Alignment (MASA) that integrates rich motion cues and global semantic information.
Our framework can simultaneously learn local motion cues and global semantic features for comprehensive sign language representation.
arXiv Detail & Related papers (2024-05-31T08:06:05Z) - Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL)
GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves comparable performance with SOTA as well as being nearly 220 times faster in terms of computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z) - ZEETAD: Adapting Pretrained Vision-Language Model for Zero-Shot
End-to-End Temporal Action Detection [10.012716326383567]
Temporal action detection (TAD) involves the localization and classification of action instances within untrimmed videos.
We present ZEETAD, featuring two modules: dual-localization and zero-shot proposal classification.
We enhance discriminative capability on unseen classes by minimally updating the frozen CLIP encoder with lightweight adapters.
arXiv Detail & Related papers (2023-11-01T00:17:37Z) - SemanticBoost: Elevating Motion Generation with Augmented Textual Cues [73.83255805408126]
Our framework comprises a Semantic Enhancement module and a Context-Attuned Motion Denoiser (CAMD)
The CAMD approach provides an all-encompassing solution for generating high-quality, semantically consistent motion sequences.
Our experimental results demonstrate that SemanticBoost, as a diffusion-based method, outperforms auto-regressive-based techniques.
arXiv Detail & Related papers (2023-10-31T09:58:11Z) - Text-to-Motion Retrieval: Towards Joint Understanding of Human Motion
Data and Natural Language [4.86658723641864]
We propose a novel text-to-motion retrieval task, which aims at retrieving relevant motions based on a specified natural description.
Inspired by the recent progress in text-to-image/video matching, we experiment with two widely-adopted metric-learning loss functions.
arXiv Detail & Related papers (2023-05-25T08:32:41Z) - Modeling Motion with Multi-Modal Features for Text-Based Video
Segmentation [56.41614987789537]
Text-based video segmentation aims to segment the target object in a video based on a describing sentence.
We propose a method to fuse and align appearance, motion, and linguistic features to achieve accurate segmentation.
arXiv Detail & Related papers (2022-04-06T02:42:33Z) - Unsupervised Motion Representation Learning with Capsule Autoencoders [54.81628825371412]
Motion Capsule Autoencoder (MCAE) models motion in a two-level hierarchy.
MCAE is evaluated on a novel Trajectory20 motion dataset and various real-world skeleton-based human action datasets.
arXiv Detail & Related papers (2021-10-01T16:52:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.