SUGAR: Learning Skeleton Representation with Visual-Motion Knowledge for Action Recognition
- URL: http://arxiv.org/abs/2511.10091v1
- Date: Fri, 14 Nov 2025 01:31:32 GMT
- Title: SUGAR: Learning Skeleton Representation with Visual-Motion Knowledge for Action Recognition
- Authors: Qilang Ye, Yu Zhou, Lian He, Jie Zhang, Xuanming Guo, Jiayu Zhang, Mingkui Tan, Weicheng Xie, Yue Sun, Tao Tan, Xiaochen Yuan, Ghada Khoriba, Zitong Yu,
- Abstract summary: We introduce visUal-motion knowledGe for Action Recognition (SUGAR)<n>In our pipeline, we first utilize off-the-shelf large-scale video models as a knowledge base to generate visual, motion information related to actions.<n>We use the LLM with untouched pre-training weights to understand these representations and generate the desired action targets and descriptions.
- Score: 70.56416162106036
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) hold rich implicit knowledge and powerful transferability. In this paper, we explore the combination of LLMs with the human skeleton to perform action classification and description. However, when treating LLM as a recognizer, two questions arise: 1) How can LLMs understand skeleton? 2) How can LLMs distinguish among actions? To address these problems, we introduce a novel paradigm named learning Skeleton representation with visUal-motion knowledGe for Action Recognition (SUGAR). In our pipeline, we first utilize off-the-shelf large-scale video models as a knowledge base to generate visual, motion information related to actions. Then, we propose to supervise skeleton learning through this prior knowledge to yield discrete representations. Finally, we use the LLM with untouched pre-training weights to understand these representations and generate the desired action targets and descriptions. Notably, we present a Temporal Query Projection (TQP) module to continuously model the skeleton signals with long sequences. Experiments on several skeleton-based action classification benchmarks demonstrate the efficacy of our SUGAR. Moreover, experiments on zero-shot scenarios show that SUGAR is more versatile than linear-based methods.
Related papers
- Semore: VLM-guided Enhanced Semantic Motion Representations for Visual Reinforcement Learning [11.901989132359676]
We introduce Enhanced Semantic Motion Representations (Semore), a new VLM-based framework for visual reinforcement learning (RL)<n>Semore simultaneously extract semantic and motion representations through a dual-path backbone from the RGB flows.<n>Our method exhibits efficient and adaptive ability compared to state-of-art methods.
arXiv Detail & Related papers (2025-12-04T16:54:41Z) - If an LLM Were a Character, Would It Know Its Own Story? Evaluating Lifelong Learning in LLMs [55.8331366739144]
We introduce LIFESTATE-BENCH, a benchmark designed to assess lifelong learning in large language models (LLMs)<n>Our fact checking evaluation probes models' self-awareness, episodic memory retrieval, and relationship tracking, across both parametric and non-parametric approaches.
arXiv Detail & Related papers (2025-03-30T16:50:57Z) - Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search [57.28671084993782]
Large language models (LLMs) have demonstrated remarkable reasoning capabilities across diverse domains.<n>Recent studies have shown that increasing test-time computation enhances LLMs' reasoning capabilities.<n>We propose a two-stage training paradigm: 1) a small-scale format tuning stage to internalize the COAT reasoning format and 2) a large-scale self-improvement stage leveraging reinforcement learning.
arXiv Detail & Related papers (2025-02-04T17:26:58Z) - LLAVADI: What Matters For Multimodal Large Language Models Distillation [77.73964744238519]
In this work, we do not propose a new efficient model structure or train small-scale MLLMs from scratch.
Our studies involve training strategies, model choices, and distillation algorithms in the knowledge distillation process.
By evaluating different benchmarks and proper strategy, even a 2.7B small-scale model can perform on par with larger models with 7B or 13B parameters.
arXiv Detail & Related papers (2024-07-28T06:10:47Z) - Self-Supervised Skeleton-Based Action Representation Learning: A Benchmark and Beyond [19.074841631219233]
Self-supervised learning (SSL) has been proven effective for skeleton-based action understanding.
In this paper, we conduct a comprehensive survey on self-supervised skeleton-based action representation learning.
arXiv Detail & Related papers (2024-06-05T06:21:54Z) - Vision-Language Meets the Skeleton: Progressively Distillation with Cross-Modal Knowledge for 3D Action Representation Learning [20.34477942813382]
Skeleton-based action representation learning aims to interpret and understand human behaviors by encoding the skeleton sequences.
We introduce a novel skeleton-based training framework based on Cross-modal Contrastive learning.
Our method outperforms the previous methods and achieves state-of-the-art results.
arXiv Detail & Related papers (2024-05-31T03:40:15Z) - ST-LLM: Large Language Models Are Effective Temporal Learners [58.79456373423189]
Large Language Models (LLMs) have showcased impressive capabilities in text comprehension and generation.
How to effectively encode and understand videos in video-based dialogue systems remains to be solved.
We propose ST-LLM, an effective video-LLM baseline with spatial-temporal sequence modeling inside LLM.
arXiv Detail & Related papers (2024-03-30T10:11:26Z) - PALM: Predicting Actions through Language Models [74.10147822693791]
We introduce PALM, an approach that tackles the task of long-term action anticipation.
Our method incorporates an action recognition model to track previous action sequences and a vision-language model to articulate relevant environmental details.
Our experimental results demonstrate that PALM surpasses the state-of-the-art methods in the task of long-term action anticipation.
arXiv Detail & Related papers (2023-11-29T02:17:27Z) - From Language Modeling to Instruction Following: Understanding the Behavior Shift in LLMs after Instruction Tuning [63.63840740526497]
We investigate how instruction tuning adjusts pre-trained models with a focus on intrinsic changes.
The impact of instruction tuning is then studied by comparing the explanations derived from the pre-trained and instruction-tuned models.
Our findings reveal three significant impacts of instruction tuning.
arXiv Detail & Related papers (2023-09-30T21:16:05Z) - What In-Context Learning "Learns" In-Context: Disentangling Task
Recognition and Task Learning [24.395288160951118]
Large language models (LLMs) exploit in-context learning (ICL) to solve tasks with only a few demonstrations.
We characterize two ways through which ICL leverages demonstrations.
We show that models can achieve non-trivial performance with only TR, and TR does not further improve with larger models or more demonstrations.
arXiv Detail & Related papers (2023-05-16T18:05:19Z) - Large Language Models Are Latent Variable Models: Explaining and Finding
Good Demonstrations for In-Context Learning [104.58874584354787]
In recent years, pre-trained large language models (LLMs) have demonstrated remarkable efficiency in achieving an inference-time few-shot learning capability known as in-context learning.
This study aims to examine the in-context learning phenomenon through a Bayesian lens, viewing real-world LLMs as latent variable models.
arXiv Detail & Related papers (2023-01-27T18:59:01Z) - MS$^2$L: Multi-Task Self-Supervised Learning for Skeleton Based Action
Recognition [36.74293548921099]
We integrate motion prediction, jigsaw puzzle recognition, and contrastive learning to learn skeleton features from different aspects.
Our experiments on the NW-UCLA, NTU RGB+D, and PKUMMD datasets show remarkable performance for action recognition.
arXiv Detail & Related papers (2020-10-12T11:09:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.