Zero-Shot Open-Vocabulary Human Motion Grounding with Test-Time Training
- URL: http://arxiv.org/abs/2511.15379v1
- Date: Wed, 19 Nov 2025 12:11:36 GMT
- Title: Zero-Shot Open-Vocabulary Human Motion Grounding with Test-Time Training
- Authors: Yunjiao Zhou, Xinyan Chen, Junlang Qian, Lihua Xie, Jianfei Yang,
- Abstract summary: ZOMG is a framework that segments motion sequences into semantically meaningful sub-actions without requiring any annotations or fine-tuning.<n>ZOMG integrates (1) language semantic partition, which leverages large language models to decompose instructions into ordered sub-action units, and (2) soft masking optimization.<n>Experiments on three motion-language datasets demonstrate state-of-the-art effectiveness and efficiency of motion grounding performance, outperforming prior methods by +8.7% mAP on HumanML3D benchmark.
- Score: 39.7658823121591
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding complex human activities demands the ability to decompose motion into fine-grained, semantic-aligned sub-actions. This motion grounding process is crucial for behavior analysis, embodied AI and virtual reality. Yet, most existing methods rely on dense supervision with predefined action classes, which are infeasible in open-vocabulary, real-world settings. In this paper, we propose ZOMG, a zero-shot, open-vocabulary framework that segments motion sequences into semantically meaningful sub-actions without requiring any annotations or fine-tuning. Technically, ZOMG integrates (1) language semantic partition, which leverages large language models to decompose instructions into ordered sub-action units, and (2) soft masking optimization, which learns instance-specific temporal masks to focus on frames critical to sub-actions, while maintaining intra-segment continuity and enforcing inter-segment separation, all without altering the pretrained encoder. Experiments on three motion-language datasets demonstrate state-of-the-art effectiveness and efficiency of motion grounding performance, outperforming prior methods by +8.7\% mAP on HumanML3D benchmark. Meanwhile, significant improvements also exist in downstream retrieval, establishing a new paradigm for annotation-free motion understanding.
Related papers
- Beyond Global Alignment: Fine-Grained Motion-Language Retrieval via Pyramidal Shapley-Taylor Learning [56.6025512458557]
Motion-language retrieval aims to bridge the semantic gap between natural language and human motion.<n>Existing approaches predominantly focus on aligning entire motion sequences with global textual representations.<n>We propose a novel Pyramidal Shapley-Taylor (PST) learning framework for fine-grained motion-language retrieval.
arXiv Detail & Related papers (2026-01-29T16:00:12Z) - Video-STAR: Reinforcing Open-Vocabulary Action Recognition with Tools [41.993750134878766]
Video-STAR is a framework that harmonizes contextual sub-motion decomposition with tool-augmented reinforcement learning for open-vocabulary action recognition.<n>Unlike prior methods that treat actions as monolithic entities, our approach innovatively decomposes actions into discriminative sub-motions for fine-grained matching.<n>Our method autonomously leverages external tools to prioritize sub-motion patterns without explicit supervision, transmitting from text-centric reasoning to visually grounded inference.
arXiv Detail & Related papers (2025-10-09T17:20:44Z) - Language-Assisted Human Part Motion Learning for Skeleton-Based Temporal Action Segmentation [11.759374280422113]
Skeleton-based Temporal Action involves the dense action classification of variable-length skeleton sequences.
Current approaches apply graph-based networks to extract framewise, whole-body-level motion representations.
We propose a novel method named Language-assisted Human Part Motion Representation (LPL) to extract fine-grained part-level motion representations.
arXiv Detail & Related papers (2024-10-08T20:42:51Z) - MASA: Motion-aware Masked Autoencoder with Semantic Alignment for Sign Language Recognition [94.56755080185732]
We propose a Motion-Aware masked autoencoder with Semantic Alignment (MASA) that integrates rich motion cues and global semantic information.
Our framework can simultaneously learn local motion cues and global semantic features for comprehensive sign language representation.
arXiv Detail & Related papers (2024-05-31T08:06:05Z) - Semantics-aware Motion Retargeting with Vision-Language Models [19.53696208117539]
We present a novel Semantics-aware Motion reTargeting (SMT) method with the advantage of vision-language models to extract and maintain meaningful motion semantics.
We utilize a differentiable module to render 3D motions and the high-level motion semantics are incorporated into the motion process by feeding the vision-language model and aligning the extracted semantic embeddings.
To ensure the preservation of fine-grained motion details and high-level semantics, we adopt two-stage pipeline consisting of skeleton-aware pre-training and fine-tuning with semantics and geometry constraints.
arXiv Detail & Related papers (2023-12-04T15:23:49Z) - SemanticBoost: Elevating Motion Generation with Augmented Textual Cues [73.83255805408126]
Our framework comprises a Semantic Enhancement module and a Context-Attuned Motion Denoiser (CAMD)
The CAMD approach provides an all-encompassing solution for generating high-quality, semantically consistent motion sequences.
Our experimental results demonstrate that SemanticBoost, as a diffusion-based method, outperforms auto-regressive-based techniques.
arXiv Detail & Related papers (2023-10-31T09:58:11Z) - DiverseMotion: Towards Diverse Human Motion Generation via Discrete
Diffusion [70.33381660741861]
We present DiverseMotion, a new approach for synthesizing high-quality human motions conditioned on textual descriptions.
We show that our DiverseMotion achieves the state-of-the-art motion quality and competitive motion diversity.
arXiv Detail & Related papers (2023-09-04T05:43:48Z) - Multi-modal Prompting for Low-Shot Temporal Action Localization [95.19505874963751]
We consider the problem of temporal action localization under low-shot (zero-shot & few-shot) scenario.
We adopt a Transformer-based two-stage action localization architecture with class-agnostic action proposal, followed by open-vocabulary classification.
arXiv Detail & Related papers (2023-03-21T10:40:13Z) - Towards Tokenized Human Dynamics Representation [41.75534387530019]
We study how to segment and cluster videos into recurring temporal patterns in a self-supervised way.
We evaluate the frame-wise representation learning step by Kendall's Tau and the lexicon building step by normalized mutual information and language entropy.
On the AIST++ and PKU-MMD datasets, actons bring significant performance improvements compared to several baselines.
arXiv Detail & Related papers (2021-11-22T18:59:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.