Beyond Global Alignment: Fine-Grained Motion-Language Retrieval via Pyramidal Shapley-Taylor Learning
- URL: http://arxiv.org/abs/2601.21904v3
- Date: Wed, 04 Feb 2026 13:24:36 GMT
- Title: Beyond Global Alignment: Fine-Grained Motion-Language Retrieval via Pyramidal Shapley-Taylor Learning
- Authors: Hanmo Chen, Guangtao Lyu, Chenghao Xu, Jiexi Yan, Xu Yang, Cheng Deng
- Abstract summary: Motion-language retrieval aims to bridge the semantic gap between natural language and human motion. Existing approaches predominantly focus on aligning entire motion sequences with global textual representations. We propose a novel Pyramidal Shapley-Taylor (PST) learning framework for fine-grained motion-language retrieval.
- Score: 56.6025512458557
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As a foundational task in human-centric cross-modal intelligence, motion-language retrieval aims to bridge the semantic gap between natural language and human motion, enabling intuitive motion analysis. Existing approaches, however, predominantly focus on aligning entire motion sequences with global textual representations. This global-centric paradigm overlooks fine-grained interactions between local motion segments, individual body joints, and text tokens, inevitably leading to suboptimal retrieval performance. To address this limitation, we draw inspiration from the pyramidal process of human motion perception (from joint dynamics to segment coherence, and finally to holistic comprehension) and propose a novel Pyramidal Shapley-Taylor (PST) learning framework for fine-grained motion-language retrieval. Specifically, the framework decomposes human motion into temporal segments and spatial body joints, and learns cross-modal correspondences through progressive joint-wise and segment-wise alignment in a pyramidal fashion, effectively capturing both local semantic details and hierarchical structural relationships. Extensive experiments on multiple public benchmark datasets demonstrate that our approach significantly outperforms state-of-the-art methods, achieving precise alignment between motion segments and body joints and their corresponding text tokens. The code of this work will be released upon acceptance.
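The abstract names the Shapley-Taylor index but gives no formulation, and the authors' code is unreleased. As a rough illustration of the quantity involved, the sketch below Monte-Carlo-estimates a pairwise (second-order) Shapley-Taylor interaction index (Sundararajan et al., 2020) between one motion part (a segment or joint) and one text token. Everything here is an assumption for illustration: the coalition value `coalition_value` (mean-pooled cosine similarity), the feature inputs, and all names are hypothetical, not the paper's method.

```python
import random
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def coalition_value(players, motion_feats, text_feats):
    """v(S): cosine similarity between the mean-pooled motion features
    and the mean-pooled token features of the active players.
    Indices 0..len(motion_feats)-1 are motion parts; the rest are tokens.
    Empty or one-sided coalitions carry no cross-modal signal."""
    m_idx = [p for p in players if p < len(motion_feats)]
    t_idx = [p - len(motion_feats) for p in players if p >= len(motion_feats)]
    if not m_idx or not t_idx:
        return 0.0
    return cosine(motion_feats[m_idx].mean(0), text_feats[t_idx].mean(0))

def shapley_taylor_pair(i, j, motion_feats, text_feats, n_samples=200, seed=0):
    """Pairwise Shapley-Taylor index via permutation sampling: average the
    discrete derivative v(T+{i,j}) - v(T+{i}) - v(T+{j}) + v(T), where T is
    the set of players preceding the first of {i, j} in a uniformly random
    permutation (this reproduces the (k/n)/C(n-1,|T|) weighting)."""
    rng = random.Random(seed)
    n = len(motion_feats) + len(text_feats)
    order = list(range(n))
    v = lambda S: coalition_value(S, motion_feats, text_feats)
    total = 0.0
    for _ in range(n_samples):
        rng.shuffle(order)
        first = min(order.index(i), order.index(j))
        T = set(order[:first])  # players preceding both i and j
        total += v(T | {i, j}) - v(T | {i}) - v(T | {j}) + v(T)
    return total / n_samples

# Toy usage: 4 motion parts and 3 text tokens with random 64-d features.
gen = np.random.default_rng(0)
motion, text = gen.standard_normal((4, 64)), gen.standard_normal((3, 64))
print(shapley_taylor_pair(0, 4, motion, text))  # part 0 vs. token 0
```

A positive index suggests the part and the token contribute more to the similarity together than separately, which is the kind of fine-grained evidence a segment- or joint-level alignment loss can exploit.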
Related papers
- 3DGesPolicy: Phoneme-Aware Holistic Co-Speech Gesture Generation Based on Action Control [3.606473077857744]
3DGesPolicy is an action-based framework that reformulates holistic gesture generation as a continuous trajectory control problem. By modeling frame-to-frame variations as unified holistic actions, our method effectively learns inter-frame holistic gesture motion patterns. To further bridge the gap in expressive alignment, we propose a Gesture-Audio-Phoneme (GAP) fusion module.
arXiv Detail & Related papers (2026-01-26T12:57:36Z) - PALUM: Part-based Attention Learning for Unified Motion Retargeting [53.17113525688095]
Motion retargeting between characters with different skeleton structures is a fundamental challenge in computer animation. We present a novel approach that learns common motion representations across diverse skeleton topologies. Experiments demonstrate superior performance in handling diverse skeletal structures while maintaining motion realism and semantic fidelity.
arXiv Detail & Related papers (2026-01-12T07:29:44Z) - EigenActor: Variant Body-Object Interaction Generation Evolved from Invariant Action Basis Reasoning [66.68366281305977]
This paper explores a cross-modality synthesis task that infers 3D human-object interactions (HOIs) from a given text-based instruction. Existing text-to-HOI synthesis methods mainly deploy a direct mapping from texts to object-specific 3D body motions. We propose a novel body pose generation strategy for the text-to-HOI task: infer the object-agnostic canonical body action first and then enrich it with object-specific interaction styles.
arXiv Detail & Related papers (2025-03-01T07:15:10Z) - Hierarchical Banzhaf Interaction for General Video-Language Representation Learning [60.44337740854767]
Multimodal representation learning plays an important role in the artificial intelligence domain. We introduce a new approach that models video-text pairs as game players using multivariate cooperative game theory (a minimal Banzhaf sketch follows this list). We extend our original structure into a flexible encoder-decoder framework, enabling the model to adapt to various downstream tasks.
arXiv Detail & Related papers (2024-12-30T14:09:15Z) - SemTalk: Holistic Co-speech Motion Generation with Frame-level Semantic Emphasis [19.764460501254607]
Good co-speech motion generation cannot be achieved without carefully integrating common rhythmic motion and rare yet essential semantic motion. We propose SemTalk for holistic co-speech motion generation with frame-level semantic emphasis.
arXiv Detail & Related papers (2024-12-21T10:16:07Z) - KinMo: Kinematic-aware Human Motion Understanding and Generation [6.962697597686156]
Current human motion synthesis frameworks rely on global action descriptions. A single coarse description, such as "run", fails to capture details such as variations in speed, limb positioning, and kinematic dynamics. We introduce KinMo, a unified framework built on a hierarchical describable motion representation.
arXiv Detail & Related papers (2024-11-23T06:50:11Z) - TextIM: Part-aware Interactive Motion Synthesis from Text [25.91739105467082]
TextIM is a novel framework for synthesizing TEXT-driven human Interactive Motions.
Our approach leverages large language models, functioning as a human brain, to identify interacting human body parts.
For training and evaluation, we carefully selected and re-labeled interactive motions from HUMANML3D to develop a specialized dataset.
arXiv Detail & Related papers (2024-08-06T17:08:05Z) - Joint-Dataset Learning and Cross-Consistent Regularization for Text-to-Motion Retrieval [4.454835029368504]
We focus on the recently introduced text-motion retrieval task, which aims to search for motion sequences that are most relevant to a natural-language motion description.
Despite recent efforts to explore these promising avenues, a primary challenge remains the insufficient data available to train robust text-motion models.
We propose to investigate joint-dataset learning - where we train on multiple text-motion datasets simultaneously.
We also introduce a transformer-based motion encoder, called MoT++, which employs spatio-temporal attention to process sequences of skeleton data.
arXiv Detail & Related papers (2024-07-02T09:43:47Z) - SemanticBoost: Elevating Motion Generation with Augmented Textual Cues [73.83255805408126]
Our framework comprises a Semantic Enhancement module and a Context-Attuned Motion Denoiser (CAMD).
The CAMD approach provides an all-encompassing solution for generating high-quality, semantically consistent motion sequences.
Our experimental results demonstrate that SemanticBoost, as a diffusion-based method, outperforms auto-regressive-based techniques.
arXiv Detail & Related papers (2023-10-31T09:58:11Z) - Fine-Grained Semantically Aligned Vision-Language Pre-Training [151.7372197904064]
Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks.
Existing methods mainly model the cross-modal alignment by the similarity of the global representations of images and texts.
We introduce LOUPE, a fine-grained semantically aLigned visiOn-langUage PrE-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions.
arXiv Detail & Related papers (2022-08-04T07:51:48Z)
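The Hierarchical Banzhaf Interaction entry above rests on a closely related but distinct game-theoretic quantity: the Banzhaf interaction averages the same discrete derivative as the Shapley-Taylor index, but over coalitions drawn by including each remaining player independently with probability 1/2 rather than over permutation prefixes. A minimal sketch under the same caveats (the value function and all names are assumptions, not the paper's code):

```python
import random

def banzhaf_interaction(i, j, n, value_fn, n_samples=500, seed=0):
    """Monte Carlo Banzhaf interaction index for players i and j:
        E_T[ v(T+{i,j}) - v(T+{i}) - v(T+{j}) + v(T) ],
    where T contains each other player independently with prob. 1/2.
    `value_fn` maps a set of player indices to a coalition payoff,
    e.g. a clip/token cross-modal similarity for video-text alignment."""
    rng = random.Random(seed)
    others = [p for p in range(n) if p not in (i, j)]
    total = 0.0
    for _ in range(n_samples):
        T = {p for p in others if rng.random() < 0.5}
        total += (value_fn(T | {i, j}) - value_fn(T | {i})
                  - value_fn(T | {j}) + value_fn(T))
    return total / n_samples
```

Reusing the toy value function from the earlier sketch, `banzhaf_interaction(0, 4, 7, lambda S: coalition_value(S, motion, text))` scores the same part-token pair under the Banzhaf coalition distribution.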