GeoMotionGPT: Geometry-Aligned Motion Understanding with Large Language Models
- URL: http://arxiv.org/abs/2601.07632v2
- Date: Wed, 14 Jan 2026 02:19:38 GMT
- Title: GeoMotionGPT: Geometry-Aligned Motion Understanding with Large Language Models
- Authors: Zhankai Ye, Bofan Li, Yukai Jin, Shuoqiu Li, Wei Wang, Yanfu Zhang, Shangqian Gao, Xin Liu
- Abstract summary: We argue that alignment is most effective when both modalities share a unified geometric basis. We employ a decoder-only quantizer with Gumbel-Softmax for differentiable training and balanced codebook usage. Our framework achieves a 20% performance improvement over current state-of-the-art methods.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Discrete motion tokenization has recently enabled Large Language Models (LLMs) to serve as versatile backbones for motion understanding and motion-language reasoning. However, existing pipelines typically decouple motion quantization from semantic embedding learning, linking them solely via token IDs. This approach fails to effectively align the intrinsic geometry of the motion space with the embedding space, thereby hindering the LLM's capacity for nuanced motion reasoning. We argue that alignment is most effective when both modalities share a unified geometric basis. Therefore, instead of forcing the LLM to reconstruct the complex geometry among motion tokens from scratch, we present a novel framework that explicitly enforces orthogonality on both the motion codebook and the LLM embedding space, ensuring that their relational structures naturally mirror each other. Specifically, we employ a decoder-only quantizer with Gumbel-Softmax for differentiable training and balanced codebook usage. To bridge the modalities, we use a sparse projection that maps motion codes into the LLM embedding space while preserving orthogonality. Finally, a two-stage orthonormal regularization schedule enforces soft constraints during tokenizer training and LLM fine-tuning to maintain geometric alignment without hindering semantic adaptation. Extensive experiments on HumanML3D demonstrate that our framework achieves a 20% performance improvement over current state-of-the-art methods, validating that a unified geometric basis effectively empowers the LLM for nuanced motion reasoning.
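The abstract names three concrete mechanisms: Gumbel-Softmax code selection for differentiable quantization, a soft orthonormality constraint on the motion codebook (mirrored on the LLM side), and a sparse projection from motion codes into the LLM embedding space. The PyTorch sketch below shows how such pieces could fit together; every name, dimension, and coefficient here (`GumbelQuantizer`, `SparseMotionProjection`, `num_codes=512`, the 0.1/0.01 loss weights) is an illustrative assumption, not the authors' released implementation.

```python
# Minimal sketch of the three components named in the abstract:
# (1) differentiable Gumbel-Softmax code selection,
# (2) a soft orthonormality penalty on the motion codebook,
# (3) a sparse projection of motion codes into the LLM embedding space.
# All names, dimensions, and coefficients are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GumbelQuantizer(nn.Module):
    """Quantizer that selects codebook entries from encoder logits."""

    def __init__(self, num_codes: int = 512, code_dim: int = 256, tau: float = 1.0):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(num_codes, code_dim))
        self.tau = tau

    def forward(self, logits: torch.Tensor) -> torch.Tensor:
        # logits: (batch, time, num_codes) produced by a motion encoder.
        # hard=True emits one-hot selections in the forward pass while
        # gradients flow through the soft relaxation, keeping training
        # end-to-end differentiable and encouraging balanced code usage.
        one_hot = F.gumbel_softmax(logits, tau=self.tau, hard=True)
        return one_hot @ self.codebook  # (batch, time, code_dim)


def orthonormality_penalty(weight: torch.Tensor) -> torch.Tensor:
    # Soft constraint pushing W W^T toward the identity so that row
    # vectors stay near-orthonormal. Applying the same penalty to the
    # codebook and to the motion rows of the LLM embedding table is one
    # way the two spaces can be made to share a geometric basis.
    gram = weight @ weight.t()
    eye = torch.eye(gram.size(0), device=weight.device)
    return ((gram - eye) ** 2).mean()


class SparseMotionProjection(nn.Module):
    """Maps motion codes into the LLM embedding space."""

    def __init__(self, code_dim: int = 256, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(code_dim, llm_dim, bias=False)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.proj(z)

    def sparsity_penalty(self) -> torch.Tensor:
        # An L1 penalty is one simple way to encourage a sparse map.
        return self.proj.weight.abs().mean()
```

A hypothetical loss assembly for the tokenizer-training stage of the two-stage schedule might then read:

```python
# Placeholder shapes and weights; the 0.1 / 0.01 coefficients are guesses.
quantizer = GumbelQuantizer()
projection = SparseMotionProjection()
logits = torch.randn(8, 64, 512)   # fake motion-encoder output
target = torch.randn(8, 64, 256)   # fake reconstruction target
recon = F.mse_loss(quantizer(logits), target)
loss = (recon
        + 0.1 * orthonormality_penalty(quantizer.codebook)
        + 0.01 * projection.sparsity_penalty())
loss.backward()
```

The design point this illustrates is that the same orthonormality penalty can regularize both the codebook and the LLM-side embeddings, which is what would let the two spaces mirror each other's relational structure rather than forcing the LLM to rediscover the motion geometry from token IDs alone.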
Related papers
- FreeAct: Freeing Activations for LLM Quantization [89.97086263978058]
Quantization is pivotal for mitigating the significant memory and computational overhead of Large Language Models. FreeAct is a novel quantization framework that relaxes the static one-to-one constraint to accommodate dynamic activation disparities. Experiments across dLLMs and MLLMs demonstrate that FreeAct significantly outperforms baselines, with up to a 5.3% performance improvement.
arXiv Detail & Related papers (2026-03-02T12:02:17Z) - Thinking with Images as Continuous Actions: Numerical Visual Chain-of-Thought [55.65577137924979]
We propose a framework that enables MLLMs to reason over images using continuous numerical coordinates. NV-CoT expands the MLLM action space from discrete vocabulary tokens to a continuous Euclidean space. Experiments on three benchmarks demonstrate that NV-CoT significantly improves localization precision and final answer accuracy.
arXiv Detail & Related papers (2026-02-27T12:04:07Z) - Copy-Transform-Paste: Zero-Shot Object-Object Alignment Guided by Vision-Language and Geometric Constraints [12.704390013489054]
We study zero-shot 3D alignment of two given meshes, using a text prompt describing their relation. We optimize the relative pose at test time, updating translation, rotation, and isotropic scale with CLIP-driven gradients. Our method outperforms all alternatives, yielding semantically faithful and physically plausible alignments.
arXiv Detail & Related papers (2026-01-20T18:12:55Z) - SpatialGeo: Boosting Spatial Reasoning in Multimodal LLMs via Geometry-Semantics Fusion [23.86761713752287]
Multimodal large language models (MLLMs) have achieved significant progress in image and language tasks. Most MLLMs suffer from limited spatial reasoning ability to interpret and infer spatial arrangements in three-dimensional space. We propose a novel vision encoder based on hierarchical fusion of geometry and semantics features, generating spatial-aware visual embeddings.
arXiv Detail & Related papers (2025-11-21T15:24:33Z) - FlexAC: Towards Flexible Control of Associative Reasoning in Multimodal Large Language Models [80.6268239673988]
Multimodal large language models (MLLMs) face an inherent trade-off between faithfulness and creativity. Existing methods lack the flexibility to modulate this reasoning strength. We propose equipping MLLMs with mechanisms that enable flexible control over associative reasoning.
arXiv Detail & Related papers (2025-10-13T09:22:12Z) - ReaLM: Residual Quantization Bridging Knowledge Graph Embeddings and Large Language Models [18.720486146234077]
Large Language Models (LLMs) have emerged as a powerful paradigm for Knowledge Graph Completion (KGC). We propose ReaLM, a novel and effective framework that bridges the gap between KG embeddings and LLM tokenization. We show that ReaLM achieves state-of-the-art performance, confirming its effectiveness in aligning structured knowledge with large-scale language models.
arXiv Detail & Related papers (2025-10-10T04:36:13Z) - OmniBridge: Unified Multimodal Understanding, Generation, and Retrieval via Latent Space Alignment [79.98946571424607]
We present OmniBridge, a unified framework that supports vision-language understanding, generation, and retrieval within a unified architecture. To address the challenge of task interference, we propose a two-stage decoupled training strategy. Experiments demonstrate that OmniBridge achieves competitive or state-of-the-art performance in all three tasks.
arXiv Detail & Related papers (2025-09-23T13:57:55Z) - ReSem3D: Refinable 3D Spatial Constraints via Fine-Grained Semantic Grounding for Generalizable Robotic Manipulation [12.059517583878756]
We propose ReSem3D, a unified manipulation framework for semantically diverse environments. We show that ReSem3D performs diverse manipulation tasks under zero-shot conditions, exhibiting strong adaptability and generalization.
arXiv Detail & Related papers (2025-07-24T10:07:31Z) - Navigating Motion Agents in Dynamic and Cluttered Environments through LLM Reasoning [69.5875073447454]
This paper advances motion agents empowered by large language models (LLMs) toward autonomous navigation in dynamic and cluttered environments. Our training-free framework supports multi-agent coordination, closed-loop replanning, and dynamic obstacle avoidance without retraining or fine-tuning.
arXiv Detail & Related papers (2025-03-10T13:39:09Z) - CoMMIT: Coordinated Multimodal Instruction Tuning [90.1532838391285]
Multimodal large language models (MLLMs) generally involve cooperative learning between a backbone LLM and a feature encoder of non-text input modalities. In this paper, we analyze MLLM instruction tuning from both theoretical and empirical perspectives. We propose a Multimodal Balance Coefficient that enables quantitative measurement of the balance of learning.
arXiv Detail & Related papers (2024-07-29T23:18:55Z)