Commanding Humanoid by Free-form Language: A Large Language Action Model with Unified Motion Vocabulary
- URL: http://arxiv.org/abs/2511.22963v1
- Date: Fri, 28 Nov 2025 08:11:24 GMT
- Title: Commanding Humanoid by Free-form Language: A Large Language Action Model with Unified Motion Vocabulary
- Authors: Zhirui Liu, Kaiyang Ji, Ke Yang, Jingyi Yu, Ye Shi, Jingya Wang
- Abstract summary: We introduce Humanoid-LLA, a Large Language Action Model that maps expressive language commands to physically executable whole-body actions for humanoid robots. Our approach integrates three core components: a unified motion vocabulary that aligns human and humanoid motion primitives into a shared discrete space; a vocabulary-directed controller distilled from a privileged policy to ensure physical feasibility; and a physics-informed fine-tuning stage using reinforcement learning with dynamics-aware rewards to enhance robustness and stability.
- Score: 59.98573566227095
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Enabling humanoid robots to follow free-form language commands is critical for seamless human-robot interaction, collaborative task execution, and general-purpose embodied intelligence. While recent advances have improved low-level humanoid locomotion and robot manipulation, language-conditioned whole-body control remains a significant challenge. Existing methods are often limited to simple instructions and sacrifice either motion diversity or physical plausibility. To address this, we introduce Humanoid-LLA, a Large Language Action Model that maps expressive language commands to physically executable whole-body actions for humanoid robots. Our approach integrates three core components: a unified motion vocabulary that aligns human and humanoid motion primitives into a shared discrete space; a vocabulary-directed controller distilled from a privileged policy to ensure physical feasibility; and a physics-informed fine-tuning stage using reinforcement learning with dynamics-aware rewards to enhance robustness and stability. Extensive evaluations in simulation and on a real-world Unitree G1 humanoid show that Humanoid-LLA delivers strong language generalization while maintaining high physical fidelity, outperforming existing language-conditioned controllers in motion naturalness, stability, and execution success rate.
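The abstract names three components but no implementation details. A minimal sketch of the general pattern they suggest follows: a shared discrete motion codebook for quantizing both human and humanoid motion features, plus a dynamics-aware reward for the RL fine-tuning stage. All sizes, names, and reward terms below are assumptions for illustration, not the paper's actual design.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the paper does not report its codebook configuration.
CODEBOOK_SIZE, CODE_DIM = 512, 64
codebook = rng.standard_normal((CODEBOOK_SIZE, CODE_DIM))

def quantize(motion_feature: np.ndarray) -> int:
    """Map a continuous motion feature (from a human clip or a humanoid
    rollout) to its nearest codebook entry. Quantizing both sources
    against the SAME codebook is what makes the vocabulary 'unified'."""
    dists = np.linalg.norm(codebook - motion_feature, axis=1)
    return int(np.argmin(dists))

def dynamics_aware_reward(tracking_err: float,
                          joint_torques: np.ndarray,
                          base_tilt: float) -> float:
    """Illustrative dynamics-aware reward for the RL fine-tuning stage:
    reward motion tracking, penalize torque effort and loss of balance.
    The terms and weights are guesses, not taken from the paper."""
    return (np.exp(-tracking_err)
            - 1e-4 * float(np.sum(np.square(joint_torques)))
            - 0.5 * abs(base_tilt))
```

Aligning both embodiments into one token space would let the language model treat motion generation as next-token prediction, while the distilled controller only ever has to realize tokens it was trained to execute.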
Related papers
- TextOp: Real-time Interactive Text-Driven Humanoid Robot Motion Generation and Control [62.93681680333618]
TextOp is a real-time text-driven humanoid motion generation and control framework. It supports streaming language commands and on-the-fly instruction modification during execution. By bridging interactive motion generation with robust whole-body control, TextOp unlocks free-form intent expression.
arXiv Detail & Related papers (2026-02-07T08:42:11Z)
- FRoM-W1: Towards General Humanoid Whole-Body Control with Language Instructions [147.04372611893032]
We present FRoM-W1, an open-source framework designed to achieve general humanoid whole-body motion control using natural language. We extensively evaluate FRoM-W1 on Unitree H1 and G1 robots. Results demonstrate superior performance on the HumanML3D-X benchmark for human whole-body motion generation.
arXiv Detail & Related papers (2026-01-19T07:59:32Z)
- MiVLA: Towards Generalizable Vision-Language-Action Model with Human-Robot Mutual Imitation Pre-training [102.850162490626]
We propose MiVLA, a vision-language-action model empowered by human-robot mutual imitation pre-training. We show that MiVLA achieves strongly improved generalization, outperforming state-of-the-art VLAs.
arXiv Detail & Related papers (2025-12-17T12:59:41Z)
- SENTINEL: A Fully End-to-End Language-Action Model for Humanoid Whole Body Control [31.180948030479797]
We present a fully end-to-end language-action model for humanoid whole-body control. We construct a large-scale dataset by tracking human motions in simulation using a pretrained whole-body controller. The model directly maps language commands and proprioceptive inputs to low-level actions without any intermediate representation.
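The SENTINEL summary implies a single network from language and proprioception straight to joint-level actions. A toy sketch of that interface, with the network shape and all dimensions assumed rather than taken from SENTINEL:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: a sentence embedding, proprioceptive state,
# and low-level joint targets. SENTINEL's actual sizes are not given here.
TEXT_DIM, PROPRIO_DIM, ACT_DIM, HIDDEN = 768, 45, 29, 256
W1 = rng.standard_normal((TEXT_DIM + PROPRIO_DIM, HIDDEN)) * 0.01
W2 = rng.standard_normal((HIDDEN, ACT_DIM)) * 0.01

def act(text_emb: np.ndarray, proprio: np.ndarray) -> np.ndarray:
    """One control step: concatenate language and proprioception and
    regress joint targets directly -- no motion tokens, no retargeting,
    no intermediate trajectory in between."""
    x = np.concatenate([text_emb, proprio])
    return np.tanh(np.tanh(x @ W1) @ W2)
```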
arXiv Detail & Related papers (2025-11-24T15:48:59Z)
- From Language to Locomotion: Retargeting-free Humanoid Control via Motion Latent Guidance [55.31807046722006]
Existing language-guided humanoid pipelines are cumbersome and untrustworthy. We present RoboGhost, a retargeting-free framework that conditions humanoid policies on language-grounded motion latents. We show that RoboGhost substantially reduces deployment latency, improves success rates and tracking precision, and produces smooth, semantically aligned humanoid motion.
arXiv Detail & Related papers (2025-10-16T17:57:47Z)
- GR00T N1: An Open Foundation Model for Generalist Humanoid Robots [133.23509142762356]
General-purpose robots need a versatile body and an intelligent mind. Recent advancements in humanoid robots have shown great promise as a hardware platform for building generalist autonomy. We introduce GR00T N1, an open foundation model for humanoid robots.
arXiv Detail & Related papers (2025-03-18T21:06:21Z)
- EMOTION: Expressive Motion Sequence Generation for Humanoid Robots with In-Context Learning [10.266351600604612]
This paper introduces a framework, called EMOTION, for generating expressive motion sequences in humanoid robots.
We conduct online user studies comparing the naturalness and understandability of the motions generated by EMOTION and its human-feedback version, EMOTION++.
arXiv Detail & Related papers (2024-10-30T17:22:45Z)
- "No, to the Right" -- Online Language Corrections for Robotic Manipulation via Shared Autonomy [70.45420918526926]
We present LILAC, a framework for incorporating and adapting to natural language corrections online during execution.
Instead of discrete turn-taking between a human and robot, LILAC splits agency between the human and robot.
We show that our corrections-aware approach obtains higher task completion rates, and is subjectively preferred by users.
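The LILAC summary describes splitting agency rather than turn-taking. One common way to realize that idea is to blend a language-derived correction into the robot's nominal action online; the blend rule below is an illustrative assumption, not LILAC's actual method:

```python
import numpy as np

def blended_action(robot_action: np.ndarray,
                   correction_dir: np.ndarray,
                   confidence: float) -> np.ndarray:
    """Blend the robot's nominal action with a correction direction
    derived from an online utterance like 'no, to the right'.
    `confidence` in [0, 1] sets how strongly the human's correction
    overrides the plan, so neither side has to fully take turns."""
    correction_dir = correction_dir / (np.linalg.norm(correction_dir) + 1e-8)
    return (1.0 - confidence) * robot_action + confidence * correction_dir
```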
arXiv Detail & Related papers (2023-01-06T15:03:27Z)
- LaTTe: Language Trajectory TransformEr [33.7939079214046]
This work proposes a flexible language-based framework to modify generic 3D robotic trajectories.
We employ an auto-regressive transformer to map natural language inputs and contextual images into changes in 3D trajectories.
We show through simulations and real-life experiments that the model can successfully follow human intent.
arXiv Detail & Related papers (2022-08-04T22:43:21Z)
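LaTTe's summary describes an auto-regressive transformer that maps language and contextual images to changes in a 3D trajectory. A minimal sketch of the outer decoding loop, with the model stubbed out (the function names and delta formulation are assumptions):

```python
import numpy as np

def predict_delta(emitted: np.ndarray,
                  text_emb: np.ndarray,
                  image_emb: np.ndarray) -> np.ndarray:
    """Stand-in for the auto-regressive model: in LaTTe-style decoding
    this would be a transformer attending over the language embedding,
    image context, and previously emitted waypoints. It returns a zero
    change here so the sketch stays runnable."""
    return np.zeros(3)

def modify_trajectory(trajectory: np.ndarray,
                      text_emb: np.ndarray,
                      image_emb: np.ndarray) -> np.ndarray:
    """Emit the modified 3D trajectory one waypoint at a time, as a
    language-conditioned delta applied to the original path."""
    out: list[np.ndarray] = []
    for waypoint in trajectory:
        delta = predict_delta(np.array(out), text_emb, image_emb)
        out.append(waypoint + delta)
    return np.array(out)
```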