UniAct: Unified Motion Generation and Action Streaming for Humanoid Robots
- URL: http://arxiv.org/abs/2512.24321v1
- Date: Tue, 30 Dec 2025 16:20:13 GMT
- Title: UniAct: Unified Motion Generation and Action Streaming for Humanoid Robots
- Authors: Nan Jiang, Zimo He, Wanhe Yu, Lexi Pang, Yunhao Li, Hongjie Li, Jieming Cui, Yuhan Li, Yizhou Wang, Yixin Zhu, Siyuan Huang
- Abstract summary: A long-standing objective in humanoid robotics is the realization of versatile agents capable of following diverse multimodal instructions with human-level flexibility. Here we show that UniAct, a two-stage framework integrating a fine-tuned MLLM with a causal streaming pipeline, enables humanoid robots to execute multimodal instructions with sub-500 ms latency. This approach yields a 19% improvement in the success rate of zero-shot tracking of imperfect reference motions.
- Score: 27.794309591475326
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A long-standing objective in humanoid robotics is the realization of versatile agents capable of following diverse multimodal instructions with human-level flexibility. Despite advances in humanoid control, bridging high-level multimodal perception with whole-body execution remains a significant bottleneck. Existing methods often struggle to translate heterogeneous instructions -- such as language, music, and trajectories -- into stable, real-time actions. Here we show that UniAct, a two-stage framework integrating a fine-tuned MLLM with a causal streaming pipeline, enables humanoid robots to execute multimodal instructions with sub-500 ms latency. By unifying inputs through a shared discrete codebook via FSQ, UniAct ensures cross-modal alignment while constraining motions to a physically grounded manifold. This approach yields a 19% improvement in the success rate of zero-shot tracking of imperfect reference motions. We validate UniAct on UniMoCap, our 20-hour humanoid motion benchmark, demonstrating robust generalization across diverse real-world scenarios. Our results mark a critical step toward responsive, general-purpose humanoid assistants capable of seamless interaction through unified perception and control.
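The abstract credits cross-modal alignment to a shared discrete codebook built with FSQ, which in current practice usually denotes finite scalar quantization: each latent dimension is bounded, snapped to a small fixed grid, and gradients pass through the rounding with a straight-through estimator. The sketch below illustrates that generic recipe only; the level choices, tensor shapes, and helper names are assumptions for illustration, not UniAct's actual configuration.

```python
import torch

def fsq_quantize(z: torch.Tensor, levels=(7, 5, 5, 5, 5)) -> torch.Tensor:
    """Bound each latent dimension, snap it to a small fixed grid, and pass gradients
    through the rounding with a straight-through estimator (generic FSQ, odd levels)."""
    L = torch.tensor(levels, dtype=z.dtype, device=z.device)
    half = (L - 1) / 2                      # grid half-width per dimension
    bounded = torch.tanh(z) * half          # values in (-half, half)
    quantized = torch.round(bounded)        # snap to integer grid points
    return bounded + (quantized - bounded).detach()   # straight-through estimator

def fsq_codes(zq: torch.Tensor, levels=(7, 5, 5, 5, 5)) -> torch.Tensor:
    """Flatten quantized latents into integer indices of the implicit codebook."""
    L = torch.tensor(levels, dtype=torch.long, device=zq.device)
    half = (L - 1).to(zq.dtype) / 2
    digits = torch.round(zq + half).long()  # per-dimension level index in [0, L-1]
    bases = torch.cumprod(
        torch.cat([torch.ones(1, dtype=torch.long, device=zq.device), L[:-1]]), dim=0
    )
    return (digits * bases).sum(dim=-1)     # mixed-radix flattening

# Continuous latents from any modality encoder become shared discrete tokens:
z = torch.randn(2, 16, 5)                   # (batch, time, latent dims) -- illustrative
tokens = fsq_codes(fsq_quantize(z))         # integer motion tokens in [0, 7*5*5*5*5 - 1]
```

With per-dimension levels (7, 5, 5, 5, 5) the implicit codebook has 7·5·5·5·5 = 4375 entries without any learned embedding table, which is part of what makes FSQ attractive as a shared vocabulary across modalities.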
Related papers
- ULTRA: Unified Multimodal Control for Autonomous Humanoid Whole-Body Loco-Manipulation [55.467742403416175]
We introduce a physics-driven neural algorithm that translates large-scale motion capture to humanoid embodiments. We learn a unified multimodal controller that supports both dense references and sparse task specifications. Results show that ULTRA generalizes to autonomous, goal-conditioned whole-body loco-manipulation from egocentric perception.
arXiv Detail & Related papers (2026-03-03T18:59:29Z) - CLOT: Closed-Loop Global Motion Tracking for Whole-Body Humanoid Teleoperation [54.7399209456857]
We present CLOT, a real-time whole-body humanoid teleoperation system that achieves closed-loop global motion tracking. CLOT synchronizes operator and robot poses in a closed loop, enabling drift-free human-to-humanoid mimicry over long time horizons. We propose a data-driven randomization strategy that decouples observation trajectories from reward evaluation, enabling smooth and stable global corrections.
arXiv Detail & Related papers (2026-02-13T12:03:13Z) - FRoM-W1: Towards General Humanoid Whole-Body Control with Language Instructions [147.04372611893032]
We present FRoM-W1, an open-source framework designed to achieve general humanoid whole-body motion control using natural language. We extensively evaluate FRoM-W1 on Unitree H1 and G1 robots. Results demonstrate superior performance on the HumanML3D-X benchmark for human whole-body motion generation.
arXiv Detail & Related papers (2026-01-19T07:59:32Z) - MiVLA: Towards Generalizable Vision-Language-Action Model with Human-Robot Mutual Imitation Pre-training [102.850162490626]
We propose MiVLA, a vision-language-action model empowered by human-robot mutual imitation pre-training. We show that MiVLA achieves substantially improved generalization, outperforming state-of-the-art VLAs.
arXiv Detail & Related papers (2025-12-17T12:59:41Z) - Commanding Humanoid by Free-form Language: A Large Language Action Model with Unified Motion Vocabulary [59.98573566227095]
We introduce Humanoid-LLA, a Large Language Action Model that maps expressive language commands to physically executable whole-body actions for humanoid robots. Our approach integrates three core components: a unified motion vocabulary that aligns human and humanoid motion primitives into a shared discrete space; a vocabulary-directed controller distilled from a privileged policy to ensure physical feasibility; and a physics-informed fine-tuning stage using reinforcement learning with dynamics-aware rewards to enhance robustness and stability.
arXiv Detail & Related papers (2025-11-28T08:11:24Z) - From Language to Locomotion: Retargeting-free Humanoid Control via Motion Latent Guidance [55.31807046722006]
Existing language-guided humanoid pipelines are cumbersome and untrustworthy. We present RoboGhost, a retargeting-free framework that conditions humanoid policies on language-grounded motion latents. We show that RoboGhost substantially reduces deployment latency, improves success rates and tracking precision, and produces smooth, semantically aligned humanoid motion.
arXiv Detail & Related papers (2025-10-16T17:57:47Z) - DemoHLM: From One Demonstration to Generalizable Humanoid Loco-Manipulation [29.519071338337685]
We present DemoHLM, a framework for humanoid loco-manipulation on a real humanoid robot from a single demonstration in simulation. A whole-body controller maps whole-body motion commands to joint torques and provides omnidirectional mobility for the humanoid robot. Experiments show a positive correlation between the amount of synthetic data and policy performance.
arXiv Detail & Related papers (2025-10-13T10:49:40Z) - Pixel Motion Diffusion is What We Need for Robot Control [38.925028601732116]
DAWN is a unified diffusion-based framework for language-conditioned robotic manipulation. It bridges high-level motion intent and low-level robot action via a structured pixel motion representation. DAWN achieves state-of-the-art results on the challenging CALVIN benchmark.
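DAWN is summarized only at a high level here, but "diffusion-based" implies the usual iterative denoising loop: start from Gaussian noise and repeatedly apply a learned noise predictor conditioned on the instruction until a clean pixel-motion field remains. The following sketch is a textbook DDPM-style sampling loop under that reading; `eps_model`, the noise schedule, and the conditioning interface are illustrative assumptions rather than DAWN's implementation.

```python
import torch

@torch.no_grad()
def ddpm_sample(eps_model, cond, shape, timesteps=50, device="cpu"):
    """Iteratively denoise a pixel-motion field from Gaussian noise, conditioned on a
    language embedding `cond` (standard DDPM ancestral sampling; schedule is illustrative)."""
    betas = torch.linspace(1e-4, 0.02, timesteps, device=device)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape, device=device)            # start from pure noise
    for t in reversed(range(timesteps)):
        t_batch = torch.full((shape[0],), t, device=device)
        eps = eps_model(x, t_batch, cond)            # predicted noise at step t
        mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise      # ancestral sampling step
    return x                                         # denoised pixel-motion field
```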
arXiv Detail & Related papers (2025-09-26T17:59:59Z) - KungfuBot2: Learning Versatile Motion Skills for Humanoid Whole-Body Control [30.738592041595933]
We present VMS, a unified whole-body controller that enables humanoid robots to learn diverse and dynamic behaviors within a single policy. Our framework integrates a hybrid tracking objective that balances local motion fidelity with global trajectory consistency. We validate VMS extensively in both simulation and real-world experiments, demonstrating accurate imitation of dynamic skills, stable performance over minute-long sequences, and strong generalization to unseen motions.
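The "hybrid tracking objective" is described only in words here; one common way to balance local motion fidelity against global trajectory consistency is a weighted sum of two exponential tracking terms, one on joint poses and one on the root trajectory. The sketch below is that generic form under assumed weights and error terms, not the actual VMS objective.

```python
import numpy as np

def hybrid_tracking_reward(q, q_ref, root_pos, root_pos_ref,
                           w_local=0.7, w_global=0.3,
                           sigma_local=0.5, sigma_global=0.3):
    """Weighted sum of a local joint-pose tracking term and a global root-trajectory term.
    q, q_ref: joint angles; root_pos, root_pos_ref: root positions in the world frame.
    Weights and error scales are illustrative, not the published VMS objective."""
    local_err = np.sum((np.asarray(q) - np.asarray(q_ref)) ** 2)                  # local fidelity
    global_err = np.sum((np.asarray(root_pos) - np.asarray(root_pos_ref)) ** 2)   # global consistency
    r_local = np.exp(-local_err / sigma_local ** 2)
    r_global = np.exp(-global_err / sigma_global ** 2)
    return w_local * r_local + w_global * r_global
```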
arXiv Detail & Related papers (2025-09-20T11:31:14Z) - Learning Multi-Modal Whole-Body Control for Real-World Humanoid Robots [13.229028132036321]
The Masked Humanoid Controller (MHC) supports standing, walking, and mimicry of whole and partial-body motions. MHC imitates partially masked motions from a library of behaviors spanning standing, walking, optimized reference trajectories, re-targeted video clips, and human motion capture data. We demonstrate sim-to-real transfer on the real-world Digit V3 humanoid robot.
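Partial-body mimicry of the kind MHC performs is typically exposed to the policy as a masked target: joints outside the commanded subset are zeroed and the mask itself is appended to the observation, so the controller knows which joints it must track and which it may use freely for balance. A minimal sketch of that convention follows; the names, joint count, and layout are assumptions, not MHC's interface.

```python
import numpy as np

def build_masked_target(ref_pose, body_mask):
    """Zero out reference targets for joints that are not commanded and append the mask,
    so the policy sees which joints it must imitate and which it may use for balance.
    ref_pose: (J,) reference joint targets; body_mask: (J,) with 1 = track, 0 = free."""
    masked_target = np.asarray(ref_pose) * np.asarray(body_mask)
    return np.concatenate([masked_target, body_mask])   # fragment of the policy observation

# e.g. command only the upper body (first 6 joints) while the legs stay free
J = 12
ref = np.random.uniform(-1.0, 1.0, J)
mask = np.array([1.0] * 6 + [0.0] * 6)
obs_fragment = build_masked_target(ref, mask)            # shape (2 * J,)
```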
arXiv Detail & Related papers (2024-07-30T09:10:24Z) - InterControl: Zero-shot Human Interaction Generation by Controlling Every Joint [67.6297384588837]
We introduce a novel controllable motion generation method, InterControl, to encourage synthesized motions to maintain the desired distance between joint pairs.
We demonstrate that the desired distances between joint pairs for human interactions can be generated using an off-the-shelf Large Language Model.
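A joint-pair distance constraint of this kind is naturally expressed as a differentiable penalty on the generated poses, which can then steer a motion generator toward the LLM-specified spacing. The sketch below shows such a penalty in generic form; the function name and tensor layout are assumptions, not InterControl's actual loss.

```python
import torch

def joint_distance_loss(joints, pairs, target_dists):
    """Penalize deviation of selected joint-pair distances from desired values.
    joints: (T, J, 3) generated joint positions; pairs: list of (i, j) joint indices;
    target_dists: desired distance per pair (e.g. produced by an LLM from a text prompt)."""
    loss = joints.new_zeros(())
    for (i, j), d in zip(pairs, target_dists):
        dist = torch.norm(joints[:, i] - joints[:, j], dim=-1)   # (T,) per-frame distance
        loss = loss + ((dist - d) ** 2).mean()
    return loss / max(len(pairs), 1)

# e.g. keep joints 20 and 21 (two hands, illustrative indices) about 0.1 m apart
joints = torch.randn(60, 22, 3, requires_grad=True)
penalty = joint_distance_loss(joints, pairs=[(20, 21)], target_dists=[0.1])
penalty.backward()                                        # gradients can guide generation
```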
arXiv Detail & Related papers (2023-11-27T14:32:33Z)