UniAct: Unified Motion Generation and Action Streaming for Humanoid Robots
- URL: http://arxiv.org/abs/2512.24321v1
- Date: Tue, 30 Dec 2025 16:20:13 GMT
- Title: UniAct: Unified Motion Generation and Action Streaming for Humanoid Robots
- Authors: Nan Jiang, Zimo He, Wanhe Yu, Lexi Pang, Yunhao Li, Hongjie Li, Jieming Cui, Yuhan Li, Yizhou Wang, Yixin Zhu, Siyuan Huang
- Abstract summary: A long-standing objective in humanoid robotics is the realization of versatile agents capable of following diverse multimodal instructions with human-level flexibility. Here we show that UniAct, a two-stage framework integrating a fine-tuned MLLM with a causal streaming pipeline, enables humanoid robots to execute multimodal instructions with sub-500 ms latency. This approach yields a 19% improvement in the success rate of zero-shot tracking of imperfect reference motions.
- Score: 27.794309591475326
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A long-standing objective in humanoid robotics is the realization of versatile agents capable of following diverse multimodal instructions with human-level flexibility. Despite advances in humanoid control, bridging high-level multimodal perception with whole-body execution remains a significant bottleneck. Existing methods often struggle to translate heterogeneous instructions -- such as language, music, and trajectories -- into stable, real-time actions. Here we show that UniAct, a two-stage framework integrating a fine-tuned MLLM with a causal streaming pipeline, enables humanoid robots to execute multimodal instructions with sub-500 ms latency. By unifying inputs through a shared discrete codebook via FSQ, UniAct ensures cross-modal alignment while constraining motions to a physically grounded manifold. This approach yields a 19% improvement in the success rate of zero-shot tracking of imperfect reference motions. We validate UniAct on UniMoCap, our 20-hour humanoid motion benchmark, demonstrating robust generalization across diverse real-world scenarios. Our results mark a critical step toward responsive, general-purpose humanoid assistants capable of seamless interaction through unified perception and control.
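The abstract credits cross-modal alignment to a shared discrete codebook built with FSQ, which in current practice usually denotes finite scalar quantization: each latent dimension is bounded, snapped to a small fixed grid, and gradients pass through the rounding with a straight-through estimator. The sketch below illustrates that generic recipe only; the level choices, tensor shapes, and helper names are assumptions for illustration, not UniAct's actual configuration.

```python
import torch

def fsq_quantize(z: torch.Tensor, levels=(7, 5, 5, 5, 5)) -> torch.Tensor:
    """Bound each latent dimension, snap it to a small fixed grid, and pass gradients
    through the rounding with a straight-through estimator (generic FSQ, odd levels)."""
    L = torch.tensor(levels, dtype=z.dtype, device=z.device)
    half = (L - 1) / 2                      # grid half-width per dimension
    bounded = torch.tanh(z) * half          # values in (-half, half)
    quantized = torch.round(bounded)        # snap to integer grid points
    return bounded + (quantized - bounded).detach()   # straight-through estimator

def fsq_codes(zq: torch.Tensor, levels=(7, 5, 5, 5, 5)) -> torch.Tensor:
    """Flatten quantized latents into integer indices of the implicit codebook."""
    L = torch.tensor(levels, dtype=torch.long, device=zq.device)
    half = (L - 1).to(zq.dtype) / 2
    digits = torch.round(zq + half).long()  # per-dimension level index in [0, L-1]
    bases = torch.cumprod(
        torch.cat([torch.ones(1, dtype=torch.long, device=zq.device), L[:-1]]), dim=0
    )
    return (digits * bases).sum(dim=-1)     # mixed-radix flattening

# Continuous latents from any modality encoder become shared discrete tokens:
z = torch.randn(2, 16, 5)                   # (batch, time, latent dims) -- illustrative
tokens = fsq_codes(fsq_quantize(z))         # integer motion tokens in [0, 7*5*5*5*5 - 1]
```

With per-dimension levels (7, 5, 5, 5, 5) the implicit codebook has 7·5·5·5·5 = 4375 entries without any learned embedding table, which is part of what makes FSQ attractive as a shared vocabulary across modalities.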
Related papers
- ULTRA: Unified Multimodal Control for Autonomous Humanoid Whole-Body Loco-Manipulation [55.467742403416175]
We introduce a physics-driven neural algorithm that translates large-scale motion capture to humanoid embodiments. We learn a unified multimodal controller that supports both dense references and sparse task specifications. Results show that ULTRA generalizes to autonomous, goal-conditioned whole-body loco-manipulation from egocentric perception.
arXiv Detail & Related papers (2026-03-03T18:59:29Z) - CLOT: Closed-Loop Global Motion Tracking for Whole-Body Humanoid Teleoperation [54.7399209456857]
We present CLOT, a real-time whole-body humanoid teleoperation system that achieves closed-loop global motion tracking. CLOT synchronizes operator and robot poses in a closed loop, enabling drift-free human-to-humanoid mimicry over long time horizons. We propose a data-driven randomization strategy that decouples observation trajectories from reward evaluation, enabling smooth and stable global corrections.
arXiv Detail & Related papers (2026-02-13T12:03:13Z) - FRoM-W1: Towards General Humanoid Whole-Body Control with Language Instructions [147.04372611893032]
We present FRoM-W1, an open-source framework designed to achieve general humanoid whole-body motion control using natural language. We extensively evaluate FRoM-W1 on Unitree H1 and G1 robots. Results demonstrate superior performance on the HumanML3D-X benchmark for human whole-body motion generation.
arXiv Detail & Related papers (2026-01-19T07:59:32Z) - MiVLA: Towards Generalizable Vision-Language-Action Model with Human-Robot Mutual Imitation Pre-training [102.850162490626]
We propose MiVLA, a vision-language-action model empowered by human-robot mutual imitation pre-training. We show that MiVLA achieves substantially improved generalization, outperforming state-of-the-art VLAs.
arXiv Detail & Related papers (2025-12-17T12:59:41Z) - Commanding Humanoid by Free-form Language: A Large Language Action Model with Unified Motion Vocabulary [59.98573566227095]
We introduce Humanoid-LLA, a Large Language Action Model that maps expressive language commands to physically executable whole-body actions for humanoid robots. Our approach integrates three core components: a unified motion vocabulary that aligns human and humanoid motion primitives into a shared discrete space; a vocabulary-directed controller distilled from a privileged policy to ensure physical feasibility; and a physics-informed fine-tuning stage using reinforcement learning with dynamics-aware rewards to enhance robustness and stability.
arXiv Detail & Related papers (2025-11-28T08:11:24Z) - From Language to Locomotion: Retargeting-free Humanoid Control via Motion Latent Guidance [55.31807046722006]
Existing language-guided humanoid pipelines are cumbersome and untrustworthy. We present RoboGhost, a retargeting-free framework that conditions humanoid policies on language-grounded motion latents. We show that RoboGhost substantially reduces deployment latency, improves success rates and tracking precision, and produces smooth, semantically aligned humanoid motion.
arXiv Detail & Related papers (2025-10-16T17:57:47Z) - DemoHLM: From One Demonstration to Generalizable Humanoid Loco-Manipulation [29.519071338337685]
We present DemoHLM, a framework for humanoid loco-manipulation on a real humanoid robot from a single demonstration in simulation. A whole-body controller maps whole-body motion commands to joint torques and provides omnidirectional mobility for the humanoid robot. Experiments show a positive correlation between the amount of synthetic data and policy performance.
arXiv Detail & Related papers (2025-10-13T10:49:40Z) - Pixel Motion Diffusion is What We Need for Robot Control [38.925028601732116]
DAWN is a unified diffusion-based framework for language-conditioned robotic manipulation. It bridges high-level motion intent and low-level robot action via a structured pixel motion representation. DAWN achieves state-of-the-art results on the challenging CALVIN benchmark.
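DAWN is summarized only at a high level here, but "diffusion-based" implies the usual iterative denoising loop: start from Gaussian noise and repeatedly apply a learned noise predictor conditioned on the instruction until a clean pixel-motion field remains. The following sketch is a textbook DDPM-style sampling loop under that reading; `eps_model`, the noise schedule, and the conditioning interface are illustrative assumptions rather than DAWN's implementation.

```python
import torch

@torch.no_grad()
def ddpm_sample(eps_model, cond, shape, timesteps=50, device="cpu"):
    """Iteratively denoise a pixel-motion field from Gaussian noise, conditioned on a
    language embedding `cond` (standard DDPM ancestral sampling; schedule is illustrative)."""
    betas = torch.linspace(1e-4, 0.02, timesteps, device=device)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape, device=device)            # start from pure noise
    for t in reversed(range(timesteps)):
        t_batch = torch.full((shape[0],), t, device=device)
        eps = eps_model(x, t_batch, cond)            # predicted noise at step t
        mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise      # ancestral sampling step
    return x                                         # denoised pixel-motion field
```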
arXiv Detail & Related papers (2025-09-26T17:59:59Z) - KungfuBot2: Learning Versatile Motion Skills for Humanoid Whole-Body Control [30.738592041595933]
We present VMS, a unified whole-body controller that enables humanoid robots to learn diverse and dynamic behaviors within a single policy. Our framework integrates a hybrid tracking objective that balances local motion fidelity with global trajectory consistency. We validate VMS extensively in both simulation and real-world experiments, demonstrating accurate imitation of dynamic skills, stable performance over minute-long sequences, and strong generalization to unseen motions.
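The "hybrid tracking objective" is described only in words here; one common way to balance local motion fidelity against global trajectory consistency is a weighted sum of two exponential tracking terms, one on joint poses and one on the root trajectory. The sketch below is that generic form under assumed weights and error terms, not the actual VMS objective.

```python
import numpy as np

def hybrid_tracking_reward(q, q_ref, root_pos, root_pos_ref,
                           w_local=0.7, w_global=0.3,
                           sigma_local=0.5, sigma_global=0.3):
    """Weighted sum of a local joint-pose tracking term and a global root-trajectory term.
    q, q_ref: joint angles; root_pos, root_pos_ref: root positions in the world frame.
    Weights and error scales are illustrative, not the published VMS objective."""
    local_err = np.sum((np.asarray(q) - np.asarray(q_ref)) ** 2)                  # local fidelity
    global_err = np.sum((np.asarray(root_pos) - np.asarray(root_pos_ref)) ** 2)   # global consistency
    r_local = np.exp(-local_err / sigma_local ** 2)
    r_global = np.exp(-global_err / sigma_global ** 2)
    return w_local * r_local + w_global * r_global
```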
arXiv Detail & Related papers (2025-09-20T11:31:14Z) - Learning Multi-Modal Whole-Body Control for Real-World Humanoid Robots [13.229028132036321]
The Masked Humanoid Controller (MHC) supports standing, walking, and mimicry of whole and partial-body motions. MHC imitates partially masked motions from a library of behaviors spanning standing, walking, optimized reference trajectories, re-targeted video clips, and human motion capture data. We demonstrate sim-to-real transfer on the real-world Digit V3 humanoid robot.
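Partial-body mimicry of the kind MHC performs is typically exposed to the policy as a masked target: joints outside the commanded subset are zeroed and the mask itself is appended to the observation, so the controller knows which joints it must track and which it may use freely for balance. A minimal sketch of that convention follows; the names, joint count, and layout are assumptions, not MHC's interface.

```python
import numpy as np

def build_masked_target(ref_pose, body_mask):
    """Zero out reference targets for joints that are not commanded and append the mask,
    so the policy sees which joints it must imitate and which it may use for balance.
    ref_pose: (J,) reference joint targets; body_mask: (J,) with 1 = track, 0 = free."""
    masked_target = np.asarray(ref_pose) * np.asarray(body_mask)
    return np.concatenate([masked_target, body_mask])   # fragment of the policy observation

# e.g. command only the upper body (first 6 joints) while the legs stay free
J = 12
ref = np.random.uniform(-1.0, 1.0, J)
mask = np.array([1.0] * 6 + [0.0] * 6)
obs_fragment = build_masked_target(ref, mask)            # shape (2 * J,)
```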
arXiv Detail & Related papers (2024-07-30T09:10:24Z) - InterControl: Zero-shot Human Interaction Generation by Controlling Every Joint [67.6297384588837]
We introduce a novel controllable motion generation method, InterControl, to encourage synthesized motions to maintain the desired distance between joint pairs.
We demonstrate that the desired distances between joint pairs for human interactions can be generated using an off-the-shelf Large Language Model.
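A joint-pair distance constraint of this kind is naturally expressed as a differentiable penalty on the generated poses, which can then steer a motion generator toward the LLM-specified spacing. The sketch below shows such a penalty in generic form; the function name and tensor layout are assumptions, not InterControl's actual loss.

```python
import torch

def joint_distance_loss(joints, pairs, target_dists):
    """Penalize deviation of selected joint-pair distances from desired values.
    joints: (T, J, 3) generated joint positions; pairs: list of (i, j) joint indices;
    target_dists: desired distance per pair (e.g. produced by an LLM from a text prompt)."""
    loss = joints.new_zeros(())
    for (i, j), d in zip(pairs, target_dists):
        dist = torch.norm(joints[:, i] - joints[:, j], dim=-1)   # (T,) per-frame distance
        loss = loss + ((dist - d) ** 2).mean()
    return loss / max(len(pairs), 1)

# e.g. keep joints 20 and 21 (two hands, illustrative indices) about 0.1 m apart
joints = torch.randn(60, 22, 3, requires_grad=True)
penalty = joint_distance_loss(joints, pairs=[(20, 21)], target_dists=[0.1])
penalty.backward()                                        # gradients can guide generation
```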
arXiv Detail & Related papers (2023-11-27T14:32:33Z)