Xiaomi-Robotics-0: An Open-Sourced Vision-Language-Action Model with Real-Time Execution
- URL: http://arxiv.org/abs/2602.12684v1
- Date: Fri, 13 Feb 2026 07:30:43 GMT
- Title: Xiaomi-Robotics-0: An Open-Sourced Vision-Language-Action Model with Real-Time Execution
- Authors: Rui Cai, Jun Guo, Xinze He, Piaopiao Jin, Jie Li, Bingxuan Lin, Futeng Liu, Wei Liu, Fei Ma, Kun Ma, Feng Qiu, Heng Qu, Yifei Su, Qiao Sun, Dong Wang, Donghao Wang, Yunhong Wang, Rujie Wu, Diyun Xiang, Yu Yang, Hangjun Ye, Yuan Zhang, Quanyun Zhou
- Abstract summary: We introduce Xiaomi-Robotics-0, an advanced vision-language-action (VLA) model optimized for high performance and fast and smooth real-time execution. Xiaomi-Robotics-0 is first pre-trained on large-scale cross-embodiment robot trajectories and vision-language data, endowing it with broad and generalizable action-generation capabilities. We evaluate Xiaomi-Robotics-0 extensively in simulation benchmarks and on two challenging real-robot tasks that require precise and dexterous bimanual manipulation.
- Score: 32.93468341343403
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this report, we introduce Xiaomi-Robotics-0, an advanced vision-language-action (VLA) model optimized for high performance and fast and smooth real-time execution. The key to our method lies in a carefully designed training recipe and deployment strategy. Xiaomi-Robotics-0 is first pre-trained on large-scale cross-embodiment robot trajectories and vision-language data, endowing it with broad and generalizable action-generation capabilities while avoiding catastrophic forgetting of the visual-semantic knowledge of the underlying pre-trained VLM. During post-training, we propose several techniques for training the VLA model for asynchronous execution to address the inference latency during real-robot rollouts. During deployment, we carefully align the timesteps of consecutive predicted action chunks to ensure continuous and seamless real-time rollouts. We evaluate Xiaomi-Robotics-0 extensively in simulation benchmarks and on two challenging real-robot tasks that require precise and dexterous bimanual manipulation. Results show that our method achieves state-of-the-art performance across all simulation benchmarks. Moreover, Xiaomi-Robotics-0 can roll out fast and smoothly on real robots using a consumer-grade GPU, achieving high success rates and throughput on both real-robot tasks. To facilitate future research, code and model checkpoints are open-sourced at https://xiaomi-robotics-0.github.io
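The abstract mentions aligning the timesteps of consecutive predicted action chunks so that rollouts stay continuous despite inference latency. The following is a minimal toy sketch of that general idea, not the authors' implementation: it assumes a fixed latency measured in control steps, a hypothetical `toy_policy` whose actions are simply labeled with their absolute timestep, and a chunk length of at least twice the latency. All function names and parameters here are illustrative assumptions.

```python
from typing import List, Optional, Sequence, Tuple


def align_chunk(chunk: Sequence[float], obs_step: int, now_step: int) -> List[float]:
    """Drop actions whose timesteps already passed while inference was running."""
    stale = now_step - obs_step
    if stale < 0:
        raise ValueError("a chunk cannot arrive before its observation")
    return list(chunk[stale:])


def toy_policy(obs_step: int, chunk_len: int) -> List[float]:
    """Hypothetical policy: action k of the chunk is labeled obs_step + k."""
    return [float(obs_step + k) for k in range(chunk_len)]


def rollout(total_steps: int, chunk_len: int, latency_steps: int) -> List[float]:
    """Simulate asynchronous execution with a fixed inference latency."""
    if chunk_len < 2 * max(latency_steps, 1):
        raise ValueError("chunk too short to cover the inference latency")
    executed: List[float] = []
    current = toy_policy(0, chunk_len)  # assume the first chunk is ready at step 0
    current_start = 0                   # absolute timestep of current[0]
    pending: Optional[Tuple[int, int, List[float]]] = None
    for step in range(total_steps):
        # Deliver a finished inference result and align it to the current step.
        if pending is not None and pending[0] <= step:
            _, obs_step, chunk = pending
            current = align_chunk(chunk, obs_step, step)
            current_start = step
            pending = None
        # Start a new inference in the background if none is in flight.
        if pending is None:
            pending = (step + latency_steps, step, toy_policy(step, chunk_len))
        executed.append(current[step - current_start])
    return executed
```

Because each toy action is labeled with its own timestep, a rollout is continuous exactly when the executed sequence counts up without gaps or repeats, for example `rollout(20, 8, 3)`. A real policy would additionally need to keep the new chunk consistent with the actions already executed during inference, which this sketch does not model.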
Related papers
- Robometer: Scaling General-Purpose Robotic Reward Models via Trajectory Comparisons [69.87766750714945]
General-purpose robot reward models are typically trained to predict absolute task progress from expert demonstrations. We introduce Robometer, a scalable reward modeling framework that combines intra-trajectory progress supervision with inter-trajectory preference supervision. Robometer is trained with a dual objective: a frame-level progress loss that anchors reward magnitude on expert data, and a trajectory-comparison preference loss that imposes global ordering constraints.
arXiv Detail & Related papers (2026-03-02T17:38:58Z)
- World Action Models are Zero-shot Policies [111.91938055103633]
We introduce DreamZero, a World Action Model (WAM) built upon a pretrained video diffusion backbone. By jointly modeling video and action, DreamZero learns diverse skills effectively from heterogeneous robot data. We demonstrate two forms of cross-embodiment transfer: video-only demonstrations from other robots or humans yield a relative improvement of over 42% on unseen task performance.
arXiv Detail & Related papers (2026-02-17T15:04:02Z)
- MimicDreamer: Aligning Human and Robot Demonstrations for Scalable VLA Training [40.45924128424013]
We propose MimicDreamer, a framework that turns low-cost human demonstrations into robot-usable supervision. For visual alignment, we propose H2R Aligner, a video diffusion model that generates high-fidelity robot demonstration videos. For viewpoint stabilization, EgoStabilizer is proposed, which canonicalizes egocentric videos via homography. For action alignment, we map human hand trajectories to the robot frame and apply a constrained inverse kinematics solver.
arXiv Detail & Related papers (2025-09-26T11:05:10Z)
- Physical Autoregressive Model for Robotic Manipulation without Action Pretraining [65.8971623698511]
We build upon autoregressive video generation models to propose a Physical Autoregressive Model (PAR). PAR leverages the world knowledge embedded in video pretraining to understand physical dynamics without requiring action pretraining. Experiments on the ManiSkill benchmark show that PAR achieves a 100% success rate on the PushCube task.
arXiv Detail & Related papers (2025-08-13T13:54:51Z)
- ORV: 4D Occupancy-centric Robot Video Generation [33.360345403049685]
Acquiring real-world robotic simulation data through teleoperation is notoriously time-consuming and labor-intensive. We propose ORV, an Occupancy-centric Robot Video generation framework, which utilizes 4D semantic occupancy sequences as a fine-grained representation. By leveraging occupancy-based representations, ORV enables seamless translation of simulation data into photorealistic robot videos, while ensuring high temporal consistency and precise controllability.
arXiv Detail & Related papers (2025-06-03T17:00:32Z)
- Robot-R1: Reinforcement Learning for Enhanced Embodied Reasoning in Robotics [55.05920313034645]
We introduce Robot-R1, a novel framework that leverages reinforcement learning to enhance embodied reasoning specifically for robot control. Inspired by the DeepSeek-R1 learning approach, Robot-R1 samples reasoning-based responses and reinforces those that lead to more accurate predictions. Our experiments show that models trained with Robot-R1 outperform SFT methods on embodied reasoning tasks.
arXiv Detail & Related papers (2025-05-29T16:41:12Z)
- FAST: Efficient Action Tokenization for Vision-Language-Action Models [98.15494168962563]
We propose a new compression-based tokenization scheme for robot actions, based on the discrete cosine transform. Based on FAST, we release FAST+, a universal robot action tokenizer, trained on 1M real robot action trajectories.
arXiv Detail & Related papers (2025-01-16T18:57:04Z)
- IRASim: A Fine-Grained World Model for Robot Manipulation [24.591694756757278]
We present IRASim, a novel world model capable of generating videos with fine-grained robot-object interaction details. We train a diffusion transformer and introduce a novel frame-level action-conditioning module within each transformer block to explicitly model and strengthen the action-frame alignment.
arXiv Detail & Related papers (2024-06-20T17:50:16Z)
- RoboMamba: Efficient Vision-Language-Action Model for Robotic Reasoning and Manipulation [39.44358155600282]
We introduce RoboMamba, an end-to-end robotic VLA model that delivers both robotic reasoning and action capabilities. Specifically, we first integrate the vision encoder with Mamba, aligning visual tokens with language embedding through co-training. We find that once RoboMamba possesses sufficient reasoning capability, it can acquire manipulation skills with minimal fine-tuning parameters.
arXiv Detail & Related papers (2024-06-06T17:59:47Z)
- Robot Learning with Sensorimotor Pre-training [98.7755895548928]
We present a self-supervised sensorimotor pre-training approach for robotics.
Our model, called RPT, is a Transformer that operates on sequences of sensorimotor tokens.
We find that sensorimotor pre-training consistently outperforms training from scratch, has favorable scaling properties, and enables transfer across different tasks, environments, and robots.
arXiv Detail & Related papers (2023-06-16T17:58:10Z)