InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy
- URL: http://arxiv.org/abs/2510.13778v1
- Date: Wed, 15 Oct 2025 17:30:05 GMT
- Title: InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy
- Authors: Xinyi Chen, Yilun Chen, Yanwei Fu, Ning Gao, Jiaya Jia, Weiyang Jin, Hao Li, Yao Mu, Jiangmiao Pang, Yu Qiao, Yang Tian, Bin Wang, Bolun Wang, Fangjing Wang, Hanqing Wang, Tai Wang, Ziqin Wang, Xueyuan Wei, Chao Wu, Shuai Yang, Jinhui Ye, Junqiu Yu, Jia Zeng, Jingjing Zhang, Jinyu Zhang, Shi Zhang, Feng Zheng, Bowen Zhou, Yangkun Zhu
- Abstract summary: We introduce InternVLA-M1, a unified framework for spatial grounding and robot control. InternVLA-M1 employs a two-stage pipeline: (i) spatial grounding pre-training on over 2.3M spatial reasoning samples, and (ii) spatially guided action post-training. Results: InternVLA-M1 outperforms its variant without spatial guidance by +14.6% on SimplerEnv Google Robot, +17% on WidowX, and +4.3% on LIBERO Franka.
- Score: 138.89177083578213
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce InternVLA-M1, a unified framework for spatial grounding and robot control that advances instruction-following robots toward scalable, general-purpose intelligence. Its core idea is spatially guided vision-language-action training, where spatial grounding serves as the critical link between instructions and robot actions. InternVLA-M1 employs a two-stage pipeline: (i) spatial grounding pre-training on over 2.3M spatial reasoning samples to determine "where to act" by aligning instructions with visual, embodiment-agnostic positions, and (ii) spatially guided action post-training to decide "how to act" by generating embodiment-aware actions through plug-and-play spatial prompting. This spatially guided training recipe yields consistent gains: InternVLA-M1 outperforms its variant without spatial guidance by +14.6% on SimplerEnv Google Robot, +17% on WidowX, and +4.3% on LIBERO Franka, while demonstrating stronger spatial reasoning capability in box, point, and trace prediction. To further scale instruction following, we built a simulation engine to collect 244K generalizable pick-and-place episodes, enabling a 6.2% average improvement across 200 tasks and 3K+ objects. In real-world cluttered pick-and-place, InternVLA-M1 improved by 7.3%, and with synthetic co-training, achieved +20.6% on unseen objects and novel configurations. Moreover, in long-horizon reasoning-intensive scenarios, it surpassed existing works by over 10%. These results highlight spatially guided training as a unifying principle for scalable and resilient generalist robots. Code and models are available at https://github.com/InternRobotics/InternVLA-M1.
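To make the two-stage recipe concrete, the sketch below renders the described pipeline in PyTorch. It is a minimal, hypothetical reading of the abstract, not the authors' implementation: the module names (SpatialVLM, SpatiallyGuidedPolicy), feature dimensions, and the L1 grounding / MSE action losses are all illustrative placeholders.

```python
# Minimal sketch of the two-stage "spatially guided" recipe described in
# the abstract. NOT the authors' code: module names, feature dimensions,
# and losses are illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialVLM(nn.Module):
    """Stage 1 backbone: aligns instructions with embodiment-agnostic
    visual positions (box/point/trace prediction, "where to act")."""
    def __init__(self, dim=256):
        super().__init__()
        self.vision = nn.Linear(512, dim)    # stand-in for a vision encoder
        self.language = nn.Linear(768, dim)  # stand-in for a language model
        self.ground_head = nn.Linear(2 * dim, 4)  # e.g. a 2D box (x1, y1, x2, y2)

    def forward(self, image_feat, text_feat):
        fused = torch.cat([self.vision(image_feat), self.language(text_feat)], dim=-1)
        return self.ground_head(fused)

class SpatiallyGuidedPolicy(nn.Module):
    """Stage 2 action expert: consumes the VLM's spatial prediction as a
    plug-and-play prompt and outputs embodiment-aware actions ("how to act")."""
    def __init__(self, vlm, action_dim=7):
        super().__init__()
        self.vlm = vlm
        self.action_head = nn.Sequential(
            nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, action_dim))

    def forward(self, image_feat, text_feat):
        with torch.no_grad():  # spatial prompt comes from the pre-trained VLM
            spatial_prompt = self.vlm(image_feat, text_feat)
        return self.action_head(spatial_prompt)

# Stage 1: spatial grounding pre-training on (image, instruction, box) triples.
vlm = SpatialVLM()
img, txt, box = torch.randn(8, 512), torch.randn(8, 768), torch.rand(8, 4)
grounding_loss = F.l1_loss(vlm(img, txt), box)

# Stage 2: spatially guided action post-training on robot demonstrations.
policy = SpatiallyGuidedPolicy(vlm)
demo_actions = torch.randn(8, 7)
action_loss = F.mse_loss(policy(img, txt), demo_actions)
print(grounding_loss.item(), action_loss.item())
```

Under this reading, the spatial prediction acts as an embodiment-agnostic interface: the same grounding backbone can drive different action heads, which is consistent with the abstract's claim that spatial grounding links instructions to actions across embodiments.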
Related papers
- Robometer: Scaling General-Purpose Robotic Reward Models via Trajectory Comparisons [69.87766750714945]
General-purpose robot reward models are typically trained to predict absolute task progress from expert demonstrations. We introduce Robometer, a scalable reward modeling framework that combines intra-trajectory progress supervision with inter-trajectory preference supervision. Robometer is trained with a dual objective: a frame-level progress loss that anchors reward magnitude on expert data, and a trajectory-comparison preference loss that imposes global ordering constraints.
arXiv Detail & Related papers (2026-03-02T17:38:58Z) - Universal Pose Pretraining for Generalizable Vision-Language-Action Policies [83.39008378156647]
Existing Vision-Language-Action (VLA) models often suffer from feature collapse and low training efficiency. We propose Pose-VLA, a decoupled paradigm that separates VLA training into a pre-training phase for extracting universal 3D spatial priors. Our framework follows a two-stage pre-training pipeline, establishing fundamental spatial grounding via poses followed by motion alignment.
arXiv Detail & Related papers (2026-02-23T11:00:08Z) - MolmoSpaces: A Large-Scale Open Ecosystem for Robot Navigation and Manipulation [56.30931340537373]
MolmoSpaces is a fully open ecosystem to support benchmarking of robot policies. MolmoSpaces consists of over 230k diverse indoor environments. MolmoSpaces-Bench is a benchmark suite of 8 tasks in which robots interact with our diverse scenes and richly annotated objects.
arXiv Detail & Related papers (2026-02-11T20:16:31Z) - Instruct2Act: From Human Instruction to Actions Sequencing and Execution via Robot Action Network for Robotic Manipulation [14.833622989644352]
We develop a lightweight, fully on-device pipeline that converts natural-language commands into reliable manipulation. Instruct2Act attains 91.5% sub-action prediction accuracy while retaining a small footprint. Results demonstrate that fine-grained instruction-to-action parsing, coupled with DATRN-based trajectory generation and vision-guided grounding, provides a practical path to deterministic, real-time manipulation.
arXiv Detail & Related papers (2026-02-10T16:25:39Z) - SpaceVista: All-Scale Visual Spatial Reasoning from mm to km [43.506658643163405]
This paper aims to advance all-scale spatial reasoning across diverse scenarios by tackling two key challenges, including the heavy reliance on indoor 3D scans and labor-intensive manual annotation for dataset curation. We introduce a holistic solution that integrates a structured spatial reasoning system, scale-aware modeling, and a progressive training paradigm.
arXiv Detail & Related papers (2025-10-10T17:59:46Z) - RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics [54.441878000440965]
Spatial referring is a fundamental capability of embodied robots to interact with the 3D physical world. We propose RoboRefer, a 3D-aware VLM that first achieves precise spatial understanding. The RFT-trained RoboRefer achieves state-of-the-art spatial understanding, with an average success rate of 89.6%.
arXiv Detail & Related papers (2025-06-04T17:59:27Z) - SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model [45.03115608632622]
Spatial understanding is key to robot manipulation. We propose SpatialVLA to explore effective spatial representations for the robot foundation model. We show that the proposed Adaptive Action Grids offer a new and effective way to fine-tune the pre-trained SpatialVLA model for new simulation and real-world setups.
arXiv Detail & Related papers (2025-01-27T07:34:33Z) - Open-vocabulary Mobile Manipulation in Unseen Dynamic Environments with 3D Semantic Maps [16.083092305930844]
Open-Vocabulary Mobile Manipulation (OVMM) is a crucial capability for autonomous robots.
We propose a novel framework that leverages the zero-shot detection and grounded recognition capabilities.
We have built a 10-DoF mobile manipulation robotic platform, JSR-1, and demonstrated it in real-world robot experiments.
arXiv Detail & Related papers (2024-06-26T07:06:42Z) - Habitat 2.0: Training Home Assistants to Rearrange their Habitat [122.54624752876276]
We introduce Habitat 2.0 (H2.0), a simulation platform for training virtual robots in interactive 3D environments.
We make contributions to all levels of the embodied AI stack - data, simulation, and benchmark tasks.
arXiv Detail & Related papers (2021-06-28T05:42:15Z) - Sim-to-Real Transfer for Vision-and-Language Navigation [70.86250473583354]
We study the problem of releasing a robot in a previously unseen environment, and having it follow unconstrained natural language navigation instructions.
Recent work on the task of Vision-and-Language Navigation (VLN) has achieved significant progress in simulation.
To assess the implications of this work for robotics, we transfer a VLN agent trained in simulation to a physical robot.
arXiv Detail & Related papers (2020-11-07T16:49:04Z)