MolmoSpaces: A Large-Scale Open Ecosystem for Robot Navigation and Manipulation
- URL: http://arxiv.org/abs/2602.11337v2
- Date: Thu, 19 Feb 2026 00:59:08 GMT
- Title: MolmoSpaces: A Large-Scale Open Ecosystem for Robot Navigation and Manipulation
- Authors: Yejin Kim, Wilbert Pumacay, Omar Rayyan, Max Argus, Winson Han, Eli VanderBilt, Jordi Salvador, Abhay Deshpande, Rose Hendrix, Snehal Jauhri, Shuo Liu, Nur Muhammad Mahi Shafiullah, Maya Guru, Ainaz Eftekhar, Karen Farley, Donovan Clay, Jiafei Duan, Arjun Guru, Piper Wolters, Alvaro Herrasti, Ying-Chun Lee, Georgia Chalvatzaki, Yuchen Cui, Ali Farhadi, Dieter Fox, Ranjay Krishna
- Abstract summary: MolmoSpaces is a fully open ecosystem to support benchmarking of robot policies. MolmoSpaces consists of over 230k diverse indoor environments. MolmoSpaces-Bench is a benchmark suite of 8 tasks in which robots interact with our diverse scenes and richly annotated objects.
- Score: 56.30931340537373
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deploying robots at scale demands robustness to the long tail of everyday situations. The countless variations in scene layout, object geometry, and task specifications that characterize real environments are vast and underrepresented in existing robot benchmarks. Measuring this level of generalization requires infrastructure at a scale and diversity that physical evaluation alone cannot provide. We introduce MolmoSpaces, a fully open ecosystem to support large-scale benchmarking of robot policies. MolmoSpaces consists of over 230k diverse indoor environments, ranging from handcrafted household scenes to procedurally generated multiroom houses, populated with 130k richly annotated object assets, including 48k manipulable objects with 42M stable grasps. Crucially, these environments are simulator-agnostic, supporting popular options such as MuJoCo, Isaac, and ManiSkill. The ecosystem supports the full spectrum of embodied tasks: static and mobile manipulation, navigation, and multiroom long-horizon tasks requiring coordinated perception, planning, and interaction across entire indoor environments. We also design MolmoSpaces-Bench, a benchmark suite of 8 tasks in which robots interact with our diverse scenes and richly annotated objects. Our experiments show MolmoSpaces-Bench exhibits strong sim-to-real correlation (R = 0.96, ρ = 0.98), confirm newer and stronger zero-shot policies outperform earlier versions in our benchmarks, and identify key sensitivities to prompt phrasing, initial joint positions, and camera occlusion. Through MolmoSpaces and its open-source assets and tooling, we provide a foundation for scalable data generation, policy training, and benchmark creation for robot learning research.
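The headline sim-to-real result (R = 0.96, ρ = 0.98) reports Pearson and Spearman correlations between policies' success rates measured in simulation and on real robots. As a minimal illustrative sketch (not code from the paper; the success-rate values and variable names below are made-up placeholders), such coefficients can be computed with SciPy:

```python
# Illustrative sketch of a sim-to-real correlation analysis.
# The numbers are hypothetical placeholders, not data from the paper.
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-policy success rates on the same task suite.
sim_success  = [0.82, 0.55, 0.71, 0.34, 0.90, 0.48]  # simulated benchmark runs
real_success = [0.78, 0.50, 0.69, 0.30, 0.88, 0.52]  # matched real-robot evaluations

r, r_pval = pearsonr(sim_success, real_success)       # linear correlation (R)
rho, rho_pval = spearmanr(sim_success, real_success)  # rank correlation (rho)

print(f"Pearson R = {r:.2f} (p = {r_pval:.3f})")
print(f"Spearman rho = {rho:.2f} (p = {rho_pval:.3f})")
```

A high Pearson R indicates the simulated and real success rates agree linearly, while a high Spearman ρ indicates the benchmark preserves the ranking of policies, which is what matters when using simulation to select among candidate policies.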
Related papers
- SceneSmith: Agentic Generation of Simulation-Ready Indoor Scenes [19.995619927680476]
SceneSmith builds environments from architectural layout through natural furniture population. SceneSmith generates more objects than prior methods, with 2% inter-object collisions and 96% of objects remaining stable under physics simulation. SceneSmith environments can be used in an end-to-end pipeline for automatic policy evaluation.
arXiv Detail & Related papers (2026-02-09T19:56:04Z) - Generalizable Geometric Prior and Recurrent Spiking Feature Learning for Humanoid Robot Manipulation [90.90219129619344]
This paper presents R-prior-S, a novel Recurrent Geometric-prior Policy with Spiking features. To ground high-level reasoning in physical reality, we leverage lightweight 2D geometric inductive biases. To address data efficiency in robotic action generation, we introduce a Recursive Adaptive Spiking Network.
arXiv Detail & Related papers (2026-01-13T23:36:30Z) - InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy [138.89177083578213]
We introduce InternVLA-M1, a unified framework for spatial grounding and robot control. InternVLA-M1 employs a two-stage pipeline: (i) spatial grounding pre-training on over 2.3M spatial reasoning examples, and (ii) spatially guided action post-training. InternVLA-M1 outperforms its variant without spatial guidance by +14.6% on SimplerEnv Google Robot, +17% on WidowX, and +4.3% on LIBERO Franka.
arXiv Detail & Related papers (2025-10-15T17:30:05Z) - Meta-Memory: Retrieving and Integrating Semantic-Spatial Memories for Robot Spatial Reasoning [5.740131013400576]
We propose Meta-Memory, a large language model (LLM)-driven agent that constructs a high-density memory representation of the environment. The key innovation of Meta-Memory lies in its capacity to retrieve and integrate relevant memories through joint reasoning over semantic and spatial modalities. Experimental results show that Meta-Memory significantly outperforms state-of-the-art methods on both the SpaceLocQA and the public NaVQA benchmarks.
arXiv Detail & Related papers (2025-09-25T05:22:52Z) - λ: A Benchmark for Data-Efficiency in Long-Horizon Indoor Mobile Manipulation Robotics [11.901933884058021]
We introduce the LAMBDA benchmark (Long-horizon Actions for Mobile-manipulation Benchmarking of Directed Activities). Our benchmark includes 571 human-collected demonstrations that provide realism and diversity in simulated and real-world settings. We find that learning methods, even when pretrained, yield lower success rates, while a neuro-symbolic method performs significantly better and requires less data.
arXiv Detail & Related papers (2024-11-28T19:31:50Z) - M3Bench: Benchmarking Whole-body Motion Generation for Mobile Manipulation in 3D Scenes [66.44171200767839]
M3Bench is a new benchmark for whole-body motion generation in mobile manipulation tasks. M3Bench features 30,000 object rearrangement tasks across 119 diverse scenes. M3Bench and M3BenchMaker aim to advance robotics research toward more adaptive and capable mobile manipulation.
arXiv Detail & Related papers (2024-10-09T08:38:21Z) - Open-vocabulary Mobile Manipulation in Unseen Dynamic Environments with 3D Semantic Maps [16.083092305930844]
Open-Vocabulary Mobile Manipulation (OVMM) is a crucial capability for autonomous robots.
We propose a novel framework that leverages the zero-shot detection and grounded recognition capabilities of pretrained vision-language models.
We build a 10-DoF mobile manipulation robotic platform, JSR-1, and demonstrate the framework's effectiveness in real-world robot experiments.
arXiv Detail & Related papers (2024-06-26T07:06:42Z) - RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots [25.650235551519952]
We present RoboCasa, a large-scale simulation framework for training generalist robots in everyday environments.
We provide thousands of 3D assets across over 150 object categories and dozens of interactable furniture and appliances.
Our experiments show a clear scaling trend in using synthetically generated robot data for large-scale imitation learning.
arXiv Detail & Related papers (2024-06-04T17:41:31Z) - MetaGraspNet: A Large-Scale Benchmark Dataset for Vision-driven Robotic Grasping via Physics-based Metaverse Synthesis [78.26022688167133]
We present a large-scale benchmark dataset for vision-driven robotic grasping via physics-based metaverse synthesis.
The proposed dataset contains 100,000 images and 25 different object types.
We also propose a new layout-weighted performance metric alongside the dataset for evaluating object detection and segmentation performance.
arXiv Detail & Related papers (2021-12-29T17:23:24Z) - Learning to Move with Affordance Maps [57.198806691838364]
The ability to autonomously explore and navigate a physical space is a fundamental requirement for virtually any mobile autonomous agent.
Traditional SLAM-based approaches for exploration and navigation largely focus on leveraging scene geometry.
We show that learned affordance maps can be used to augment traditional approaches for both exploration and navigation, providing significant improvements in performance.
arXiv Detail & Related papers (2020-01-08T04:05:11Z)