SPEAR-1: Scaling Beyond Robot Demonstrations via 3D Understanding
- URL: http://arxiv.org/abs/2511.17411v1
- Date: Fri, 21 Nov 2025 17:09:43 GMT
- Title: SPEAR-1: Scaling Beyond Robot Demonstrations via 3D Understanding
- Authors: Nikolay Nikolov, Giuliano Albanese, Sombit Dey, Aleksandar Yanev, Luc Van Gool, Jan-Nico Zaech, Danda Pani Paudel
- Abstract summary: Robotic Foundation Models (RFMs) hold great promise as generalist, end-to-end systems for robot control. We propose to enrich easy-to-collect non-robotic image data with 3D annotations and enhance a pretrained VLM with 3D understanding capabilities. We introduce our main contribution, $\textbf{SPEAR-1}$: a robotic foundation model that integrates grounded 3D perception with language-instructed embodied control.
- Score: 78.12178144115224
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Robotic Foundation Models (RFMs) hold great promise as generalist, end-to-end systems for robot control. Yet their ability to generalize across new environments, tasks, and embodiments remains limited. We argue that a major bottleneck lies in their foundations: most RFMs are built by fine-tuning internet-pretrained Vision-Language Models (VLMs). However, these VLMs are trained on 2D image-language tasks and lack the 3D spatial reasoning inherently required for embodied control in the 3D world. Bridging this gap directly with large-scale robotic data is costly and difficult to scale. Instead, we propose to enrich easy-to-collect non-robotic image data with 3D annotations and enhance a pretrained VLM with 3D understanding capabilities. Following this strategy, we train SPEAR-VLM, a 3D-aware VLM that infers object coordinates in 3D space from a single 2D image. Building on SPEAR-VLM, we introduce our main contribution, $\textbf{SPEAR-1}$: a robotic foundation model that integrates grounded 3D perception with language-instructed embodied control. Trained on $\sim$45M frames from 24 Open X-Embodiment datasets, SPEAR-1 outperforms or matches state-of-the-art models such as $\pi_0$-FAST and $\pi_{0.5}$, while it uses 20$\times$ fewer robot demonstrations. This carefully engineered training strategy unlocks new VLM capabilities and, as a consequence, boosts the reliability of embodied control beyond what is achievable with only robotic data. We make our model weights and 3D-annotated datasets publicly available.
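A minimal sketch of the kind of 3D grounding the abstract describes: inferring a metric 3D coordinate for an object from a single 2D image, here via a depth map and pinhole camera intrinsics. This is an illustrative assumption of how non-robotic images could be enriched with 3D annotations, not the authors' pipeline; the function and parameter names (backproject_pixel, fx, fy, cx, cy) are hypothetical.

```python
# Minimal sketch (assumed, not SPEAR's code): lift a 2D pixel to a 3D point
# in the camera frame using a metric depth map and pinhole intrinsics.
import numpy as np

def backproject_pixel(u, v, depth, fx, fy, cx, cy):
    """Return the (x, y, z) camera-frame coordinates of pixel (u, v)."""
    z = float(depth[v, u])          # metric depth (meters) at the pixel
    x = (u - cx) * z / fx           # pinhole model, x axis
    y = (v - cy) * z / fy           # pinhole model, y axis
    return np.array([x, y, z])

# Example: annotate an object's center pixel with its 3D coordinate.
depth = np.full((480, 640), 1.5, dtype=np.float32)  # dummy 1.5 m depth map
point_3d = backproject_pixel(320, 240, depth, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
print(point_3d)  # [0.  0.  1.5] -- the object sits 1.5 m in front of the camera
```

Per-object 3D coordinates attached to ordinary 2D images are the type of supervision that could let a VLM learn 3D-aware predictions without robot demonstrations.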
Related papers
- PointWorld: Scaling 3D World Models for In-The-Wild Robotic Manipulation [48.807071017228964]
We introduce PointWorld, a large pre-trained 3D world model that unifies state and action in a shared 3D space as 3D point flows. With a real-time (0.1 s) inference speed, PointWorld can be efficiently integrated into the model-predictive control (MPC) framework for manipulation. We demonstrate that a single pre-trained checkpoint enables a real-world Franka robot to perform rigid-body pushing as well as deformable and articulated object manipulation.
arXiv Detail & Related papers (2026-01-07T10:29:12Z)
- LocateAnything3D: Vision-Language 3D Detection with Chain-of-Sight [105.9472902251177]
We present a VLM-native recipe that casts 3D detection as a next-token prediction problem. Our model achieves state-of-the-art results with 49.89 AP_3D, surpassing the previous best by an absolute improvement of +15.51.
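As a hedged illustration of what casting 3D detection as next-token prediction can look like (a generic sketch, not the LocateAnything3D implementation), the snippet below quantizes a 3D box into integer tokens that an autoregressive decoder could be trained to emit; the bin count, value range, and box parameterization are assumptions.

```python
# Generic sketch (assumed): serialize a 3D bounding box into discrete tokens
# so a language-model decoder can predict it one token at a time.
NUM_BINS = 1000              # quantization resolution per value
VALUE_RANGE = (-10.0, 10.0)  # assumed metric range covered by the bins

def to_token(value, lo=VALUE_RANGE[0], hi=VALUE_RANGE[1], bins=NUM_BINS):
    """Clamp a continuous value and map it to an integer token id."""
    value = min(max(value, lo), hi)
    return int((value - lo) / (hi - lo) * (bins - 1))

def box_to_tokens(box):
    """Serialize a 3D box (x, y, z, w, h, l, yaw) into a flat token sequence."""
    return [to_token(v) for v in box]

print(box_to_tokens([1.2, -0.4, 3.5, 0.6, 0.5, 1.1, 0.3]))
# [559, 479, 674, 529, 524, 554, 514] -- ids appended to the text vocabulary
```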
arXiv Detail & Related papers (2025-11-25T18:59:45Z)
- 4D Visual Pre-training for Robot Learning [71.22906081161324]
General visual representations learned from web-scale datasets for robotics have achieved great success in recent years. However, these pre-trained representations are mostly trained on 2D images, neglecting the inherent 3D nature of the world. As an alternative, we seek a general visual pre-training framework that can improve all 3D representations. Our framework, called FVP, is a novel 4D Visual Pre-training framework for real-world robot learning.
arXiv Detail & Related papers (2025-08-24T07:06:56Z)
- EmbodiedMAE: A Unified 3D Multi-Modal Representation for Robot Manipulation [44.08442553098017]
EmbodiedMAE is a unified 3D representation for robot manipulation. EmbodiedMAE consistently outperforms state-of-the-art vision foundation models.
arXiv Detail & Related papers (2025-05-15T09:12:17Z)
- Locate 3D: Real-World Object Localization via Self-Supervised Learning in 3D [68.23391872643268]
LOCATE 3D is a model for localizing objects in 3D scenes from referring expressions like "the small coffee table between the sofa and the lamp". It operates directly on sensor observation streams (posed RGB-D frames), enabling real-world deployment on robots and AR devices.
arXiv Detail & Related papers (2025-04-19T02:51:24Z)
- FP3: A 3D Foundation Policy for Robotic Manipulation [12.115347477632783]
We introduce FP3, the first large-scale 3D foundation policy model for robotic manipulation. With only 80 demonstrations, FP3 can learn a new task with over 90% success rate in novel environments with unseen objects.
arXiv Detail & Related papers (2025-03-11T23:01:08Z)
- From Thousands to Billions: 3D Visual Language Grounding via Render-Supervised Distillation from 2D VLMs [64.28181017898369]
LIFT-GS predicts 3D Gaussian representations from point clouds and uses them to render predicted language-conditioned 3D masks into 2D views. LIFT-GS achieves state-of-the-art results with 25.7% mAP on open-vocabulary instance segmentation. Remarkably, pretraining effectively multiplies the fine-tuning dataset by 2$\times$, demonstrating strong scaling properties.
arXiv Detail & Related papers (2025-02-27T18:59:11Z)
- Lift3D Foundation Policy: Lifting 2D Large-Scale Pretrained Models for Robust 3D Robotic Manipulation [30.744137117668643]
Lift3D is a framework that enhances 2D foundation models with implicit and explicit 3D robotic representations to construct a robust 3D manipulation policy. In experiments, Lift3D consistently outperforms previous state-of-the-art methods across several simulation benchmarks and real-world scenarios.
arXiv Detail & Related papers (2024-11-27T18:59:52Z)
- 3D-LLM: Injecting the 3D World into Large Language Models [60.43823088804661]
Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning.
We propose to inject the 3D world into large language models and introduce a new family of 3D-LLMs.
Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks.
arXiv Detail & Related papers (2023-07-24T17:59:02Z)