FP3: A 3D Foundation Policy for Robotic Manipulation
- URL: http://arxiv.org/abs/2503.08950v1
- Date: Tue, 11 Mar 2025 23:01:08 GMT
- Title: FP3: A 3D Foundation Policy for Robotic Manipulation
- Authors: Rujia Yang, Geng Chen, Chuan Wen, Yang Gao
- Abstract summary: We introduce FP3, a first large-scale 3D foundation policy model for robotic manipulation. With only 80 demonstrations, FP3 is able to learn a new task with over 90% success rates in novel environments with unseen objects.
- Score: 12.115347477632783
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Following their success in natural language processing and computer vision, foundation models that are pre-trained on large-scale multi-task datasets have also shown great potential in robotics. However, most existing robot foundation models rely solely on 2D image observations, ignoring 3D geometric information, which is essential for robots to perceive and reason about the 3D world. In this paper, we introduce FP3, a first large-scale 3D foundation policy model for robotic manipulation. FP3 builds on a scalable diffusion transformer architecture and is pre-trained on 60k trajectories with point cloud observations. With the model design and diverse pre-training data, FP3 can be efficiently fine-tuned for downstream tasks while exhibiting strong generalization capabilities. Experiments on real robots demonstrate that with only 80 demonstrations, FP3 is able to learn a new task with over 90% success rates in novel environments with unseen objects, significantly surpassing existing robot foundation models.
Related papers
- PointWorld: Scaling 3D World Models for In-The-Wild Robotic Manipulation [48.807071017228964]
We introduce PointWorld, a large pre-trained 3D world model that unifies state and action in a shared 3D space as 3D point flows. With a real-time (0.1s) inference speed, PointWorld can be efficiently integrated in the model-predictive control (MPC) framework for manipulation. We demonstrate that a single pre-trained checkpoint enables a real-world Franka robot to perform rigid-body pushing, deformable and articulated object manipulation.
arXiv Detail & Related papers (2026-01-07T10:29:12Z)
- Large Video Planner Enables Generalizable Robot Control [117.49024534548319]
General-purpose robots require decision-making models that generalize across diverse tasks and environments. Recent works build robot foundation models by extending multimodal large language models (LMs) with action outputs, creating vision-language-action (VLA) systems. We explore an alternative paradigm of using large-scale video pretraining as a primary modality for building robot foundation models.
arXiv Detail & Related papers (2025-12-17T18:35:54Z)
- SPEAR-1: Scaling Beyond Robot Demonstrations via 3D Understanding [78.12178144115224]
Robotic Foundation Models (RFMs) hold great promise as generalist, end-to-end systems for robot control. We propose to enrich easy-to-collect non-robotic image data with 3D annotations and enhance a pretrained VLM with 3D understanding capabilities. We introduce our main contribution, SPEAR-1: a robotic foundation model that integrates grounded 3D perception with language-instructed embodied control.
arXiv Detail & Related papers (2025-11-21T17:09:43Z)
- 4D Visual Pre-training for Robot Learning [71.22906081161324]
General visual representations learned from web-scale datasets for robotics have achieved great success in recent years. However, these pre-trained representations are mostly on 2D images, neglecting the inherent 3D nature of the world. We are seeking a general visual pre-training framework that could improve all 3D representations as an alternative. Our framework, called FVP, is a novel 4D Visual Pre-training framework for real-world robot learning.
arXiv Detail & Related papers (2025-08-24T07:06:56Z)
- 3DFlowAction: Learning Cross-Embodiment Manipulation from 3D Flow World Model [40.730112146035076]
A key reason is the lack of a large and uniform dataset for teaching robots manipulation skills. Current robot datasets often record robot action in different action spaces within a simple scene. We learn a 3D flow world model from both human and robot manipulation data.
arXiv Detail & Related papers (2025-06-06T16:00:31Z)
- VidBot: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos for Zero-Shot Robotic Manipulation [53.63540587160549]
VidBot is a framework enabling zero-shot robotic manipulation using learned 3D affordance from in-the-wild monocular RGB-only human videos. VidBot paves the way for leveraging everyday human videos to make robot learning more scalable.
arXiv Detail & Related papers (2025-03-10T10:04:58Z)
- Pre-training Auto-regressive Robotic Models with 4D Representations [43.80798244473759]
ARM4R is an Auto-regressive Robotic Model that leverages low-level 4D Representations learned from human video data to yield a better pre-trained robotic model. Our experiments show that ARM4R can transfer efficiently from human video data to robotics and consistently improves performance on tasks across various robot environments and configurations.
arXiv Detail & Related papers (2025-02-18T18:59:01Z)
- Lift3D Foundation Policy: Lifting 2D Large-Scale Pretrained Models for Robust 3D Robotic Manipulation [30.744137117668643]
Lift3D is a framework that enhances 2D foundation models with implicit and explicit 3D robotic representations to construct a robust 3D manipulation policy. In experiments, Lift3D consistently outperforms previous state-of-the-art methods across several simulation benchmarks and real-world scenarios.
arXiv Detail & Related papers (2024-11-27T18:59:52Z)
- $π_0$: A Vision-Language-Action Flow Model for General Robot Control [77.32743739202543]
We propose a novel flow matching architecture built on top of a pre-trained vision-language model (VLM) to inherit Internet-scale semantic knowledge.
We evaluate our model in terms of its ability to perform tasks in zero shot after pre-training, follow language instructions from people, and its ability to acquire new skills via fine-tuning.
arXiv Detail & Related papers (2024-10-31T17:22:30Z)
- Robot See Robot Do: Imitating Articulated Object Manipulation with Monocular 4D Reconstruction [51.49400490437258]
This work develops a method for imitating articulated object manipulation from a single monocular RGB human demonstration.
We first propose 4D Differentiable Part Models (4D-DPM), a method for recovering 3D part motion from a monocular video.
Given this 4D reconstruction, the robot replicates object trajectories by planning bimanual arm motions that induce the demonstrated object part motion.
We evaluate 4D-DPM's 3D tracking accuracy on ground truth annotated 3D part trajectories and RSRD's physical execution performance on 9 objects across 10 trials each on a bimanual YuMi robot.
arXiv Detail & Related papers (2024-09-26T17:57:16Z)
- Robo360: A 3D Omnispective Multi-Material Robotic Manipulation Dataset [26.845899347446807]
Recent interest in leveraging 3D algorithms has led to advancements in robot perception and physical understanding.
We present Robo360, a dataset that features robotic manipulation with a dense view coverage.
We hope that Robo360 can open new research directions yet to be explored at the intersection of understanding the physical world in 3D and robot control.
arXiv Detail & Related papers (2023-12-09T09:12:03Z)
- Transferring Foundation Models for Generalizable Robotic Manipulation [82.12754319808197]
We propose a novel paradigm that effectively leverages language-reasoning segmentation mask generated by internet-scale foundation models. Our approach can effectively and robustly perceive object pose and enable sample-efficient generalization learning. Demos can be found in our submitted video, and more comprehensive ones can be found in link1 or link2.
arXiv Detail & Related papers (2023-06-09T07:22:12Z)
- RT-1: Robotics Transformer for Real-World Control at Scale [98.09428483862165]
We present a model class, dubbed Robotics Transformer, that exhibits promising scalable model properties.
We verify our conclusions in a study of different model classes and their ability to generalize as a function of the data size, model size, and data diversity based on a large-scale data collection on real robots performing real-world tasks.
arXiv Detail & Related papers (2022-12-13T18:55:15Z)
- ExAug: Robot-Conditioned Navigation Policies via Geometric Experience Augmentation [73.63212031963843]
We propose a novel framework, ExAug, to augment the experiences of different robot platforms from multiple datasets in diverse environments.
The trained policy is evaluated on two new robot platforms with three different cameras in indoor and outdoor environments with obstacles.
arXiv Detail & Related papers (2022-10-14T01:32:15Z)