4D Visual Pre-training for Robot Learning
- URL: http://arxiv.org/abs/2508.17230v2
- Date: Sat, 06 Sep 2025 14:24:21 GMT
- Title: 4D Visual Pre-training for Robot Learning
- Authors: Chengkai Hou, Yanjie Ze, Yankai Fu, Zeyu Gao, Songbo Hu, Yue Yu, Shanghang Zhang, Huazhe Xu,
- Abstract summary: General visual representations learned from web-scale datasets for robotics have achieved great success in recent years. However, these pre-trained representations are mostly learned from 2D images, neglecting the inherent 3D nature of the world. As an alternative, we seek a general visual pre-training framework that could improve all 3D representations. Our framework, called FVP, is a novel 4D Visual Pre-training framework for real-world robot learning.
- Score: 71.22906081161324
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: General visual representations learned from web-scale datasets for robotics have achieved great success in recent years, enabling data-efficient robot learning on manipulation tasks; yet these pre-trained representations are mostly learned from 2D images, neglecting the inherent 3D nature of the world. However, due to the scarcity of large-scale 3D data, it is still hard to extract a universal 3D representation from web datasets. Instead, we seek a general visual pre-training framework that could improve all 3D representations as an alternative. Our framework, called FVP, is a novel 4D Visual Pre-training framework for real-world robot learning. FVP frames the visual pre-training objective as a next-point-cloud-prediction problem, models the prediction model as a diffusion model, and pre-trains the model directly on large public datasets. Across twelve real-world manipulation tasks, FVP boosts the average success rate of 3D Diffusion Policy (DP3) by 28%. The FVP pre-trained DP3 achieves state-of-the-art performance among imitation learning methods. Moreover, FVP's efficacy holds across various point cloud encoders and datasets. Finally, we apply FVP to RDT-1B, a larger Vision-Language-Action robotic model, enhancing its performance on various robot tasks. Our project page is available at: https://4d-visual-pretraining.github.io/
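The abstract frames pre-training as predicting the next point cloud with a diffusion model conditioned on the current observation. The sketch below illustrates what such a noise-prediction objective could look like; the encoder, denoiser, noise schedule, and tensor shapes are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a next-point-cloud-prediction diffusion objective.
# All module names, shapes, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class PointEncoder(nn.Module):
    """Placeholder per-point MLP + max-pool encoder (stand-in for e.g. a DP3-style encoder)."""
    def __init__(self, dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, dim))

    def forward(self, pc):                      # pc: (B, N, 3)
        return self.mlp(pc).max(dim=1).values   # (B, dim) global feature

class DenoisingNet(nn.Module):
    """Predicts the noise added to the future point cloud, conditioned on the current frame's feature."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3 + dim + 1, 256), nn.ReLU(), nn.Linear(256, 3))

    def forward(self, noisy_next_pc, cond, t):  # (B, N, 3), (B, dim), (B,)
        B, N, _ = noisy_next_pc.shape
        cond = cond[:, None, :].expand(B, N, -1)
        t = t[:, None, None].expand(B, N, 1).float()
        return self.net(torch.cat([noisy_next_pc, cond, t], dim=-1))

def next_point_cloud_diffusion_loss(encoder, denoiser, pc_t, pc_t1, T=1000):
    """Noise-prediction loss on the next point cloud pc_t1, conditioned on the current one pc_t."""
    B = pc_t.shape[0]
    t = torch.randint(0, T, (B,))
    alpha_bar = torch.cos(t.float() / T * torch.pi / 2) ** 2      # simple cosine schedule (placeholder)
    noise = torch.randn_like(pc_t1)
    noisy = alpha_bar.sqrt()[:, None, None] * pc_t1 + (1 - alpha_bar).sqrt()[:, None, None] * noise
    pred = denoiser(noisy, encoder(pc_t), t)
    return nn.functional.mse_loss(pred, noise)   # gradients also flow into the encoder being pre-trained
```

Under this reading, the pre-training loss shapes the point cloud encoder, which can then be plugged into a downstream policy such as DP3; the MLP denoiser and cosine schedule above are simplified placeholders.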
Related papers
- CLAMP: Contrastive Learning for 3D Multi-View Action-Conditioned Robotic Manipulation Pretraining [4.039082584778385]
We introduce Contrastive Learning for 3D Multi-View Action-Conditioned Robotic Manipulation Pretraining (CLAMP). From the merged point cloud computed from RGB-D images and camera extrinsics, we re-render multi-view four-channel image observations with depth and 3D coordinates. The pre-trained encoders learn to associate the 3D geometric and positional information of objects with robot action patterns.
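The summary describes encoders that tie 3D geometry to robot action patterns via contrastive learning. Below is a minimal, hedged sketch of a symmetric InfoNCE loss over paired observation and action embeddings; the embedding networks, temperature, and pairing scheme are assumptions, not CLAMP's actual formulation.

```python
# Hedged sketch of an InfoNCE-style objective pairing observation and action embeddings.
import torch
import torch.nn.functional as F

def info_nce(obs_emb, act_emb, temperature=0.07):
    """obs_emb, act_emb: (B, D) embeddings of matched observation/action pairs."""
    obs = F.normalize(obs_emb, dim=-1)
    act = F.normalize(act_emb, dim=-1)
    logits = obs @ act.t() / temperature          # (B, B) similarity matrix
    labels = torch.arange(obs.shape[0], device=obs.device)
    # symmetric contrastive loss: each observation should match its own action and vice versa
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```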
arXiv Detail & Related papers (2026-01-31T23:32:54Z) - FP3: A 3D Foundation Policy for Robotic Manipulation [12.115347477632783]
We introduce FP3, the first large-scale 3D foundation policy model for robotic manipulation. With only 80 demonstrations, FP3 is able to learn a new task with success rates above 90% in novel environments with unseen objects.
arXiv Detail & Related papers (2025-03-11T23:01:08Z) - Pre-training Auto-regressive Robotic Models with 4D Representations [43.80798244473759]
ARM4R is an Auto-regressive Robotic Model that leverages low-level 4D Representations learned from human video data to yield a better pre-trained robotic model. Our experiments show that ARM4R can transfer efficiently from human video data to robotics and consistently improves performance on tasks across various robot environments and configurations.
arXiv Detail & Related papers (2025-02-18T18:59:01Z) - 3D-MVP: 3D Multiview Pretraining for Robotic Manipulation [53.45111493465405]
We propose 3D-MVP, a novel approach for 3D Multi-View Pretraining using masked autoencoders. We leverage the Robotic View Transformer (RVT), which uses a multi-view transformer to understand the 3D scene and predict pose actions.
arXiv Detail & Related papers (2024-06-26T08:17:59Z) - PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm [111.16358607889609]
We introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representations. For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks, demonstrating its effectiveness.
arXiv Detail & Related papers (2023-10-12T17:59:57Z) - Video Pretraining Advances 3D Deep Learning on Chest CT Tasks [63.879848037679224]
Pretraining on large natural image classification datasets has aided model development on data-scarce 2D medical tasks.
These 2D models have been surpassed by 3D models on 3D computer vision benchmarks.
We show video pretraining for 3D models can enable higher performance on smaller datasets for 3D medical tasks.
arXiv Detail & Related papers (2023-04-02T14:46:58Z) - T3VIP: Transformation-based 3D Video Prediction [49.178585201673364]
We propose a 3D video prediction (T3VIP) approach that explicitly models the 3D motion by decomposing a scene into its object parts.
Our model is fully unsupervised and captures the nature of the real world; the observational cues in the image and point cloud domains constitute its learning signals.
To the best of our knowledge, our model is the first generative model that provides an RGB-D video prediction of the future for a static camera.
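T3VIP is summarized as decomposing a scene into object parts and explicitly modeling their 3D motion. The sketch below shows one way such transformation-based prediction can be expressed: warping the current point cloud with predicted per-part rigid motions. The soft part assignments and motion parameterization are assumptions for illustration, not T3VIP's implementation.

```python
# Hedged sketch of transformation-based point cloud prediction: a network would output
# per-part rigid motions that warp the current cloud to the next frame.
import torch

def warp_by_part_transforms(pc, part_weights, rotations, translations):
    """
    pc:            (B, N, 3) current point cloud
    part_weights:  (B, N, K) soft assignment of each point to K object parts (rows sum to 1)
    rotations:     (B, K, 3, 3) predicted per-part rotation matrices
    translations:  (B, K, 3)    predicted per-part translations
    returns:       (B, N, 3) predicted next point cloud
    """
    # apply every part's rigid motion to every point: (B, K, N, 3)
    moved = torch.einsum('bkij,bnj->bkni', rotations, pc) + translations[:, :, None, :]
    # blend the per-part predictions with the soft assignments
    return torch.einsum('bnk,bkni->bni', part_weights, moved)
```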
arXiv Detail & Related papers (2022-09-19T15:01:09Z) - CRAVES: Controlling Robotic Arm with a Vision-based Economic System [96.56564257199474]
Training a robotic arm to accomplish real-world tasks has been attracting increasing attention in both academia and industry. This work discusses the role of computer vision algorithms in this field. We present an alternative solution, which uses a 3D model to create a large amount of synthetic data.
arXiv Detail & Related papers (2018-12-03T13:28:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.