VERM: Leveraging Foundation Models to Create a Virtual Eye for Efficient 3D Robotic Manipulation
- URL: http://arxiv.org/abs/2512.16724v1
- Date: Thu, 18 Dec 2025 16:26:17 GMT
- Title: VERM: Leveraging Foundation Models to Create a Virtual Eye for Efficient 3D Robotic Manipulation
- Authors: Yixiang Chen, Yan Huang, Keji He, Peiyan Li, Liang Wang,
- Abstract summary: A multi-camera setup increases computational costs and forces the model to spend extra training time extracting task-relevant details. We propose the VERM (Virtual Eye for Robotic Manipulation) method, which imagines a virtual task-adaptive view from the constructed 3D point cloud. To facilitate 3D action planning and fine-grained manipulation, we further design a depth-aware module and a dynamic coarse-to-fine procedure.
- Score: 9.95654157461894
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: When performing 3D manipulation tasks, robots must plan actions based on observations from multiple fixed cameras. This multi-camera setup introduces substantial redundancy and irrelevant information, which increases computational costs and forces the model to spend extra training time extracting crucial task-relevant details. To filter out redundant information and accurately extract task-relevant features, we propose VERM (Virtual Eye for Robotic Manipulation), which leverages the knowledge in foundation models to imagine a virtual, task-adaptive view from the constructed 3D point cloud; this view efficiently captures the necessary information and mitigates occlusion. To facilitate 3D action planning and fine-grained manipulation, we further design a depth-aware module and a dynamic coarse-to-fine procedure. Extensive experiments on the RLBench simulation benchmark and in real-world evaluations demonstrate the effectiveness of our method, which surpasses previous state-of-the-art methods while achieving a 1.89x speedup in training time and a 1.54x speedup in inference. More results can be found on our project website at https://verm-ral.github.io .
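For a concrete picture of the pipeline the abstract sketches, the snippet below is a minimal NumPy illustration (not the authors' implementation) of the two generic steps it builds on: back-projecting multi-camera RGB-D observations into a shared world-frame point cloud, and re-projecting that cloud through a chosen virtual camera to obtain an RGB image and a depth map. The function names, the z-buffer splatting, and the resolutions are assumptions made for illustration; VERM's actual contributions, selecting the task-adaptive virtual viewpoint with foundation-model knowledge and the depth-aware, coarse-to-fine action planning, are not reproduced here.

```python
import numpy as np

def backproject_depth(depth, rgb, K, T_cam_to_world):
    """Lift one RGB-D frame to a colored point cloud in world coordinates.

    depth: (H, W) metric depth, rgb: (H, W, 3), K: (3, 3) intrinsics,
    T_cam_to_world: (4, 4) camera-to-world extrinsics.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.reshape(-1)
    valid = z > 0  # keep only pixels with a valid depth reading
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)[valid]
    pts_world = (T_cam_to_world @ pts_cam.T).T[:, :3]
    return pts_world, rgb.reshape(-1, 3)[valid]

def render_virtual_view(points, colors, K_virt, T_world_to_virt, hw=(128, 128)):
    """Project a merged point cloud through a hypothetical 'virtual eye' camera,
    resolving occlusion with a simple z-buffer splat."""
    h, w = hw
    homog = np.hstack([points, np.ones((points.shape[0], 1))])
    pts_cam = (T_world_to_virt @ homog.T).T[:, :3]
    front = pts_cam[:, 2] > 1e-6  # discard points behind the virtual camera
    pts_cam, cols = pts_cam[front], colors[front]
    uv = (K_virt @ pts_cam.T).T
    u = np.round(uv[:, 0] / uv[:, 2]).astype(int)
    v = np.round(uv[:, 1] / uv[:, 2]).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v, z, cols = u[inside], v[inside], pts_cam[inside, 2], cols[inside]
    order = np.argsort(-z)  # draw far points first so nearer points overwrite them
    rgb = np.zeros((h, w, 3), dtype=colors.dtype)
    depth = np.zeros((h, w))
    rgb[v[order], u[order]] = cols[order]
    depth[v[order], u[order]] = z[order]
    return rgb, depth

if __name__ == "__main__":
    # Tiny synthetic check: one flat 64x64 depth image seen from an identity camera.
    K = np.array([[50.0, 0, 32], [0, 50.0, 32], [0, 0, 1]])
    depth = np.full((64, 64), 1.0)
    rgb = np.full((64, 64, 3), 127, dtype=np.uint8)
    pts, cols = backproject_depth(depth, rgb, K, np.eye(4))
    # Re-render from a hypothetical virtual eye shifted slightly off the original pose.
    T_virt = np.eye(4)
    T_virt[1, 3] = -0.1
    rgb_v, depth_v = render_virtual_view(pts, cols, K, T_virt, hw=(64, 64))
    print(rgb_v.shape, float(depth_v.max()))
```

A point-based splat like this leaves holes wherever the merged cloud is sparse; a practical renderer would need densification or learned in-painting, which is beyond this sketch.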
Related papers
- CLAMP: Contrastive Learning for 3D Multi-View Action-Conditioned Robotic Manipulation Pretraining [4.039082584778385]
We introduce Contrastive Learning for 3D Multi-View Action-Conditioned Robotic Manipulation Pretraining (CLAMP). From the merged point cloud computed from RGB-D images and camera extrinsics, we re-render multi-view four-channel image observations with depth and 3D coordinates. The pre-trained encoders learn to associate the 3D geometric and positional information of objects with robot action patterns.
arXiv Detail & Related papers (2026-01-31T23:32:54Z)
- Abstract 3D Perception for Spatial Intelligence in Vision-Language Models [100.13033631690114]
Vision-language models (VLMs) struggle with 3D-related tasks such as spatial cognition and physical understanding. We introduce SandboxVLM, a framework that leverages abstract bounding boxes to encode geometric structure and physical kinematics for VLMs. Our approach consistently improves spatial intelligence, achieving an 8.3% gain on SAT Real compared with baseline methods.
arXiv Detail & Related papers (2025-11-14T04:16:09Z)
- EfficientDepth: A Fast and Detail-Preserving Monocular Depth Estimation Model [1.4525559282354221]
We introduce a novel MDE system, called EfficientDepth, which combines a transformer architecture with a lightweight convolutional decoder. We train our model on a combination of labeled synthetic and real images, as well as pseudo-labeled real images, generated using a high-performing MDE method. In addition to commonly used objectives, we introduce a loss function based on LPIPS to encourage the network to produce detailed depth maps.
arXiv Detail & Related papers (2025-09-26T16:05:43Z)
- Manipulation as in Simulation: Enabling Accurate Geometry Perception in Robots [55.43376513158555]
Camera Depth Models (CDMs) are a simple plugin for daily-use depth cameras. We develop a neural data engine that generates high-quality paired data from simulation by modeling a depth camera's noise pattern. Our experiments demonstrate, for the first time, that a policy trained on raw simulated depth, without the need for adding noise or real-world fine-tuning, generalizes seamlessly to real-world robots.
arXiv Detail & Related papers (2025-09-02T17:29:38Z)
- CL3R: 3D Reconstruction and Contrastive Learning for Enhanced Robotic Manipulation Representations [19.71090711790973]
We propose a novel 3D pre-training framework designed to enhance robotic manipulation policies. Our method integrates both spatial awareness and semantic understanding by employing a point cloud Masked Autoencoder. We mitigate camera view ambiguity and improve generalization, enabling robust perception from novel viewpoints at test time.
arXiv Detail & Related papers (2025-07-11T02:16:32Z)
- EmbodiedMAE: A Unified 3D Multi-Modal Representation for Robot Manipulation [44.08442553098017]
EmbodiedMAE is a unified 3D representation for robot manipulation. EmbodiedMAE consistently outperforms state-of-the-art vision foundation models.
arXiv Detail & Related papers (2025-05-15T09:12:17Z)
- Perceiving, Reasoning, Adapting: A Dual-Layer Framework for VLM-Guided Precision Robotic Manipulation [2.434849352801735]
Vision-Language Models (VLMs) demonstrate remarkable potential in robotic manipulation, but challenges persist in executing complex fine manipulation tasks with high speed and precision. We introduce a progressive VLM planning algorithm that empowers robots to perform fast, precise, and error-correctable fine manipulation.
arXiv Detail & Related papers (2025-03-07T00:55:42Z)
- 3D-Grounded Vision-Language Framework for Robotic Task Planning: Automated Prompt Synthesis and Supervised Reasoning [2.6670748466660523]
Vision-language models (VLMs) have achieved remarkable success in scene understanding and perception tasks. However, VLMs lack robust 3D scene localization capabilities, limiting their effectiveness in fine-grained robotic operations. We propose a novel framework that integrates a 2D prompt synthesis module by mapping 2D images to point clouds, and incorporates a small language model (SLM) for supervising VLM outputs.
arXiv Detail & Related papers (2025-02-13T02:40:19Z)
- LLMI3D: MLLM-based 3D Perception from a Single 2D Image [77.13869413871028]
Multimodal large language models (MLLMs) excel in general capacity but underperform in 3D tasks. In this paper, we propose solutions for weak 3D local spatial object perception, poor text-based geometric numerical output, and inability to handle camera focal variations. We employ parameter-efficient fine-tuning for a pre-trained MLLM and develop LLMI3D, a powerful 3D perception MLLM.
arXiv Detail & Related papers (2024-08-14T10:00:16Z)
- SUGAR: Pre-training 3D Visual Representations for Robotics [85.55534363501131]
We introduce a novel 3D pre-training framework for robotics named SUGAR.
SUGAR captures semantic, geometric and affordance properties of objects through 3D point clouds.
We show that SUGAR's 3D representation outperforms state-of-the-art 2D and 3D representations.
arXiv Detail & Related papers (2024-04-01T21:23:03Z)
- Simple and Effective Synthesis of Indoor 3D Scenes [78.95697556834536]
We study the problem of synthesizing immersive 3D indoor scenes from one or more images.
Our aim is to generate high-resolution images and videos from novel viewpoints.
We propose an image-to-image GAN that maps directly from reprojections of incomplete point clouds to full high-resolution RGB-D images.
arXiv Detail & Related papers (2022-04-06T17:54:46Z)
- RandomRooms: Unsupervised Pre-training from Synthetic Shapes and Randomized Layouts for 3D Object Detection [138.2892824662943]
A promising solution is to make better use of the synthetic dataset, which consists of CAD object models, to boost the learning on real datasets.
Recent work on 3D pre-training fails when transferring features learned on synthetic objects to other real-world applications.
In this work, we put forward a new method called RandomRooms to accomplish this objective.
arXiv Detail & Related papers (2021-08-17T17:56:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.