VERM: Leveraging Foundation Models to Create a Virtual Eye for Efficient 3D Robotic Manipulation
- URL: http://arxiv.org/abs/2512.16724v1
- Date: Thu, 18 Dec 2025 16:26:17 GMT
- Title: VERM: Leveraging Foundation Models to Create a Virtual Eye for Efficient 3D Robotic Manipulation
- Authors: Yixiang Chen, Yan Huang, Keji He, Peiyan Li, Liang Wang,
- Abstract summary: A multi-camera setup increases computational costs and forces the model to spend extra training time extracting task-relevant details. We propose the VERM (Virtual Eye for Robotic Manipulation) method, which imagines a virtual task-adaptive view from the constructed 3D point cloud. To facilitate 3D action planning and fine-grained manipulation, we further design a depth-aware module and a dynamic coarse-to-fine procedure.
- Score: 9.95654157461894
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: When performing 3D manipulation tasks, robots must plan actions based on observations from multiple fixed cameras. This multi-camera setup introduces substantial redundancy and irrelevant information, which increases computational costs and forces the model to spend extra training time extracting crucial task-relevant details. To filter out redundant information and accurately extract task-relevant features, we propose VERM (Virtual Eye for Robotic Manipulation), which leverages the knowledge in foundation models to imagine a virtual, task-adaptive view from the constructed 3D point cloud; this view efficiently captures the necessary information and mitigates occlusion. To facilitate 3D action planning and fine-grained manipulation, we further design a depth-aware module and a dynamic coarse-to-fine procedure. Extensive experiments on the RLBench simulation benchmark and in real-world evaluations demonstrate the effectiveness of our method, which surpasses previous state-of-the-art methods while achieving a 1.89x speedup in training time and a 1.54x speedup in inference. More results can be found on our project website at https://verm-ral.github.io .
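For a concrete picture of the pipeline the abstract sketches, the snippet below is a minimal NumPy illustration (not the authors' implementation) of the two generic steps it builds on: back-projecting multi-camera RGB-D observations into a shared world-frame point cloud, and re-projecting that cloud through a chosen virtual camera to obtain an RGB image and a depth map. The function names, the z-buffer splatting, and the resolutions are assumptions made for illustration; VERM's actual contributions, selecting the task-adaptive virtual viewpoint with foundation-model knowledge and the depth-aware, coarse-to-fine action planning, are not reproduced here.

```python
import numpy as np

def backproject_depth(depth, rgb, K, T_cam_to_world):
    """Lift one RGB-D frame to a colored point cloud in world coordinates.

    depth: (H, W) metric depth, rgb: (H, W, 3), K: (3, 3) intrinsics,
    T_cam_to_world: (4, 4) camera-to-world extrinsics.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.reshape(-1)
    valid = z > 0  # keep only pixels with a valid depth reading
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)[valid]
    pts_world = (T_cam_to_world @ pts_cam.T).T[:, :3]
    return pts_world, rgb.reshape(-1, 3)[valid]

def render_virtual_view(points, colors, K_virt, T_world_to_virt, hw=(128, 128)):
    """Project a merged point cloud through a hypothetical 'virtual eye' camera,
    resolving occlusion with a simple z-buffer splat."""
    h, w = hw
    homog = np.hstack([points, np.ones((points.shape[0], 1))])
    pts_cam = (T_world_to_virt @ homog.T).T[:, :3]
    front = pts_cam[:, 2] > 1e-6  # discard points behind the virtual camera
    pts_cam, cols = pts_cam[front], colors[front]
    uv = (K_virt @ pts_cam.T).T
    u = np.round(uv[:, 0] / uv[:, 2]).astype(int)
    v = np.round(uv[:, 1] / uv[:, 2]).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v, z, cols = u[inside], v[inside], pts_cam[inside, 2], cols[inside]
    order = np.argsort(-z)  # draw far points first so nearer points overwrite them
    rgb = np.zeros((h, w, 3), dtype=colors.dtype)
    depth = np.zeros((h, w))
    rgb[v[order], u[order]] = cols[order]
    depth[v[order], u[order]] = z[order]
    return rgb, depth

if __name__ == "__main__":
    # Tiny synthetic check: one flat 64x64 depth image seen from an identity camera.
    K = np.array([[50.0, 0, 32], [0, 50.0, 32], [0, 0, 1]])
    depth = np.full((64, 64), 1.0)
    rgb = np.full((64, 64, 3), 127, dtype=np.uint8)
    pts, cols = backproject_depth(depth, rgb, K, np.eye(4))
    # Re-render from a hypothetical virtual eye shifted slightly off the original pose.
    T_virt = np.eye(4)
    T_virt[1, 3] = -0.1
    rgb_v, depth_v = render_virtual_view(pts, cols, K, T_virt, hw=(64, 64))
    print(rgb_v.shape, float(depth_v.max()))
```

A point-based splat like this leaves holes wherever the merged cloud is sparse; a practical renderer would need densification or learned in-painting, which is beyond this sketch.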
Related papers
- CLAMP: Contrastive Learning for 3D Multi-View Action-Conditioned Robotic Manipulation Pretraining [4.039082584778385]
We introduce Contrastive Learning for 3D Multi-View Action-Conditioned Robotic Manipulation Pretraining (CLAMP). From the merged point cloud computed from RGB-D images and camera extrinsics, we re-render multi-view four-channel image observations with depth and 3D coordinates. The pre-trained encoders learn to associate the 3D geometric and positional information of objects with robot action patterns.
arXiv Detail & Related papers (2026-01-31T23:32:54Z)
- Abstract 3D Perception for Spatial Intelligence in Vision-Language Models [100.13033631690114]
Vision-language models (VLMs) struggle with 3D-related tasks such as spatial cognition and physical understanding. We introduce SandboxVLM, a framework that leverages abstract bounding boxes to encode geometric structure and physical kinematics for VLMs. Our approach consistently improves spatial intelligence, achieving an 8.3% gain on SAT Real compared with baseline methods.
arXiv Detail & Related papers (2025-11-14T04:16:09Z)
- EfficientDepth: A Fast and Detail-Preserving Monocular Depth Estimation Model [1.4525559282354221]
We introduce a novel MDE system, called EfficientDepth, which combines a transformer architecture with a lightweight convolutional decoder. We train our model on a combination of labeled synthetic and real images, as well as pseudo-labeled real images, generated using a high-performing MDE method. In addition to commonly used objectives, we introduce a loss function based on LPIPS to encourage the network to produce detailed depth maps.
arXiv Detail & Related papers (2025-09-26T16:05:43Z)
- Manipulation as in Simulation: Enabling Accurate Geometry Perception in Robots [55.43376513158555]
Camera Depth Models (CDMs) are a simple plugin for daily-use depth cameras. We develop a neural data engine that generates high-quality paired data from simulation by modeling a depth camera's noise pattern. Our experiments demonstrate, for the first time, that a policy trained on raw simulated depth, without the need for adding noise or real-world fine-tuning, generalizes seamlessly to real-world robots.
arXiv Detail & Related papers (2025-09-02T17:29:38Z)
- CL3R: 3D Reconstruction and Contrastive Learning for Enhanced Robotic Manipulation Representations [19.71090711790973]
We propose a novel 3D pre-training framework designed to enhance robotic manipulation policies. Our method integrates both spatial awareness and semantic understanding by employing a point cloud Masked Autoencoder. We mitigate camera view ambiguity and improve generalization, enabling robust perception from novel viewpoints at test time.
arXiv Detail & Related papers (2025-07-11T02:16:32Z)
- EmbodiedMAE: A Unified 3D Multi-Modal Representation for Robot Manipulation [44.08442553098017]
EmbodiedMAE is a unified 3D representation for robot manipulation. EmbodiedMAE consistently outperforms state-of-the-art vision foundation models.
arXiv Detail & Related papers (2025-05-15T09:12:17Z)
- Perceiving, Reasoning, Adapting: A Dual-Layer Framework for VLM-Guided Precision Robotic Manipulation [2.434849352801735]
Vision-Language Models (VLMs) demonstrate remarkable potential in robotic manipulation, but challenges persist in executing complex fine manipulation tasks with high speed and precision. We introduce a progressive VLM planning algorithm that empowers robots to perform fast, precise, and error-correctable fine manipulation.
arXiv Detail & Related papers (2025-03-07T00:55:42Z)
- 3D-Grounded Vision-Language Framework for Robotic Task Planning: Automated Prompt Synthesis and Supervised Reasoning [2.6670748466660523]
Vision-language models (VLMs) have achieved remarkable success in scene understanding and perception tasks. However, VLMs lack robust 3D scene localization capabilities, limiting their effectiveness in fine-grained robotic operations. We propose a novel framework that integrates a 2D prompt synthesis module by mapping 2D images to point clouds, and incorporates a small language model (SLM) for supervising VLM outputs.
arXiv Detail & Related papers (2025-02-13T02:40:19Z)
- LLMI3D: MLLM-based 3D Perception from a Single 2D Image [77.13869413871028]
Multimodal large language models (MLLMs) excel in general capacity but underperform in 3D tasks. In this paper, we propose solutions for weak 3D local spatial object perception, poor text-based geometric numerical output, and inability to handle camera focal variations. We employ parameter-efficient fine-tuning for a pre-trained MLLM and develop LLMI3D, a powerful 3D perception MLLM.
arXiv Detail & Related papers (2024-08-14T10:00:16Z)
- SUGAR: Pre-training 3D Visual Representations for Robotics [85.55534363501131]
We introduce a novel 3D pre-training framework for robotics named SUGAR.
SUGAR captures semantic, geometric and affordance properties of objects through 3D point clouds.
We show that SUGAR's 3D representation outperforms state-of-the-art 2D and 3D representations.
arXiv Detail & Related papers (2024-04-01T21:23:03Z)
- Simple and Effective Synthesis of Indoor 3D Scenes [78.95697556834536]
We study the problem of synthesizing immersive 3D indoor scenes from one or more images.
Our aim is to generate high-resolution images and videos from novel viewpoints.
We propose an image-to-image GAN that maps directly from reprojections of incomplete point clouds to full high-resolution RGB-D images.
arXiv Detail & Related papers (2022-04-06T17:54:46Z)
- RandomRooms: Unsupervised Pre-training from Synthetic Shapes and Randomized Layouts for 3D Object Detection [138.2892824662943]
A promising solution is to make better use of the synthetic dataset, which consists of CAD object models, to boost the learning on real datasets.
Recent work on 3D pre-training fails when transferring features learned on synthetic objects to other real-world applications.
In this work, we put forward a new method called RandomRooms to accomplish this objective.
arXiv Detail & Related papers (2021-08-17T17:56:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.