CLAMP: Contrastive Learning for 3D Multi-View Action-Conditioned Robotic Manipulation Pretraining
- URL: http://arxiv.org/abs/2602.00937v1
- Date: Sat, 31 Jan 2026 23:32:54 GMT
- Title: CLAMP: Contrastive Learning for 3D Multi-View Action-Conditioned Robotic Manipulation Pretraining
- Authors: I-Chun Arthur Liu, Krzysztof Choromanski, Sandy Huang, Connor Schenck,
- Abstract summary: We introduce Contrastive Learning for 3D Multi-View Action-Conditioned Robotic Manipulation Pretraining (CLAMP). From the merged point cloud computed from RGB-D images and camera extrinsics, we re-render multi-view four-channel image observations with depth and 3D coordinates. The pre-trained encoders learn to associate the 3D geometric and positional information of objects with robot action patterns.
- Score: 4.039082584778385
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Leveraging pre-trained 2D image representations in behavior cloning policies has achieved great success and has become a standard approach for robotic manipulation. However, such representations fail to capture the 3D spatial information about objects and scenes that is essential for precise manipulation. In this work, we introduce Contrastive Learning for 3D Multi-View Action-Conditioned Robotic Manipulation Pretraining (CLAMP), a novel 3D pre-training framework that utilizes point clouds and robot actions. From the merged point cloud computed from RGB-D images and camera extrinsics, we re-render multi-view four-channel image observations with depth and 3D coordinates, including dynamic wrist views, to provide clearer views of target objects for high-precision manipulation tasks. The pre-trained encoders learn to associate the 3D geometric and positional information of objects with robot action patterns via contrastive learning on large-scale simulated robot trajectories. Alongside encoder pre-training, we pre-train a Diffusion Policy to initialize the policy weights for fine-tuning, which is essential for improving fine-tuning sample efficiency and performance. After pre-training, we fine-tune the policy on a limited number of task demonstrations using the learned image and action representations. We demonstrate that this pre-training and fine-tuning design substantially improves learning efficiency and policy performance on unseen tasks. Furthermore, we show that CLAMP outperforms state-of-the-art baselines across six simulated tasks and five real-world tasks.
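The two core mechanisms in the abstract, merging RGB-D views into a world-frame point cloud via camera extrinsics and contrastively aligning observation embeddings with action embeddings, can be illustrated with a short sketch. Everything below is a minimal illustration under assumed details (pinhole intrinsics, an InfoNCE-style symmetric loss, batch-aligned observation/action pairs); the function names are hypothetical and this is not the released CLAMP implementation.

```python
# Minimal sketch of CLAMP-style data prep and a contrastive objective.
# All names and the exact loss form are assumptions, not the authors' API.
import numpy as np
import torch
import torch.nn.functional as F

def backproject(depth, K, T_world_cam):
    """Lift a depth map (H, W) to world-frame 3D points using pinhole
    intrinsics K (3x3) and camera-to-world extrinsics T_world_cam (4x4)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    x = (u - K[0, 2]) * depth / K[0, 0]
    y = (v - K[1, 2]) * depth / K[1, 1]
    pts = np.stack([x, y, depth, np.ones_like(depth)], -1).reshape(-1, 4)
    return (T_world_cam @ pts.T).T[:, :3]

def merge_point_clouds(depths, Ks, extrinsics):
    """Merged point cloud from all RGB-D cameras, as in the abstract;
    re-rendering four-channel views (depth + xyz) would start from this."""
    return np.concatenate(
        [backproject(d, K, T) for d, K, T in zip(depths, Ks, extrinsics)])

def info_nce(obs_emb, act_emb, temperature=0.1):
    """Symmetric InfoNCE aligning per-trajectory observation and action
    embeddings: one plausible form of the contrastive objective."""
    obs = F.normalize(obs_emb, dim=-1)
    act = F.normalize(act_emb, dim=-1)
    logits = obs @ act.T / temperature                  # (B, B) similarities
    labels = torch.arange(len(obs), device=obs.device)  # diagonal positives
    return (F.cross_entropy(logits, labels)
            + F.cross_entropy(logits.T, labels)) / 2
```

Per the abstract, fine-tuning would then initialize a Diffusion Policy from the pre-trained weights and train it on the downstream task demonstrations.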
Related papers
- VERM: Leveraging Foundation Models to Create a Virtual Eye for Efficient 3D Robotic Manipulation [9.95654157461894]
Multi-camera setups increase computational costs and force the model to spend extra training time extracting task-relevant details. We propose VERM (Virtual Eye for Robotic Manipulation), which imagines a virtual, task-adaptive view from a constructed 3D point cloud. To facilitate 3D action planning and fine-grained manipulation, we further design a depth-aware module and a dynamic coarse-to-fine procedure.
arXiv Detail & Related papers (2025-12-18T16:26:17Z)
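The "virtual eye" idea above, rendering a task-adaptive view from a constructed point cloud, reduces to projecting world-frame points through a chosen virtual camera. A rough sketch under assumed conventions (pinhole model, naive z-buffer splatting; the names are hypothetical, not the VERM codebase):

```python
# Illustrative point-cloud splat into a hypothetical virtual camera.
import numpy as np

def render_virtual_view(points, colors, K, T_cam_world, hw=(128, 128)):
    """Project (N, 3) world points with (N, 3) colors into an RGB image
    and a depth map via a simple painter's-algorithm z-buffer."""
    H, W = hw
    pts_h = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    pts_cam = (T_cam_world @ pts_h.T).T[:, :3]
    front = pts_cam[:, 2] > 1e-6               # keep points ahead of camera
    pts_cam, colors = pts_cam[front], colors[front]
    uv = (K @ pts_cam.T).T
    u = (uv[:, 0] / uv[:, 2]).astype(int)
    v = (uv[:, 1] / uv[:, 2]).astype(int)
    ok = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    u, v, z, c = u[ok], v[ok], pts_cam[ok, 2], colors[ok]
    image = np.zeros((H, W, 3))
    depth = np.full((H, W), np.inf)
    for i in np.argsort(-z):                   # far-to-near: near overwrites
        image[v[i], u[i]] = c[i]
        depth[v[i], u[i]] = z[i]
    return image, depth
```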
- DynaRend: Learning 3D Dynamics via Masked Future Rendering for Robotic Manipulation [52.136378691610524]
We present DynaRend, a representation learning framework that learns 3D-aware and dynamics-informed triplane features. By pretraining on multi-view RGB-D video data, DynaRend jointly captures spatial geometry, future dynamics, and task semantics in a unified triplane representation. We evaluate DynaRend on two challenging benchmarks, RLBench and Colosseum, demonstrating substantial improvements in policy success rate, generalization to environmental perturbations, and real-world applicability across diverse manipulation tasks.
arXiv Detail & Related papers (2025-10-28T10:17:11Z)
- 4D Visual Pre-training for Robot Learning [71.22906081161324]
General visual representations learned from web-scale datasets have achieved great success in robotics in recent years. However, these pre-trained representations are mostly learned from 2D images, neglecting the inherent 3D nature of the world. As an alternative, we seek a general visual pre-training framework that could improve all 3D representations. Our framework, called FVP, is a novel 4D Visual Pre-training framework for real-world robot learning.
arXiv Detail & Related papers (2025-08-24T07:06:56Z)
- CL3R: 3D Reconstruction and Contrastive Learning for Enhanced Robotic Manipulation Representations [19.71090711790973]
We propose a novel 3D pre-training framework designed to enhance robotic manipulation policies. Our method integrates both spatial awareness and semantic understanding by employing a point cloud Masked Autoencoder. We mitigate camera view ambiguity and improve generalization, enabling robust perception from novel viewpoints at test time.
arXiv Detail & Related papers (2025-07-11T02:16:32Z)
- UniPre3D: Unified Pre-training of 3D Point Cloud Models with Cross-Modal Gaussian Splatting [64.31900521467362]
No existing pre-training method is equally effective for both object- and scene-level point clouds. We introduce UniPre3D, the first unified pre-training method that can be seamlessly applied to point clouds of any scale and 3D models of any architecture.
arXiv Detail & Related papers (2025-06-11T17:23:21Z)
- Object-centric 3D Motion Field for Robot Learning from Human Videos [56.9436352861611]
We propose to use an object-centric 3D motion field to represent actions for robot learning from human videos. We present a novel framework for extracting this representation from videos for zero-shot control. Experiments show that our method reduces 3D motion estimation error by over 50% compared to the latest prior method.
arXiv Detail & Related papers (2025-06-04T17:59:06Z)
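As a concrete reading of the motion-field representation above: given tracked 3D points on the manipulated object at consecutive frames, the field is just per-point displacement, and a rigid fit turns it into an action-like pose delta. The sketch below is an assumed simplification with hypothetical names, not the paper's extraction framework:

```python
# Toy object-centric 3D motion field and a rigid-motion readout.
import numpy as np

def motion_field(points_t, points_t1, object_mask):
    """Per-point 3D displacement between frames, zeroed off-object to keep
    the field object-centric. Inputs: (N, 3) correspondences, (N,) bool."""
    flow = points_t1 - points_t
    flow[~object_mask] = 0.0
    return flow

def rigid_fit(src, dst):
    """Least-squares rigid transform (R, t) aligning src to dst (Kabsch);
    one way to convert an object's motion field into a pose command."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    U, _, Vt = np.linalg.svd((src - mu_s).T @ (dst - mu_d))
    d = np.sign(np.linalg.det(Vt.T @ U.T))      # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, mu_d - R @ mu_s
```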
- 3D CAVLA: Leveraging Depth and 3D Context to Generalize Vision Language Action Models for Unseen Tasks [19.026406684039006]
Recent work has demonstrated the capabilities of fine-tuning large Vision-Language Models to learn the mapping between RGB images, language instructions, and joint-space control. In this work, we explore methods to improve the scene context awareness of a popular recent Vision-Language-Action model. Our proposed model, 3D-CAVLA, improves the success rate across various LIBERO task suites, achieving an average success rate of 98.1%.
arXiv Detail & Related papers (2025-05-09T05:32:40Z)
- Adapt3R: Adaptive 3D Scene Representation for Domain Transfer in Imitation Learning [28.80962812015936]
Imitation Learning can train robots to perform complex and diverse manipulation tasks, but learned policies are brittle when observations fall outside the training distribution. We propose Adapt3R, a general-purpose 3D observation encoder that synthesizes data from calibrated RGBD cameras into a vector usable as conditioning for arbitrary IL algorithms. Across 93 simulated and 6 real tasks, we show that when trained end-to-end with a variety of IL algorithms, Adapt3R maintains these algorithms' learning capacity while enabling zero-shot transfer to novel embodiments and camera poses.
arXiv Detail & Related papers (2025-03-06T18:17:09Z)
- SUGAR: Pre-training 3D Visual Representations for Robotics [85.55534363501131]
We introduce a novel 3D pre-training framework for robotics named SUGAR.
SUGAR captures semantic, geometric and affordance properties of objects through 3D point clouds.
We show that SUGAR's 3D representation outperforms state-of-the-art 2D and 3D representations.
arXiv Detail & Related papers (2024-04-01T21:23:03Z)
- Visual Reinforcement Learning with Self-Supervised 3D Representations [15.991546692872841]
We present a unified framework for self-supervised learning of 3D representations for motor control.
Our method enjoys improved sample efficiency in simulated manipulation tasks compared to 2D representation learning methods.
arXiv Detail & Related papers (2022-10-13T17:59:55Z)
- RandomRooms: Unsupervised Pre-training from Synthetic Shapes and Randomized Layouts for 3D Object Detection [138.2892824662943]
A promising solution is to make better use of synthetic datasets, which consist of CAD object models, to boost learning on real datasets. Recent work on 3D pre-training fails when transferring features learned on synthetic objects to other real-world applications.
In this work, we put forward a new method called RandomRooms to accomplish this objective.
arXiv Detail & Related papers (2021-08-17T17:56:12Z)
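RandomRooms-style pre-training data can be composed by scattering synthetic CAD shapes at randomized poses to form pseudo-scenes. The snippet below is a guessed minimal version (the sampling ranges, z-axis rotations, and naive overlap test are all assumptions, not the paper's pipeline):

```python
# Hypothetical pseudo-scene composer in the spirit of RandomRooms.
import numpy as np

def random_room(shapes, extent=4.0, min_gap=1.0, seed=0):
    """shapes: list of (N_i, 3) object point clouds centered at the origin.
    Returns a merged scene cloud and per-point instance labels."""
    rng = np.random.default_rng(seed)
    placed, labels, centers = [], [], []
    for idx, pts in enumerate(shapes):
        c = rng.uniform(-extent, extent, size=2)
        for _ in range(100):                   # crude rejection sampling
            if all(np.linalg.norm(c - p) > min_gap for p in centers):
                break
            c = rng.uniform(-extent, extent, size=2)
        theta = rng.uniform(0.0, 2.0 * np.pi)  # random yaw about z
        R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                      [np.sin(theta),  np.cos(theta), 0.0],
                      [0.0, 0.0, 1.0]])
        placed.append(pts @ R.T + np.array([c[0], c[1], 0.0]))
        labels.append(np.full(len(pts), idx))
        centers.append(c)
    return np.vstack(placed), np.concatenate(labels)
```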
This list is automatically generated from the titles and abstracts of the papers on this site.