Related papers: VO-DP: Semantic-Geometric Adaptive Diffusion Policy for Vision-Only Robotic Manipulation

URL: http://arxiv.org/abs/2510.15530v4
Date: Mon, 03 Nov 2025 10:10:38 GMT
Title: VO-DP: Semantic-Geometric Adaptive Diffusion Policy for Vision-Only Robotic Manipulation
Authors: Zehao Ni, Yonghao He, Lingfeng Qian, Jilei Mao, Fa Fu, Wei Sui, Hu Su, Junran Peng, Zhipeng Wang, Bin He
Abstract summary: We propose a Vision-Only and single-view Diffusion Policy learning method (VO-DP).
Score: 16.138701713455756
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In the context of imitation learning, visuomotor-based diffusion policy learning is one of the main directions in robotic manipulation. Most of these approaches rely on point clouds as observation inputs and construct scene representations through point cloud feature learning, which enables them to achieve remarkable accuracy. However, the existing literature lacks an in-depth exploration of vision-only solutions, which have significant potential. In this paper, we propose a Vision-Only and single-view Diffusion Policy learning method (VO-DP) that leverages pretrained visual foundation models to achieve effective fusion of semantic and geometric features. We utilize intermediate features from VGGT, incorporating semantic features from DINOv2 and geometric features from Alternating Attention blocks. Features are fused via cross-attention and spatially compressed with a CNN to form the input to the policy head. Extensive experiments demonstrate that VO-DP not only significantly outperforms the vision-only baseline DP but also exhibits distinct performance trends against the point cloud-based method DP3: in simulation tasks, VO-DP achieves an average success rate of 64.6%, on par with DP3 (64.0%) and far higher than DP (34.8%), while in real-world tasks, it reaches 87.9%, outperforming both DP3 (67.5%) and DP (11.2%) by a notable margin. Further robustness evaluations confirm that VO-DP remains highly stable under varying conditions including color, size, background, and lighting. Lastly, we open-source a training library for robotic manipulation. Built on Accelerate, this library supports multi-machine and multi-GPU parallel training, as well as mixed precision training. It is compatible with visuomotor policies such as DP, DP3, and VO-DP, and also supports the RoboTwin simulator.
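Illustrative sketch (not the authors' code): the abstract says semantic and geometric features are fused via cross-attention and then spatially compressed with a CNN. The pure-Python toy below shows the general shape of such a pipeline under several assumptions that the abstract does not confirm: semantic tokens act as queries over geometric tokens, there are no learned projections, and a fixed average-pooling window stands in for the learned CNN. All function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention (no learned projections)."""
    d = len(keys[0])
    out_dim = len(values[0])
    fused = []
    for q in queries:
        # Similarity of this query token to every key token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Attention output: weighted average of the value tokens.
        fused.append([sum(w * v[j] for w, v in zip(weights, values))
                      for j in range(out_dim)])
    return fused


def pool_tokens(grid, stride):
    """Average non-overlapping stride x stride windows of an H x W token grid.

    Assumption: this fixed pooling is only a stand-in for a learned CNN.
    """
    h, w, c = len(grid), len(grid[0]), len(grid[0][0])
    out = []
    for i in range(0, h - stride + 1, stride):
        row = []
        for j in range(0, w - stride + 1, stride):
            window = [grid[i + a][j + b] for a in range(stride) for b in range(stride)]
            row.append([sum(v[k] for v in window) / len(window) for k in range(c)])
        out.append(row)
    return out


def fuse_and_compress(semantic, geometric, height, width, stride=2):
    """Fuse two token sets, reshape to a grid, and spatially compress it."""
    assert len(semantic) == height * width, "one semantic token per grid cell"
    # Assumed direction: semantic tokens query the geometric tokens.
    fused = cross_attention(semantic, geometric, geometric)
    grid = [fused[r * width:(r + 1) * width] for r in range(height)]
    return pool_tokens(grid, stride)


if __name__ == "__main__":
    sem = [[0.1 * i, 0.2, -0.1 * i] for i in range(16)]   # 4 x 4 grid, 3 channels
    geo = [[0.05 * i, 1.0, 0.5] for i in range(16)]
    compressed = fuse_and_compress(sem, geo, height=4, width=4, stride=2)
    print(len(compressed), len(compressed[0]), len(compressed[0][0]))
```

In a real model the queries, keys, and values would pass through learned linear projections, attention would typically be multi-head, and the compression step would be a trained convolutional stack whose output conditions the policy head.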

Related papers

Dual-Branch Center-Surrounding Contrast: Rethinking Contrastive Learning for 3D Point Clouds [55.5576033344795]
We propose a novel Dual-Branch Center-Surrounding Contrast (CSCon) framework for 3D point clouds. Under the FULL and ALL protocols, CSCon achieves performance comparable to generative methods. Our method attains state-of-the-art results, even surpassing cross-modal approaches.
arXiv Detail & Related papers (2025-12-09T14:56:35Z)
BridgeVLA: Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models [37.699828966838986]
BridgeVLA is a novel 3D VLA model that projects 3D inputs to multiple 2D images, ensuring input alignment with the VLM backbone. It utilizes 2D heatmaps for action prediction, unifying the input and output spaces within a consistent 2D image space. It is able to achieve a success rate of 96.8% on 10+ tasks with only 3 trajectories per task, highlighting its extraordinary sample efficiency.
arXiv Detail & Related papers (2025-06-09T17:36:34Z)
Pre-training a Density-Aware Pose Transformer for Robust LiDAR-based 3D Human Pose Estimation [27.25933965875881]
LiDAR-based 3D Human Pose Estimation is becoming a research focus. Most of the existing methods use temporal information, multi-modal fusion, or SMPL optimization to correct biased results. We propose a simple yet powerful method, which provides insights both on modeling and augmentation of point clouds.
arXiv Detail & Related papers (2024-12-18T02:54:30Z)
DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning [61.10299147201369]
This paper introduces a novel autonomous RL approach, called DigiRL, for training in-the-wild device control agents. We build a scalable and parallelizable Android learning environment equipped with a VLM-based evaluator. We demonstrate the effectiveness of DigiRL using the Android-in-the-Wild dataset, where our 1.3B VLM trained with RL achieves a 49.5% absolute improvement.
arXiv Detail & Related papers (2024-06-14T17:49:55Z)
IPoD: Implicit Field Learning with Point Diffusion for Generalizable 3D Object Reconstruction from Single RGB-D Images [50.4538089115248]
Generalizable 3D object reconstruction from single-view RGB-D images remains a challenging task. We propose a novel approach, IPoD, which harmonizes implicit field learning with point diffusion. Experiments conducted on the CO3D-v2 dataset affirm the superiority of IPoD, achieving 7.8% improvement in F-score and 28.6% in Chamfer distance over existing methods.
arXiv Detail & Related papers (2024-03-30T07:17:37Z)
3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations [19.41216557646392]
3D Diffusion Policy (DP3) is a novel visual imitation learning approach. In experiments, DP3 handles most tasks with just 10 demonstrations and surpasses baselines with a 24.2% relative improvement. In real robot experiments, DP3 rarely violates safety requirements, in contrast to baseline methods which frequently do.
arXiv Detail & Related papers (2024-03-06T18:58:49Z)
Coordinate Transformer: Achieving Single-stage Multi-person Mesh Recovery from Videos [91.44553585470688]
Multi-person 3D mesh recovery from videos is a critical first step towards automatic perception of group behavior in virtual reality, physical therapy and beyond. We propose the Coordinate transFormer (CoordFormer) that directly models multi-person spatial-temporal relations and simultaneously performs multi-mesh recovery in an end-to-end manner. Experiments on the 3DPW dataset demonstrate that CoordFormer significantly improves the state-of-the-art, outperforming the previously best results by 4.2%, 8.8% and 4.7% according to the MPJPE, PAMPJPE, and PVE metrics, respectively.
arXiv Detail & Related papers (2023-08-20T18:23:07Z)
DDP: Diffusion Model for Dense Visual Prediction [71.55770562024782]
We propose a simple, efficient, yet powerful framework for dense visual predictions based on the conditional diffusion pipeline. The method, called DDP, efficiently extends the denoising diffusion process into the modern perception pipeline. DDP shows attractive properties such as dynamic inference and uncertainty awareness, in contrast to previous single-step discriminative methods.
arXiv Detail & Related papers (2023-03-30T17:26:50Z)
Policy Pre-training for End-to-end Autonomous Driving via Self-supervised Geometric Modeling [96.31941517446859]
We propose PPGeo (Policy Pre-training via Geometric modeling), an intuitive and straightforward fully self-supervised framework curated for the policy pretraining in visuomotor driving. We aim at learning policy representations as a powerful abstraction by modeling 3D geometric scenes on large-scale unlabeled and uncalibrated YouTube driving videos. In the first stage, the geometric modeling framework generates pose and depth predictions simultaneously, with two consecutive frames as input. In the second stage, the visual encoder learns driving policy representation by predicting the future ego-motion and optimizing with the photometric error based on current visual observation only.
arXiv Detail & Related papers (2023-01-03T08:52:49Z)

This list is automatically generated from the titles and abstracts of the papers on this site.