RoboTransfer: Geometry-Consistent Video Diffusion for Robotic Visual Policy Transfer
- URL: http://arxiv.org/abs/2505.23171v1
- Date: Thu, 29 May 2025 07:10:03 GMT
- Title: RoboTransfer: Geometry-Consistent Video Diffusion for Robotic Visual Policy Transfer
- Authors: Liu Liu, Xiaofeng Wang, Guosheng Zhao, Keyu Li, Wenkang Qin, Jiaxiong Qiu, Zheng Zhu, Guan Huang, Zhizhong Su
- Abstract summary: RoboTransfer is a diffusion-based video generation framework for robotic data synthesis. It integrates multi-view geometry with explicit control over scene components, such as background and object attributes. RoboTransfer is capable of generating multi-view videos with enhanced geometric consistency and visual fidelity.
- Score: 33.178540405656676
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Imitation Learning has become a fundamental approach in robotic manipulation. However, collecting large-scale real-world robot demonstrations is prohibitively expensive. Simulators offer a cost-effective alternative, but the sim-to-real gap makes it extremely challenging to scale. Therefore, we introduce RoboTransfer, a diffusion-based video generation framework for robotic data synthesis. Unlike previous methods, RoboTransfer integrates multi-view geometry with explicit control over scene components, such as background and object attributes. By incorporating cross-view feature interactions and global depth/normal conditions, RoboTransfer ensures geometry consistency across views. This framework allows fine-grained control, including background edits and object swaps. Experiments demonstrate that RoboTransfer is capable of generating multi-view videos with enhanced geometric consistency and visual fidelity. In addition, policies trained on the data generated by RoboTransfer achieve a 33.3% relative improvement in the success rate in the DIFF-OBJ setting and a substantial 251% relative improvement in the more challenging DIFF-ALL scenario. Explore more demos on our project page: https://horizonrobotics.github.io/robot_lab/robotransfer
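For a concrete picture of the two mechanisms the abstract names (cross-view feature interaction and global depth/normal conditioning), the following PyTorch block is a minimal sketch of how such a block could be wired. The module name, channel layout (depth + normal stacked as 6 channels), and residual structure are illustrative assumptions, not the paper's architecture.

```python
# Minimal sketch (assumption, not the authors' code): cross-view attention over
# per-view diffusion features, conditioned on global depth/normal maps.
import torch
import torch.nn as nn

class CrossViewBlock(nn.Module):
    def __init__(self, dim: int, cond_channels: int = 6, heads: int = 8):
        super().__init__()
        # Hypothetical conditioning: depth (1ch) + normals (3ch) per view,
        # padded to 6 channels here, projected and added to the features.
        self.cond_proj = nn.Conv2d(cond_channels, dim, kernel_size=1)
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feats: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # feats: [B, V, C, H, W] per-view features; cond: [B, V, 6, H, W] geometry maps.
        B, V, C, H, W = feats.shape
        feats = feats + self.cond_proj(cond.flatten(0, 1)).view(B, V, C, H, W)
        # Attend across the V views at each spatial location so that all views
        # agree on the same underlying geometry.
        tokens = feats.permute(0, 3, 4, 1, 2).reshape(B * H * W, V, C)
        out, _ = self.attn(self.norm(tokens), self.norm(tokens), self.norm(tokens))
        out = out.reshape(B, H, W, V, C).permute(0, 3, 4, 1, 2)
        return feats + out


if __name__ == "__main__":
    block = CrossViewBlock(dim=64)
    x = torch.randn(2, 3, 64, 16, 16)   # 2 samples, 3 camera views
    c = torch.randn(2, 3, 6, 16, 16)    # depth + normal conditions per view
    print(block(x, c).shape)            # torch.Size([2, 3, 64, 16, 16])
```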
Related papers
- RoboPearls: Editable Video Simulation for Robot Manipulation [81.18434338506621]
RoboPearls is an editable video simulation framework for robotic manipulation. Built on 3D Gaussian Splatting (3DGS), RoboPearls enables the construction of photo-realistic, view-consistent simulations. We conduct extensive experiments on multiple datasets and scenes, including RLBench, COLOSSEUM, Ego4D, Open X-Embodiment, and a real-world robot.
arXiv Detail & Related papers (2025-06-28T05:03:31Z)
- RoboSwap: A GAN-driven Video Diffusion Framework For Unsupervised Robot Arm Swapping [26.010205882976624]
RoboSwap operates on unpaired data from diverse environments. We segment robotic arms from their backgrounds and train an unpaired GAN model to translate one robotic arm to another. Our experiments demonstrate that RoboSwap outperforms state-of-the-art video and image editing models on three benchmarks.
arXiv Detail & Related papers (2025-06-10T09:46:07Z)
- Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control [72.00655365269]
We present RoboMaster, a novel framework that models inter-object dynamics through a collaborative trajectory formulation. Unlike prior methods that decompose objects, our core idea is to decompose the interaction process into three sub-stages: pre-interaction, interaction, and post-interaction. Our method outperforms existing approaches, establishing new state-of-the-art performance in trajectory-controlled video generation for robotic manipulation.
arXiv Detail & Related papers (2025-06-02T17:57:06Z)
- TransAnimate: Taming Layer Diffusion to Generate RGBA Video [3.7031943280491997]
TransAnimate is an innovative framework that integrates RGBA image generation techniques with video generation modules. We introduce an interactive motion-guided control mechanism, where directional arrows define movement and colors adjust scaling. We have developed a pipeline for creating an RGBA video dataset, incorporating high-quality game effect videos, extracted foreground objects, and synthetic transparent videos.
arXiv Detail & Related papers (2025-03-23T04:27:46Z)
- TASTE-Rob: Advancing Video Generation of Task-Oriented Hand-Object Interaction for Generalizable Robotic Manipulation [18.083105886634115]
TASTE-Rob is a dataset of 100,856 ego-centric hand-object interaction videos. Each video is meticulously aligned with language instructions and recorded from a consistent camera viewpoint. To enhance realism, we introduce a three-stage pose-refinement pipeline.
arXiv Detail & Related papers (2025-03-14T14:09:31Z)
- VidBot: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos for Zero-Shot Robotic Manipulation [53.63540587160549]
VidBot is a framework enabling zero-shot robotic manipulation using learned 3D affordance from in-the-wild monocular RGB-only human videos. VidBot paves the way for leveraging everyday human videos to make robot learning more scalable.
arXiv Detail & Related papers (2025-03-10T10:04:58Z)
- RoboUniView: Visual-Language Model with Unified View Representation for Robotic Manipulation [10.54770475137596]
We propose RoboUniView, an innovative approach that decouples visual feature extraction from action learning.
We first learn a unified view representation from multi-perspective views by pre-training on readily accessible data, and then derive actions from this unified view representation to control robotic manipulation.
We achieve state-of-the-art performance on the demanding CALVIN benchmark, enhancing the success rate in the $D \to D$ setting from 93.0% to 96.2%, and in the $ABC \to D$ setting from 92.2% to 94.2%.
arXiv Detail & Related papers (2024-06-27T08:13:33Z)
- 3D-MVP: 3D Multiview Pretraining for Robotic Manipulation [53.45111493465405]
We propose 3D-MVP, a novel approach for 3D Multi-View Pretraining using masked autoencoders. We leverage Robotic View Transformer (RVT), which uses a multi-view transformer to understand the 3D scene and predict pose actions.
arXiv Detail & Related papers (2024-06-26T08:17:59Z)
- Vid2Robot: End-to-end Video-conditioned Policy Learning with Cross-Attention Transformers [36.497624484863785]
We introduce Vid2Robot, an end-to-end video-conditioned policy that takes human videos demonstrating manipulation tasks as input and produces robot actions.
Our model is trained with a large dataset of prompt video-robot trajectory pairs to learn unified representations of human and robot actions from videos.
We evaluate Vid2Robot on real-world robots and observe over 20% improvement over BC-Z when using human prompt videos.
arXiv Detail & Related papers (2024-03-19T17:47:37Z)
- RVT: Robotic View Transformer for 3D Object Manipulation [46.25268237442356]
We propose RVT, a multi-view transformer for 3D manipulation that is both scalable and accurate.
A single RVT model works well across 18 RLBench tasks with 249 task variations, achieving 26% higher relative success than the existing state-of-the-art method (PerAct).
arXiv Detail & Related papers (2023-06-26T17:59:31Z)
- Transferring Foundation Models for Generalizable Robotic Manipulation [82.12754319808197]
We propose a novel paradigm that effectively leverages language-reasoning segmentation masks generated by internet-scale foundation models. Our approach can effectively and robustly perceive object pose and enable sample-efficient generalization learning. Demos can be found in our submitted video, and more comprehensive ones can be found in link1 or link2.
arXiv Detail & Related papers (2023-06-09T07:22:12Z)
- Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation [52.94101901600948]
We develop PerAct, a language-conditioned behavior-cloning agent for multi-task 6-DoF manipulation.
PerAct encodes language goals and RGB-D voxel observations with a Perceiver Transformer, and outputs discretized actions by "detecting the next best voxel action" (see the sketch after this entry).
Our results show that PerAct significantly outperforms unstructured image-to-action agents and 3D ConvNet baselines for a wide range of tabletop tasks.
arXiv Detail & Related papers (2022-09-12T17:51:05Z)
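As a concrete reading of the "next best voxel action" readout quoted in the PerAct summary above, the block below is a minimal sketch of one way such a discretized action could be recovered from a per-voxel score volume. The function name, grid resolution, and workspace bounds are hypothetical; this is not the released PerAct implementation.

```python
# Minimal sketch (hypothetical, not the PerAct codebase): turning a per-voxel
# score volume into a discretized "next best voxel" translation target.
import torch

def next_best_voxel(scores: torch.Tensor,
                    workspace_min: torch.Tensor,
                    workspace_max: torch.Tensor) -> torch.Tensor:
    """scores: [D, H, W] per-voxel action scores (e.g. decoded from a
    Perceiver-style transformer); returns the xyz centre of the argmax voxel."""
    D, H, W = scores.shape
    flat_idx = torch.argmax(scores.flatten())
    d, rem = divmod(flat_idx.item(), H * W)
    h, w = divmod(rem, W)
    # Map the discrete voxel index back to continuous workspace coordinates.
    grid = torch.tensor([d, h, w], dtype=torch.float32)
    sizes = torch.tensor([D, H, W], dtype=torch.float32)
    return workspace_min + (grid + 0.5) / sizes * (workspace_max - workspace_min)


if __name__ == "__main__":
    vol = torch.randn(100, 100, 100)          # 100^3 voxel grid (assumed resolution)
    lo = torch.tensor([-0.3, -0.5, 0.0])      # workspace bounds in metres (assumed)
    hi = torch.tensor([0.7, 0.5, 1.0])
    print(next_best_voxel(vol, lo, hi))       # xyz of the highest-scoring voxel
```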