RoboSwap: A GAN-driven Video Diffusion Framework For Unsupervised Robot Arm Swapping
- URL: http://arxiv.org/abs/2506.08632v1
- Date: Tue, 10 Jun 2025 09:46:07 GMT
- Title: RoboSwap: A GAN-driven Video Diffusion Framework For Unsupervised Robot Arm Swapping
- Authors: Yang Bai, Liudi Yang, George Eskandar, Fengyi Shen, Dong Chen, Mohammad Altillawi, Ziyuan Liu, Gitta Kutyniok
- Abstract summary: RoboSwap operates on unpaired data from diverse environments. We segment robotic arms from their backgrounds and train an unpaired GAN model to translate one robotic arm to another. Our experiments demonstrate that RoboSwap outperforms state-of-the-art video and image editing models on three benchmarks.
- Score: 26.010205882976624
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent advancements in generative models have revolutionized video synthesis and editing. However, the scarcity of diverse, high-quality datasets continues to hinder video-conditioned robotic learning, limiting cross-platform generalization. In this work, we address the challenge of swapping a robotic arm in one video with another: a key step for cross-embodiment learning. Unlike previous methods that depend on paired video demonstrations in the same environmental settings, our proposed framework, RoboSwap, operates on unpaired data from diverse environments, alleviating data collection needs. RoboSwap introduces a novel video editing pipeline integrating both GANs and diffusion models, combining their complementary advantages. Specifically, we segment robotic arms from their backgrounds and train an unpaired GAN model to translate one robotic arm to another. The translated arm is blended with the original video background and refined with a diffusion model to enhance coherence, motion realism and object interaction. The GAN and diffusion stages are trained independently. Our experiments demonstrate that RoboSwap outperforms state-of-the-art video and image editing models on three benchmarks in terms of both structural coherence and motion consistency, thereby offering a robust solution for generating reliable, cross-embodiment data in robotic learning.
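The abstract describes a staged pipeline: segment the arm, translate it with an unpaired GAN, blend the result back onto the original background, and refine the composite with a video diffusion model. The following is a minimal sketch of that flow, assuming mask-based alpha compositing between stages; all class names (ArmSegmenter, UnpairedArmGAN, VideoDiffusionRefiner) are hypothetical placeholders and not the authors' released code.

```python
# Hypothetical sketch of a RoboSwap-style arm-swapping pipeline:
# (1) segment the robot arm, (2) translate it with an unpaired GAN,
# (3) blend the translated arm onto the original background,
# (4) refine the blended video with a diffusion model.
import torch


class ArmSegmenter(torch.nn.Module):
    """Placeholder: predicts a per-frame soft mask of the robot arm."""
    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (T, C, H, W) -> mask: (T, 1, H, W) in [0, 1]
        return torch.ones(frames.shape[0], 1, *frames.shape[2:])


class UnpairedArmGAN(torch.nn.Module):
    """Placeholder: CycleGAN-style generator mapping arm A -> arm B."""
    def forward(self, arm_pixels: torch.Tensor) -> torch.Tensor:
        return arm_pixels  # identity stand-in for the learned translation


class VideoDiffusionRefiner(torch.nn.Module):
    """Placeholder: diffusion model refining coherence and motion realism."""
    def forward(self, video: torch.Tensor) -> torch.Tensor:
        return video  # identity stand-in for the denoising refinement


def swap_arm(frames: torch.Tensor) -> torch.Tensor:
    """Swap the arm in a video clip, following the staged pipeline."""
    segmenter, gan, refiner = ArmSegmenter(), UnpairedArmGAN(), VideoDiffusionRefiner()
    mask = segmenter(frames)                            # (1) isolate arm pixels
    translated = gan(frames * mask)                     # (2) unpaired arm-to-arm translation
    blended = translated * mask + frames * (1 - mask)   # (3) composite onto the original background
    return refiner(blended)                             # (4) diffusion-based refinement


if __name__ == "__main__":
    clip = torch.rand(8, 3, 64, 64)   # dummy 8-frame clip
    print(swap_arm(clip).shape)       # torch.Size([8, 3, 64, 64])
```

Because the GAN and diffusion stages are trained independently (per the abstract), each placeholder module above could be swapped for its trained counterpart without changing the overall composition logic.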
Related papers
- Vidar: Embodied Video Diffusion Model for Generalist Bimanual Manipulation [21.424029706788883]
We introduce Video Diffusion for Action Reasoning (Vidar). We pre-train the video diffusion model on 750K multi-view videos from three real-world bimanual robot platforms. With only 20 minutes of human demonstrations on an unseen robot platform, Vidar generalizes to unseen tasks and backgrounds with strong semantic understanding.
arXiv Detail & Related papers (2025-07-17T08:31:55Z) - Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control [72.00655365269]
We present RoboMaster, a novel framework that models inter-object dynamics through a collaborative trajectory formulation. Unlike prior methods that decompose objects, our core is to decompose the interaction process into three sub-stages: pre-interaction, interaction, and post-interaction. Our method outperforms existing approaches, establishing new state-of-the-art performance in trajectory-controlled video generation for robotic manipulation.
arXiv Detail & Related papers (2025-06-02T17:57:06Z) - RoboTransfer: Geometry-Consistent Video Diffusion for Robotic Visual Policy Transfer [33.178540405656676]
RoboTransfer is a diffusion-based video generation framework for robotic data synthesis. It integrates multi-view geometry with explicit control over scene components, such as background and object attributes. RoboTransfer is capable of generating multi-view videos with enhanced geometric consistency and visual fidelity.
arXiv Detail & Related papers (2025-05-29T07:10:03Z) - DreamGen: Unlocking Generalization in Robot Learning through Neural Trajectories [120.25799361925387]
DreamGen is a pipeline for training robot policies that generalize across behaviors and environments through neural trajectories. Our work establishes a promising new axis for scaling robot learning well beyond manual data collection.
arXiv Detail & Related papers (2025-05-19T04:55:39Z) - TASTE-Rob: Advancing Video Generation of Task-Oriented Hand-Object Interaction for Generalizable Robotic Manipulation [18.083105886634115]
TASTE-Rob is a dataset of 100,856 ego-centric hand-object interaction videos. Each video is meticulously aligned with language instructions and recorded from a consistent camera viewpoint. To enhance realism, we introduce a three-stage pose-refinement pipeline.
arXiv Detail & Related papers (2025-03-14T14:09:31Z) - VidBot: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos for Zero-Shot Robotic Manipulation [53.63540587160549]
VidBot is a framework enabling zero-shot robotic manipulation using learned 3D affordance from in-the-wild monocular RGB-only human videos. VidBot paves the way for leveraging everyday human videos to make robot learning more scalable.
arXiv Detail & Related papers (2025-03-10T10:04:58Z) - VidMan: Exploiting Implicit Dynamics from Video Diffusion Model for Effective Robot Manipulation [79.00294932026266]
VidMan is a novel framework that employs a two-stage training mechanism to enhance stability and improve data utilization efficiency.
Our framework outperforms the state-of-the-art baseline model GR-1 on the CALVIN benchmark, achieving an 11.7% relative improvement, and demonstrates over 9% precision gains on the OXE small-scale dataset.
arXiv Detail & Related papers (2024-11-14T03:13:26Z) - Learning an Actionable Discrete Diffusion Policy via Large-Scale Actionless Video Pre-Training [69.54948297520612]
Learning a generalist embodied agent poses challenges, primarily stemming from the scarcity of action-labeled robotic datasets.
We introduce a novel framework to tackle these challenges, which leverages a unified discrete diffusion to combine generative pre-training on human videos and policy fine-tuning on a small number of action-labeled robot videos.
Our method generates high-fidelity future videos for planning and enhances the fine-tuned policies compared to previous state-of-the-art approaches.
arXiv Detail & Related papers (2024-02-22T09:48:47Z) - Future Frame Prediction for Robot-assisted Surgery [57.18185972461453]
We propose a ternary prior guided variational autoencoder (TPG-VAE) model for future frame prediction in robotic surgical video sequences.
Besides content distribution, our model learns motion distribution, which is novel to handle the small movements of surgical tools.
arXiv Detail & Related papers (2021-03-18T15:12:06Z)