Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control
- URL: http://arxiv.org/abs/2506.01943v2
- Date: Fri, 04 Jul 2025 04:06:12 GMT
- Title: Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control
- Authors: Xiao Fu, Xintao Wang, Xian Liu, Jianhong Bai, Runsen Xu, Pengfei Wan, Di Zhang, Dahua Lin
- Abstract summary: We present RoboMaster, a novel framework that models inter-object dynamics through a collaborative trajectory formulation. Unlike prior methods that decompose objects, our core idea is to decompose the interaction process into three sub-stages: pre-interaction, interaction, and post-interaction. Our method outperforms existing approaches, establishing new state-of-the-art performance in trajectory-controlled video generation for robotic manipulation.
- Score: 72.00655365269
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in video diffusion models have demonstrated strong potential for generating robotic decision-making data, with trajectory conditions further enabling fine-grained control. However, existing trajectory-based methods primarily focus on individual object motion and struggle to capture the multi-object interactions crucial to complex robotic manipulation. This limitation arises from multi-feature entanglement in overlapping regions, which degrades visual fidelity. To address this, we present RoboMaster, a novel framework that models inter-object dynamics through a collaborative trajectory formulation. Unlike prior methods that decompose objects, our core idea is to decompose the interaction process into three sub-stages: pre-interaction, interaction, and post-interaction. Each stage is modeled using the feature of the dominant object, specifically the robotic arm in the pre- and post-interaction phases and the manipulated object during interaction, thereby mitigating the multi-object feature fusion that degrades the interaction phase in prior work. To further ensure subject semantic consistency throughout the video, we incorporate appearance- and shape-aware latent representations for objects. Extensive experiments on the challenging Bridge V2 dataset, as well as in-the-wild evaluation, demonstrate that our method outperforms existing approaches, establishing new state-of-the-art performance in trajectory-controlled video generation for robotic manipulation.
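To make the stage-wise conditioning idea concrete, here is a minimal sketch of the dominant-subject selection described in the abstract. The names (stagewise_condition_latents, arm_latents, obj_latents) and the array layout are assumptions for illustration only, not the paper's actual architecture, which conditions a video diffusion model on collaborative trajectories.

```python
import numpy as np

def stagewise_condition_latents(arm_latents, obj_latents, t_start, t_end):
    """Pick the per-frame conditioning latent by interaction sub-stage.

    Illustrates the three-stage decomposition described above:
    pre-interaction and post-interaction frames use the robotic arm
    (the dominant subject), while frames inside the interaction
    interval use the manipulated object.

    arm_latents, obj_latents: (T, D) arrays of per-frame subject latents.
    t_start, t_end: frame indices bounding the interaction sub-stage.
    """
    cond = np.empty_like(arm_latents)
    for t in range(arm_latents.shape[0]):
        if t_start <= t < t_end:
            cond[t] = obj_latents[t]   # interaction: manipulated object dominates
        else:
            cond[t] = arm_latents[t]   # pre-/post-interaction: robotic arm dominates
    return cond

# Toy usage: a 16-frame clip with 8-dim latents, interaction over frames 5..10.
arm = np.random.randn(16, 8).astype(np.float32)
obj = np.random.randn(16, 8).astype(np.float32)
cond = stagewise_condition_latents(arm, obj, t_start=5, t_end=11)
assert cond.shape == (16, 8)
```

The arrays above merely stand in for whatever appearance- and shape-aware latent representations the paper uses; in RoboMaster these stage-specific features serve as conditions for trajectory-controlled video generation rather than being selected by a simple per-frame switch.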
Related papers
- TASTE-Rob: Advancing Video Generation of Task-Oriented Hand-Object Interaction for Generalizable Robotic Manipulation [18.083105886634115]
TASTE-Rob is a dataset of 100,856 ego-centric hand-object interaction videos. Each video is meticulously aligned with language instructions and recorded from a consistent camera viewpoint. To enhance realism, we introduce a three-stage pose-refinement pipeline.
arXiv Detail & Related papers (2025-03-14T14:09:31Z) - C-Drag: Chain-of-Thought Driven Motion Controller for Video Generation [81.4106601222722]
Trajectory-based motion control has emerged as an intuitive and efficient approach for controllable video generation. We propose C-Drag, a Chain-of-Thought-based motion controller for controllable video generation. Our method includes an object perception module and a Chain-of-Thought-based motion reasoning module.
arXiv Detail & Related papers (2025-02-27T08:21:03Z) - CuriousBot: Interactive Mobile Exploration via Actionable 3D Relational Object Graph [12.54884302440877]
Mobile exploration is a longstanding challenge in robotics. Existing robotic exploration approaches via active interaction are often restricted to tabletop scenes. We introduce a 3D relational object graph that encodes diverse object relations and enables exploration through active interaction.
arXiv Detail & Related papers (2025-01-23T02:39:04Z) - Diffgrasp: Whole-Body Grasping Synthesis Guided by Object Motion Using a Diffusion Model [25.00532805042292]
We propose a simple yet effective framework that jointly models the relationship between the body, hands, and the given object motion sequences. We introduce novel contact-aware losses and incorporate a data-driven, carefully designed guidance. Experimental results demonstrate that our approach outperforms the state-of-the-art method and generates plausible whole-body motion sequences.
arXiv Detail & Related papers (2024-12-30T02:21:43Z) - InterDyn: Controllable Interactive Dynamics with Video Diffusion Models [50.38647583839384]
We propose InterDyn, a framework that generates videos of interactive dynamics given an initial frame and a control signal encoding the motion of a driving object or actor. Our key insight is that large video generation models can act as both neural renderers and implicit physics simulators, having learned interactive dynamics from large-scale video data.
arXiv Detail & Related papers (2024-12-16T13:57:02Z) - Articulated Object Manipulation using Online Axis Estimation with SAM2-Based Tracking [57.942404069484134]
Articulated object manipulation requires precise object interaction, where the object's axis must be carefully considered. Previous research employed interactive perception for manipulating articulated objects, but such open-loop approaches often overlook the interaction dynamics. We present a closed-loop pipeline integrating interactive perception with online axis estimation from segmented 3D point clouds.
arXiv Detail & Related papers (2024-09-24T17:59:56Z) - Learning Manipulation by Predicting Interaction [85.57297574510507]
We propose a general pre-training pipeline that learns Manipulation by Predicting the Interaction (MPI).
Experimental results demonstrate that MPI improves by 10% to 64% over previous state-of-the-art approaches on real-world robot platforms.
arXiv Detail & Related papers (2024-06-01T13:28:31Z) - Leveraging Next-Active Objects for Context-Aware Anticipation in Egocentric Videos [31.620555223890626]
We study the problem of short-term object interaction anticipation (STA).
We propose NAOGAT, a multi-modal end-to-end transformer network that leverages next-active objects to guide the prediction of context-aware future actions.
Our model outperforms existing methods on two separate datasets.
arXiv Detail & Related papers (2023-08-16T12:07:02Z) - Learn the Force We Can: Enabling Sparse Motion Control in Multi-Object Video Generation [26.292052071093945]
We propose an unsupervised method, named YODA, to generate videos from a single frame and a sparse motion input.
Our trained model can generate unseen, realistic object-to-object interactions.
We show that YODA is on par with or better than prior state-of-the-art video generation work in terms of both controllability and video quality.
arXiv Detail & Related papers (2023-06-06T19:50:02Z) - V-MAO: Generative Modeling for Multi-Arm Manipulation of Articulated Objects [51.79035249464852]
We present a framework for learning multi-arm manipulation of articulated objects.
Our framework includes a variational generative model that learns a contact point distribution over object rigid parts for each robot arm.
arXiv Detail & Related papers (2021-11-07T02:31:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.