Joint Flow Trajectory Optimization For Feasible Robot Motion Generation from Video Demonstrations
- URL: http://arxiv.org/abs/2509.20703v1
- Date: Thu, 25 Sep 2025 03:11:07 GMT
- Title: Joint Flow Trajectory Optimization For Feasible Robot Motion Generation from Video Demonstrations
- Authors: Xiaoxiang Dong, Matthew Johnson-Roberson, Weiming Zhi,
- Abstract summary: We propose a framework for grasp pose generation and object trajectory imitation under the video-based Learning-from-Demonstration (LfD) paradigm.<n>Rather than directly imitating human hand motions, our method treats demonstrations as object-centric guides.<n>We validate our approach in both simulation and real-world experiments across diverse real-world manipulation tasks.
- Score: 8.133207162076877
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Learning from human video demonstrations offers a scalable alternative to teleoperation or kinesthetic teaching, but poses challenges for robot manipulators due to embodiment differences and joint feasibility constraints. We address this problem by proposing the Joint Flow Trajectory Optimization (JFTO) framework for grasp pose generation and object trajectory imitation under the video-based Learning-from-Demonstration (LfD) paradigm. Rather than directly imitating human hand motions, our method treats demonstrations as object-centric guides, balancing three objectives: (i) selecting a feasible grasp pose, (ii) generating object trajectories consistent with demonstrated motions, and (iii) ensuring collision-free execution within robot kinematics. To capture the multimodal nature of demonstrations, we extend flow matching to $\SE(3)$ for probabilistic modeling of object trajectories, enabling density-aware imitation that avoids mode collapse. The resulting optimization integrates grasp similarity, trajectory likelihood, and collision penalties into a unified differentiable objective. We validate our approach in both simulation and real-world experiments across diverse real-world manipulation tasks.
Related papers
- AnchorDream: Repurposing Video Diffusion for Embodiment-Aware Robot Data Synthesis [33.90053396451562]
We introduce AnchorDream, an embodiment-aware world model that repurposes pretrained video diffusion models for robot data synthesis.<n>Our method scales them into large, diverse, high-quality datasets without requiring explicit environment modeling.<n>Experiments show that the generated data leads to consistent improvements in downstream policy learning, with relative gains of 36.4% in simulator benchmarks and nearly double performance in real-world studies.
arXiv Detail & Related papers (2025-12-12T18:59:45Z) - UniMo: Unifying 2D Video and 3D Human Motion with an Autoregressive Framework [54.337290937468175]
We propose UniMo, an autoregressive model for joint modeling of 2D human videos and 3D human motions within a unified framework.<n>We show that our method simultaneously generates corresponding videos and motions while performing accurate motion capture.
arXiv Detail & Related papers (2025-12-03T16:03:18Z) - MotionVerse: A Unified Multimodal Framework for Motion Comprehension, Generation and Editing [53.98607267063729]
MotionVerse is a framework to comprehend, generate, and edit human motion in both single-person and multi-person scenarios.<n>We employ a motion tokenizer with residual quantization, which converts continuous motion sequences into multi-stream discrete tokens.<n>We also introduce a textitDelay Parallel Modeling strategy, which temporally staggers the encoding of residual token streams.
arXiv Detail & Related papers (2025-09-28T04:20:56Z) - WorldForge: Unlocking Emergent 3D/4D Generation in Video Diffusion Model via Training-Free Guidance [17.295532380360992]
Recent video diffusion models demonstrate strong potential in spatial intelligence tasks due to their rich latent world priors.<n>We propose WorldForge, a training-free, inference-time framework composed of three tightly coupled modules.<n>This work introduces a novel plug-and-play paradigm for controllable video synthesis, offering a new perspective on leveraging generative priors for spatial intelligence.
arXiv Detail & Related papers (2025-09-18T16:40:47Z) - Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control [72.00655365269]
We present RoboMaster, a novel framework that models inter-object dynamics through a collaborative trajectory formulation.<n>Unlike prior methods that decompose objects, our core is to decompose the interaction process into three sub-stages: pre-interaction, interaction, and post-interaction.<n>Our method outperforms existing approaches, establishing new state-of-the-art performance in trajectory-controlled video generation for robotic manipulation.
arXiv Detail & Related papers (2025-06-02T17:57:06Z) - SIGHT: Synthesizing Image-Text Conditioned and Geometry-Guided 3D Hand-Object Trajectories [124.24041272390954]
Modeling hand-object interaction priors holds significant potential to advance robotic and embodied AI systems.<n>We introduce SIGHT, a novel task focused on generating realistic and physically plausible 3D hand-object interaction trajectories from a single image.<n>We propose SIGHT-Fusion, a novel diffusion-based image-text conditioned generative model that tackles this task by retrieving the most similar 3D object mesh from a database.
arXiv Detail & Related papers (2025-03-28T20:53:20Z) - Conditional Neural Expert Processes for Learning Movement Primitives from Demonstration [1.9336815376402723]
Conditional Neural Expert Processes (CNEP) learns to assign demonstrations from different modes to distinct expert networks.
CNEP does not require supervision on which mode the trajectories belong to.
Our system is capable of on-the-fly adaptation to environmental changes via an online conditioning mechanism.
arXiv Detail & Related papers (2024-02-13T12:52:02Z) - Movement Primitive Diffusion: Learning Gentle Robotic Manipulation of Deformable Objects [14.446751610174868]
Movement Primitive Diffusion (MPD) is a novel method for imitation learning (IL) in robot-assisted surgery.
MPD combines the versatility of diffusion-based imitation learning (DIL) with the high-quality motion generation capabilities of Probabilistic Dynamic Movement Primitives (ProDMPs)
We evaluate MPD across various simulated and real world robotic tasks on both state and image observations.
arXiv Detail & Related papers (2023-12-15T18:24:28Z) - TrackDiffusion: Tracklet-Conditioned Video Generation via Diffusion Models [75.20168902300166]
We propose TrackDiffusion, a novel video generation framework affording fine-grained trajectory-conditioned motion control.
A pivotal component of TrackDiffusion is the instance enhancer, which explicitly ensures inter-frame consistency of multiple objects.
generated video sequences by our TrackDiffusion can be used as training data for visual perception models.
arXiv Detail & Related papers (2023-12-01T15:24:38Z) - Hierarchical Generation of Human-Object Interactions with Diffusion
Probabilistic Models [71.64318025625833]
This paper presents a novel approach to generating the 3D motion of a human interacting with a target object.
Our framework first generates a set of milestones and then synthesizes the motion along them.
The experiments on the NSM, COUCH, and SAMP datasets show that our approach outperforms previous methods by a large margin in both quality and diversity.
arXiv Detail & Related papers (2023-10-03T17:50:23Z) - Interactive Character Control with Auto-Regressive Motion Diffusion Models [18.727066177880708]
We propose A-MDM (Auto-regressive Motion Diffusion Model) for real-time motion synthesis.
Our conditional diffusion model takes an initial pose as input, and auto-regressively generates successive motion frames conditioned on previous frame.
We introduce a suite of techniques for incorporating interactive controls into A-MDM, such as task-oriented sampling, in-painting, and hierarchical reinforcement learning.
arXiv Detail & Related papers (2023-06-01T07:48:34Z) - SE(3)-DiffusionFields: Learning smooth cost functions for joint grasp
and motion optimization through diffusion [34.25379651790627]
This work introduces a method for learning data-driven SE(3) cost functions as diffusion models.
We focus on learning SE(3) diffusion models for 6DoF grasping, giving rise to a novel framework for joint grasp and motion optimization.
arXiv Detail & Related papers (2022-09-08T14:50:23Z) - Learning to Shift Attention for Motion Generation [55.61994201686024]
One challenge of motion generation using robot learning from demonstration techniques is that human demonstrations follow a distribution with multiple modes for one task query.
Previous approaches fail to capture all modes or tend to average modes of the demonstrations and thus generate invalid trajectories.
We propose a motion generation model with extrapolation ability to overcome this problem.
arXiv Detail & Related papers (2021-02-24T09:07:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.