DRAW2ACT: Turning Depth-Encoded Trajectories into Robotic Demonstration Videos
- URL: http://arxiv.org/abs/2512.14217v1
- Date: Tue, 16 Dec 2025 09:11:36 GMT
- Title: DRAW2ACT: Turning Depth-Encoded Trajectories into Robotic Demonstration Videos
- Authors: Yang Bai, Liudi Yang, George Eskandar, Fengyi Shen, Mohammad Altillawi, Ziyuan Liu, Gitta Kutyniok
- Abstract summary: Video models provide powerful real-world simulators for embodied AI but remain limited in controllability for robotic manipulation. We present DRAW2ACT, a trajectory-conditioned video generation framework that extracts multiple representations from the input trajectory. We show that DRAW2ACT achieves superior visual fidelity and consistency while yielding higher manipulation success rates compared to existing baselines.
- Score: 24.681248200255975
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Video diffusion models provide powerful real-world simulators for embodied AI but remain limited in controllability for robotic manipulation. Recent works on trajectory-conditioned video generation address this gap but often rely on 2D trajectories or single modality conditioning, which restricts their ability to produce controllable and consistent robotic demonstrations. We present DRAW2ACT, a depth-aware trajectory-conditioned video generation framework that extracts multiple orthogonal representations from the input trajectory, capturing depth, semantics, shape and motion, and injects them into the diffusion model. Moreover, we propose to jointly generate spatially aligned RGB and depth videos, leveraging cross-modality attention mechanisms and depth supervision to enhance the spatio-temporal consistency. Finally, we introduce a multimodal policy model conditioned on the generated RGB and depth sequences to regress the robot's joint angles. Experiments on Bridge V2, Berkeley Autolab, and simulation benchmarks show that DRAW2ACT achieves superior visual fidelity and consistency while yielding higher manipulation success rates compared to existing baselines.
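The abstract names two mechanisms that are easy to picture in code: deriving several condition maps (depth, motion, and so on) from one depth-encoded trajectory, and cross-modality attention tying the jointly generated RGB and depth streams together. The sketch below is a minimal, hypothetical illustration of both; module names, tensor shapes, and the rasterization scheme are assumptions, not the authors' implementation.
```python
# Hypothetical sketch, not DRAW2ACT's code: (a) condition maps rasterized
# from a depth-encoded trajectory, (b) RGB tokens attending to depth tokens.
import torch
import torch.nn as nn

class CrossModalityAttention(nn.Module):
    """One direction of cross-modality attention: RGB queries, depth keys/values."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, rgb_tokens, depth_tokens):
        fused, _ = self.attn(self.norm(rgb_tokens), depth_tokens, depth_tokens)
        return rgb_tokens + fused  # residual injection into the RGB branch

def trajectory_conditions(traj_xyz: torch.Tensor, h: int, w: int):
    """Rasterize a (T, 3) trajectory in normalized coordinates into coarse
    per-frame 'depth' and 'motion' maps; 'semantics' and 'shape' would come
    from extra encoders in the real system."""
    T = traj_xyz.shape[0]
    depth_map = torch.zeros(T, 1, h, w)
    motion_map = torch.zeros(T, 2, h, w)
    for t in range(T):
        x, y, z = traj_xyz[t]
        u = int(x.clamp(0, 1) * (w - 1)); v = int(y.clamp(0, 1) * (h - 1))
        depth_map[t, 0, v, u] = z                      # trajectory point carries depth
        if t > 0:                                      # finite-difference motion cue
            motion_map[t, :, v, u] = traj_xyz[t, :2] - traj_xyz[t - 1, :2]
    return {"depth": depth_map, "motion": motion_map}

rgb = torch.randn(1, 16, 128); dep = torch.randn(1, 16, 128)
print(CrossModalityAttention(128)(rgb, dep).shape)     # torch.Size([1, 16, 128])
```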
Related papers
- Infinite-Homography as Robust Conditioning for Camera-Controlled Video Generation [49.12018869332346]
InfCam is a camera-controlled video-to-video generation framework with high pose fidelity. The framework integrates two key components: (1) infinite homography warping, which encodes 3D camera rotations directly within the 2D latent space of a video diffusion model.
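The summary mentions infinite homography warping. The standard pinhole formulation (not necessarily InfCam's exact variant) is H∞ = K R K⁻¹, which maps pixels under a pure camera rotation R; points at infinity are unaffected by translation. A minimal sketch:
```python
# Standard infinite homography H_inf = K @ R @ inv(K); intrinsics are made up.
import numpy as np

def infinite_homography(K: np.ndarray, R: np.ndarray) -> np.ndarray:
    return K @ R @ np.linalg.inv(K)

def warp_points(H: np.ndarray, uv: np.ndarray) -> np.ndarray:
    """Apply H to (N, 2) pixel coordinates via homogeneous coordinates."""
    ones = np.ones((uv.shape[0], 1))
    p = np.hstack([uv, ones]) @ H.T
    return p[:, :2] / p[:, 2:3]

K = np.array([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
theta = np.deg2rad(5.0)                      # 5-degree yaw rotation
R = np.array([[np.cos(theta), 0., np.sin(theta)],
              [0., 1., 0.],
              [-np.sin(theta), 0., np.cos(theta)]])
print(warp_points(infinite_homography(K, R), np.array([[320., 240.]])))
```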
arXiv Detail & Related papers (2025-12-18T20:03:05Z)
- mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs [5.109732854501585]
We introduce mimic-video, a novel Video-Action Model (VAM) that pairs a pretrained Internet-scale video model with a flow matching-based action decoder conditioned on its latent representations. Our approach achieves state-of-the-art performance on simulated and real-world robotic manipulation tasks, improving sample efficiency by 10x and convergence speed by 2x compared to traditional VLA architectures.
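A flow-matching action decoder of the kind described here learns a velocity field over actions and samples by integrating the resulting ODE from noise. The sketch below is an illustrative stand-in: the network architecture, dimensions, and the Euler sampler are assumptions, not mimic-video's design.
```python
# Illustrative flow-matching action decoder conditioned on a video latent.
import torch
import torch.nn as nn

class ActionVelocityNet(nn.Module):
    def __init__(self, action_dim=7, latent_dim=256, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(action_dim + latent_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, action_dim))

    def forward(self, a, t, video_latent):
        t_feat = t.expand(a.shape[0], 1)               # broadcast time to batch
        return self.net(torch.cat([a, video_latent, t_feat], dim=-1))

@torch.no_grad()
def sample_action(model, video_latent, steps=10):
    """Euler integration of da/dt = v(a, t, latent) from t=0 (noise) to t=1."""
    a = torch.randn(video_latent.shape[0], 7)
    for i in range(steps):
        t = torch.tensor([[i / steps]])
        a = a + (1.0 / steps) * model(a, t, video_latent)
    return a

model = ActionVelocityNet()
latent = torch.randn(1, 256)   # stand-in for the pretrained video model's latent
print(sample_action(model, latent).shape)   # torch.Size([1, 7])
```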
arXiv Detail & Related papers (2025-12-17T18:47:31Z)
- VHOI: Controllable Video Generation of Human-Object Interactions from Sparse Trajectories via Motion Densification [65.15340059997273]
VHOI is a framework for creating realistic human-object interactions in video. We introduce a novel HOI-aware motion representation that uses color encodings to distinguish not only human and object motion, but also body-part-specific dynamics. Experiments demonstrate state-of-the-art results in controllable HOI video generation.
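One way to picture a color-encoded motion representation: each trajectory is drawn into an RGB frame with a color identifying what is moving (the object versus a specific body part). The palette and rasterization below are purely illustrative assumptions, not VHOI's encoding.
```python
# Hedged sketch of a color-coded motion map.
import numpy as np

PALETTE = {"object": (255, 0, 0), "left_hand": (0, 255, 0), "torso": (0, 0, 255)}

def encode_motion(trajs: dict, h: int = 64, w: int = 64) -> np.ndarray:
    """trajs maps a semantic label to an (N, 2) array of pixel positions."""
    frame = np.zeros((h, w, 3), dtype=np.uint8)
    for label, pts in trajs.items():
        color = PALETTE[label]
        for u, v in pts.astype(int):
            frame[np.clip(v, 0, h - 1), np.clip(u, 0, w - 1)] = color
    return frame

demo = {"object": np.array([[10, 10], [12, 11]]),
        "left_hand": np.array([[30, 40], [31, 42]])}
print(encode_motion(demo).sum(axis=(0, 1)))  # per-channel mass shows the coding
```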
arXiv Detail & Related papers (2025-12-10T13:40:24Z)
- Real-Time Motion-Controllable Autoregressive Video Diffusion [79.32730467857535]
We propose AR-Drag, the first RL-enhanced few-step AR video diffusion model for real-time image-to-video generation with diverse motion control. We first fine-tune a base I2V model to support basic motion control, then further improve it via reinforcement learning with a trajectory-based reward model. Our design preserves the Markov property through a Self-Rollout learning mechanism and accelerates training by selectively reducing denoising steps.
arXiv Detail & Related papers (2025-10-09T12:17:11Z)
- PoseDiff: A Unified Diffusion Model Bridging Robot Pose Estimation and Video-to-Action Control [67.17998939712326]
We present PoseDiff, a conditional diffusion model that unifies robot state estimation and control within a single framework. At its core, PoseDiff maps raw visual observations into structured robot states, such as 3D keypoints or joint angles, from a single RGB image. Building upon this foundation, PoseDiff extends naturally to video-to-action inverse dynamics.
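Conditional diffusion over a low-dimensional state vector looks schematically like the sketch below: a denoiser conditioned on image features is iterated from noise to a joint-angle estimate. The network, schedule, and update rule here are toy placeholders, not PoseDiff's actual formulation.
```python
# Schematic conditional diffusion over a robot-state vector.
import torch
import torch.nn as nn

class StateDenoiser(nn.Module):
    def __init__(self, state_dim=7, img_dim=512, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + img_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, state_dim))

    def forward(self, x, t, img_feat):
        return self.net(torch.cat([x, img_feat, t], dim=-1))

@torch.no_grad()
def sample_state(eps_model, img_feat, steps=50):
    x = torch.randn(img_feat.shape[0], 7)
    for i in reversed(range(steps)):
        t = torch.full((x.shape[0], 1), i / steps)
        eps = eps_model(x, t, img_feat)
        x = x - eps / steps                 # toy update standing in for the
        if i > 0:                           # proper posterior step
            x = x + (1.0 / steps) ** 0.5 * 0.1 * torch.randn_like(x)
    return x

model = StateDenoiser()
img_feat = torch.randn(1, 512)              # stand-in image encoder output
print(sample_state(model, img_feat).shape)  # torch.Size([1, 7])
```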
arXiv Detail & Related papers (2025-09-29T10:55:48Z)
- Pixel Motion Diffusion is What We Need for Robot Control [38.925028601732116]
DAWN is a unified diffusion-based framework for language-conditioned robotic manipulation. It bridges high-level motion intent and low-level robot action via a structured pixel motion representation. DAWN achieves state-of-the-art results on the challenging CALVIN benchmark.
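The two-stage structure the summary describes can be sketched as a motion model that predicts a dense pixel-motion (flow-like) field, followed by an action head that maps that field to low-level commands. Both networks below are illustrative placeholders, not DAWN's architecture.
```python
# Two-stage sketch: pixel motion field -> action vector.
import torch
import torch.nn as nn

class PixelMotionModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3, 2, kernel_size=3, padding=1)  # RGB -> (dx, dy)

    def forward(self, image):
        return self.net(image)

class ActionHead(nn.Module):
    def __init__(self, action_dim=7):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(2, action_dim)

    def forward(self, flow):
        return self.fc(self.pool(flow).flatten(1))

image = torch.randn(1, 3, 64, 64)
flow = PixelMotionModel()(image)       # high-level motion intent, per pixel
action = ActionHead()(flow)            # low-level command decoded from motion
print(flow.shape, action.shape)        # (1, 2, 64, 64) (1, 7)
```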
arXiv Detail & Related papers (2025-09-26T17:59:59Z)
- Vidar: Embodied Video Diffusion Model for Generalist Manipulation [28.216910600346512]
Vidar is a prior-driven, low-shot adaptation paradigm that replaces most embodiment-specific data with transferable video priors. Our results suggest a scalable recipe for "one prior, many embodiments": strong, inexpensive video priors + minimal on-robot alignment.
arXiv Detail & Related papers (2025-07-17T08:31:55Z)
- Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations [19.28925489415787]
RIGVid enables robots to perform complex manipulation tasks by imitating AI-generated videos. A video diffusion model generates potential demonstration videos, and a vision-language model automatically filters out results that do not follow the command. A 6D pose tracker then extracts object trajectories from the video, and the trajectories are retargeted to the robot in an embodiment-agnostic fashion.
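The four-step pipeline in this summary is worth writing out as a skeleton. Every function below is a stub standing in for a real model (video diffusion, VLM filter, 6D pose tracker, retargeting); all names are hypothetical.
```python
# Skeleton of a RIGVid-like pipeline: generate -> filter -> track -> retarget.
def generate_candidate_videos(command: str, n: int = 4) -> list:
    return [f"video_{i}_for:{command}" for i in range(n)]        # stub generator

def vlm_follows_command(video, command: str) -> bool:
    return True                                                  # stub VLM filter

def track_object_pose(video) -> list:
    return [(0.0, 0.0, 0.0, 0.0, 0.0, 0.0)]                      # stub 6D poses

def retarget_to_robot(trajectory: list) -> list:
    return trajectory                                            # stub, embodiment-agnostic

def rigvid_like_pipeline(command: str) -> list:
    videos = generate_candidate_videos(command)
    kept = [v for v in videos if vlm_follows_command(v, command)]
    trajectory = track_object_pose(kept[0])       # 6D object trajectory
    return retarget_to_robot(trajectory)          # commands for the robot

print(rigvid_like_pipeline("pour water into the mug"))
```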
arXiv Detail & Related papers (2025-07-01T17:39:59Z)
- ORV: 4D Occupancy-centric Robot Video Generation [33.360345403049685]
Acquiring real-world robotic simulation data through teleoperation is notoriously time-consuming and labor-intensive. We propose ORV, an Occupancy-centric Robot Video generation framework, which utilizes 4D semantic occupancy sequences as a fine-grained representation. By leveraging occupancy-based representations, ORV enables seamless translation of simulation data into photorealistic robot videos, while ensuring high temporal consistency and precise controllability.
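A 4D semantic occupancy sequence is, at heart, a time-indexed voxel grid of class ids. The minimal container below shows the shape conventions such a representation implies; names and sizes are assumptions, not ORV's data format.
```python
# Minimal 4D semantic occupancy container: (T, X, Y, Z) grid of class ids.
import numpy as np

class SemanticOccupancySequence:
    def __init__(self, t: int, x: int, y: int, z: int, num_classes: int):
        self.grid = np.zeros((t, x, y, z), dtype=np.int16)  # 0 = empty
        self.num_classes = num_classes

    def set_voxel(self, t, xyz, class_id):
        assert 0 <= class_id < self.num_classes
        self.grid[(t, *xyz)] = class_id

    def occupied_at(self, t):
        return np.argwhere(self.grid[t] > 0)   # (N, 3) occupied voxel coords

seq = SemanticOccupancySequence(t=8, x=32, y=32, z=16, num_classes=10)
seq.set_voxel(0, (5, 5, 2), class_id=3)        # e.g. "gripper" at frame 0
print(seq.occupied_at(0))
```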
arXiv Detail & Related papers (2025-06-03T17:00:32Z)
- Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control [72.00655365269]
We present RoboMaster, a novel framework that models inter-object dynamics through a collaborative trajectory formulation. Unlike prior methods that decompose objects, our core idea is to decompose the interaction process into three sub-stages: pre-interaction, interaction, and post-interaction. Our method outperforms existing approaches, establishing new state-of-the-art performance in trajectory-controlled video generation for robotic manipulation.
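The three-sub-stage decomposition can be illustrated with a toy segmentation: split a clip by when the gripper is in contact with the object. The contact signal (a distance threshold) and the inputs below are illustrative assumptions, not RoboMaster's method.
```python
# Toy pre-/inter-/post-interaction segmentation from a contact heuristic.
import numpy as np

def segment_interaction(gripper_xy, object_xy, contact_dist=0.05):
    d = np.linalg.norm(gripper_xy - object_xy, axis=-1)
    in_contact = d < contact_dist
    if not in_contact.any():
        return {"pre": slice(0, len(d)), "inter": None, "post": None}
    first, last = np.flatnonzero(in_contact)[[0, -1]]
    return {"pre": slice(0, first),
            "inter": slice(first, last + 1),
            "post": slice(last + 1, len(d))}

T = 10
gripper = np.linspace([0, 0], [1, 1], T)       # gripper approaches ...
obj = np.tile([0.5, 0.5], (T, 1))              # ... a static object
print(segment_interaction(gripper, obj, contact_dist=0.2))
```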
arXiv Detail & Related papers (2025-06-02T17:57:06Z)
- Ctrl-V: Higher Fidelity Video Generation with Bounding-Box Controlled Object Motion [8.068194154084967]
This paper tackles the challenge of exerting precise control over object motion for realistic video synthesis. To accomplish this, we control object movements using bounding boxes and extend this control to the renderings of 2D or 3D boxes in pixel space. Our method, Ctrl-V, leverages modified and fine-tuned Stable Video Diffusion (SVD) models to solve both trajectory and video generation.
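Rendering box trajectories in pixel space, as described above, amounts to rasterizing one control image per frame. The sketch below draws 2D box outlines into per-frame masks; it is a hedged illustration of the idea, not Ctrl-V's renderer.
```python
# Rasterize a (T, 4) bounding-box trajectory into per-frame control masks.
import numpy as np

def render_box_frames(boxes, h=64, w=64):
    """boxes: (T, 4) array of (x1, y1, x2, y2) per frame -> (T, h, w) masks."""
    frames = np.zeros((len(boxes), h, w), dtype=np.uint8)
    for t, (x1, y1, x2, y2) in enumerate(boxes.astype(int)):
        frames[t, y1:y2, x1] = 255      # left edge
        frames[t, y1:y2, x2] = 255      # right edge
        frames[t, y1, x1:x2] = 255      # top edge
        frames[t, y2, x1:x2 + 1] = 255  # bottom edge (closes the rectangle)
    return frames

boxes = np.array([[10, 10, 20, 20], [14, 12, 24, 22]])   # box moving right/down
print(render_box_frames(boxes).sum(axis=(1, 2)))          # outline mass per frame
```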
arXiv Detail & Related papers (2024-06-09T03:44:35Z)
- Learning Dynamic View Synthesis With Few RGBD Cameras [60.36357774688289]
We propose to utilize RGBD cameras to synthesize free-viewpoint videos of dynamic indoor scenes.
We generate point clouds from RGBD frames and then render them into free-viewpoint videos via a neural renderer.
We introduce a simple Regional Depth-Inpainting module that adaptively inpaints missing depth values to render complete novel views.
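The first step of this pipeline, turning an RGBD frame into a point cloud, is standard pinhole unprojection; the valid-depth mask below also hints at why a depth-inpainting module is needed for complete views. Intrinsics are made up for the example.
```python
# Standard pinhole unprojection of an RGBD frame into a colored point cloud.
import numpy as np

def rgbd_to_pointcloud(rgb, depth, fx, fy, cx, cy):
    """rgb: (H, W, 3), depth: (H, W) in meters -> (N, 6) xyz + rgb points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    valid = z > 0                              # skip missing depth pixels
    xyz = np.stack([x[valid], y[valid], z[valid]], axis=-1)
    return np.concatenate([xyz, rgb[valid]], axis=-1)

rgb = np.random.rand(48, 64, 3)
depth = np.full((48, 64), 1.5)
depth[:10] = 0.0                               # simulate missing depth rows
pts = rgbd_to_pointcloud(rgb, depth, fx=60.0, fy=60.0, cx=32.0, cy=24.0)
print(pts.shape)   # (N, 6): every pixel with valid depth becomes a point
```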
arXiv Detail & Related papers (2022-04-22T03:17:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.