Pixel Motion Diffusion is What We Need for Robot Control
- URL: http://arxiv.org/abs/2509.22652v1
- Date: Fri, 26 Sep 2025 17:59:59 GMT
- Title: Pixel Motion Diffusion is What We Need for Robot Control
- Authors: E-Ro Nguyen, Yichi Zhang, Kanchana Ranasinghe, Xiang Li, Michael S. Ryoo
- Abstract summary: DAWN is a unified diffusion-based framework for language-conditioned robotic manipulation. It bridges high-level motion intent and low-level robot action via structured pixel motion representation. DAWN achieves state-of-the-art results on the challenging CALVIN benchmark.
- Score: 38.925028601732116
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present DAWN (Diffusion is All We Need for robot control), a unified diffusion-based framework for language-conditioned robotic manipulation that bridges high-level motion intent and low-level robot action via structured pixel motion representation. In DAWN, both the high-level and low-level controllers are modeled as diffusion processes, yielding a fully trainable, end-to-end system with interpretable intermediate motion abstractions. DAWN achieves state-of-the-art results on the challenging CALVIN benchmark, demonstrating strong multi-task performance, and further validates its effectiveness on MetaWorld. Despite the substantial domain gap between simulation and reality and limited real-world data, we demonstrate reliable real-world transfer with only minimal finetuning, illustrating the practical viability of diffusion-based motion abstractions for robotic control. Our results show the effectiveness of combining diffusion modeling with motion-centric representations as a strong baseline for scalable and robust robot learning. Project page: https://nero1342.github.io/DAWN/
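To make the two-level design concrete, below is a minimal, illustrative sketch of a pipeline in this spirit: a high-level diffusion model denoises a pixel-motion representation conditioned on the image and the language instruction, and a low-level diffusion model denoises robot actions conditioned on that pixel motion. All module names, tensor shapes, the denoising schedule, and step counts are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch of a two-stage, language-conditioned diffusion pipeline in the
# spirit of DAWN (high-level pixel-motion diffusion -> low-level action diffusion).
# Module names, shapes, schedules, and step counts are illustrative assumptions.
import torch
import torch.nn as nn


class DenoisingMLP(nn.Module):
    """Predicts the noise added to `x` given conditioning `cond` and timestep `t`."""

    def __init__(self, x_dim: int, cond_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + cond_dim + 1, hidden),
            nn.ReLU(),
            nn.Linear(hidden, x_dim),
        )

    def forward(self, x: torch.Tensor, cond: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([x, cond, t], dim=-1))


@torch.no_grad()
def sample(model: nn.Module, cond: torch.Tensor, x_dim: int, steps: int = 20) -> torch.Tensor:
    """Very simplified ancestral-style sampling loop (placeholder schedule)."""
    x = torch.randn(cond.shape[0], x_dim)
    for step in reversed(range(steps)):
        t = torch.full((cond.shape[0], 1), step / steps)
        eps = model(x, cond, t)                 # predicted noise
        x = x - eps / steps                     # crude denoising update (illustrative only)
        if step > 0:
            x = x + 0.01 * torch.randn_like(x)  # small amount of re-noising
    return x


# Assumed dimensionalities (placeholders).
IMG_DIM, TXT_DIM, MOTION_DIM, ACTION_DIM = 128, 64, 32, 7

# High-level controller: image + language -> pixel-motion representation.
high_level = DenoisingMLP(x_dim=MOTION_DIM, cond_dim=IMG_DIM + TXT_DIM)
# Low-level controller: image + pixel motion -> robot action.
low_level = DenoisingMLP(x_dim=ACTION_DIM, cond_dim=IMG_DIM + MOTION_DIM)

image_feat = torch.randn(1, IMG_DIM)  # stand-in for an encoded camera observation
text_feat = torch.randn(1, TXT_DIM)   # stand-in for an encoded language instruction

pixel_motion = sample(high_level, torch.cat([image_feat, text_feat], dim=-1), MOTION_DIM)
action = sample(low_level, torch.cat([image_feat, pixel_motion], dim=-1), ACTION_DIM)
print(action.shape)  # torch.Size([1, 7]) -- e.g., end-effector pose + gripper
```

The point of the two-stage structure is that the intermediate pixel-motion tensor acts as an interpretable motion abstraction that can be inspected or supervised independently of the action decoder.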
Related papers
- D-REX: Differentiable Real-to-Sim-to-Real Engine for Learning Dexterous Grasping [66.22412592525369]
We introduce a real-to-sim-to-real engine that leverages Gaussian Splatting representations to build a differentiable engine. We show that our engine achieves accurate and robust mass identification across various object geometries and mass values. These optimized mass values facilitate force-aware policy learning, achieving superior performance in object grasping.
arXiv Detail & Related papers (2026-03-01T15:32:04Z)
- mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs [5.109732854501585]
We introduce mimic-video, a novel Video-Action Model (VAM) that pairs a pretrained Internet-scale video model with a flow matching-based action decoder conditioned on its latent representations. Our approach achieves state-of-the-art performance on simulated and real-world robotic manipulation tasks, improving sample efficiency by 10x and convergence speed by 2x compared to traditional VLA architectures.
arXiv Detail & Related papers (2025-12-17T18:47:31Z)
- StaMo: Unsupervised Learning of Generalizable Robot Motion from Compact State Representation [56.996371714721995]
We propose an unsupervised approach that learns a highly compressed two-token state representation. Our representation is efficient, interpretable, and integrates seamlessly into existing VLA-based models. We name our method StaMo for its ability to learn generalizable robotic Motion from compact State representation.
arXiv Detail & Related papers (2025-10-06T17:37:24Z)
- FreeAction: Training-Free Techniques for Enhanced Fidelity of Trajectory-to-Video Generation [50.39748673817223]
We introduce two training-free, inference-time techniques that fully exploit explicit action parameters in robot video generation. First, action-scaled classifier-free guidance dynamically modulates guidance strength in proportion to action magnitude, enhancing controllability over motion intensity (a rough sketch of this idea appears after this list). Second, action-scaled noise truncation adjusts the distribution of the initially sampled noise to better align with the desired motion dynamics.
arXiv Detail & Related papers (2025-09-29T03:30:40Z)
- Dexplore: Scalable Neural Control for Dexterous Manipulation from Reference-Scoped Exploration [58.4036440289082]
Hand-object motion-capture (MoCap) data offer large-scale, contact-rich demonstrations and hold promise for dexterous robotic manipulation. We introduce Dexplore, a unified single-loop optimization that jointly performs retargeting and tracking to learn robot control policies directly from MoCap at scale.
arXiv Detail & Related papers (2025-09-11T17:59:07Z)
- Physical Autoregressive Model for Robotic Manipulation without Action Pretraining [65.8971623698511]
We build upon autoregressive video generation models to propose a Physical Autoregressive Model (PAR). PAR leverages the world knowledge embedded in video pretraining to understand physical dynamics without requiring action pretraining. Experiments on the ManiSkill benchmark show that PAR achieves a 100% success rate on the PushCube task.
arXiv Detail & Related papers (2025-08-13T13:54:51Z)
- ViSA-Flow: Accelerating Robot Skill Learning via Large-Scale Video Semantic Action Flow [4.2766838326810355]
We present ViSA-Flow, a framework that learns skill representations from large-scale video data without manual labels. First, semantic action flow representations are automatically extracted from large-scale human-object interaction video data and used for generative pre-training. Second, this prior is efficiently adapted to a target robot by fine-tuning on a small set of robot demonstrations processed through the same semantic abstraction pipeline.
arXiv Detail & Related papers (2025-05-02T14:03:06Z)
- FAST: Efficient Action Tokenization for Vision-Language-Action Models [98.15494168962563]
We propose FAST, a new compression-based tokenization scheme for robot actions built on the discrete cosine transform. Building on FAST, we release FAST+, a universal robot action tokenizer trained on 1M real robot action trajectories.
arXiv Detail & Related papers (2025-01-16T18:57:04Z)
- RobotDiffuse: Motion Planning for Redundant Manipulator based on Diffusion Model [13.110235244912474]
Redundant manipulators offer enhanced kinematic performance and versatility. Motion planning for these manipulators is challenging due to increased DOFs and complex, dynamic environments. This paper introduces RobotDiffuse, a diffusion model-based approach for motion planning in redundant manipulators.
arXiv Detail & Related papers (2024-12-27T07:34:54Z)
- ManiCM: Real-time 3D Diffusion Policy via Consistency Model for Robotic Manipulation [38.08606358379297]
Diffusion models have been shown to be effective at generating complex distributions, from natural images to motion trajectories. Recent methods show impressive performance in 3D robotic manipulation tasks, but they suffer from severe runtime inefficiency due to the multiple denoising steps required. We propose ManiCM, a real-time robotic manipulation model that imposes a consistency constraint on the diffusion process.
arXiv Detail & Related papers (2024-06-03T17:59:23Z)
- DiffGen: Robot Demonstration Generation via Differentiable Physics Simulation, Differentiable Rendering, and Vision-Language Model [72.66465487508556]
DiffGen is a novel framework that integrates differentiable physics simulation, differentiable rendering, and a vision-language model.
It can generate realistic robot demonstrations by minimizing the distance between the embedding of the language instruction and the embedding of the simulated observation.
Experiments demonstrate that DiffGen can efficiently and effectively generate robot data with minimal human effort or training time.
arXiv Detail & Related papers (2024-05-12T15:38:17Z)
- Interactive Character Control with Auto-Regressive Motion Diffusion Models [18.727066177880708]
We propose A-MDM (Auto-regressive Motion Diffusion Model) for real-time motion synthesis.
Our conditional diffusion model takes an initial pose as input and auto-regressively generates successive motion frames conditioned on the previous frame.
We introduce a suite of techniques for incorporating interactive controls into A-MDM, such as task-oriented sampling, in-painting, and hierarchical reinforcement learning.
arXiv Detail & Related papers (2023-06-01T07:48:34Z)
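As a rough illustration of the action-scaled classifier-free guidance idea mentioned in the FreeAction entry above, here is a minimal sketch. The scaling rule, base weight, and all function and variable names are assumptions, not the paper's exact formulation.

```python
# Hedged sketch of "action-scaled" classifier-free guidance: the guidance weight
# is modulated in proportion to the magnitude of the commanded action.
# The scaling rule and hyperparameters here are illustrative assumptions.
import torch


def action_scaled_cfg(eps_uncond: torch.Tensor,
                      eps_cond: torch.Tensor,
                      action: torch.Tensor,
                      base_weight: float = 2.0,
                      sensitivity: float = 1.0) -> torch.Tensor:
    """Combine unconditional and action-conditional noise predictions with a
    guidance weight that grows with the norm of the action (per batch element)."""
    # ||action|| per sample, reshaped to broadcast against the noise tensors.
    action_mag = action.flatten(1).norm(dim=1).view(-1, *([1] * (eps_cond.dim() - 1)))
    weight = base_weight * (1.0 + sensitivity * action_mag)
    return eps_uncond + weight * (eps_cond - eps_uncond)


# Toy usage with video-latent-shaped tensors (shapes are placeholders).
eps_u = torch.randn(2, 4, 8, 8)  # unconditional noise prediction
eps_c = torch.randn(2, 4, 8, 8)  # action-conditional noise prediction
act = torch.tensor([[0.01, 0.0, 0.0], [0.5, 0.3, -0.2]])  # small vs. large motion
guided = action_scaled_cfg(eps_u, eps_c, act)
print(guided.shape)  # torch.Size([2, 4, 8, 8])
```

Under this kind of rule, samples with larger commanded motion receive stronger guidance toward the action-conditional prediction, which matches the summary's claim of improved control over motion intensity.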
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.