EnerVerse-AC: Envisioning Embodied Environments with Action Condition
- URL: http://arxiv.org/abs/2505.09723v1
- Date: Wed, 14 May 2025 18:30:53 GMT
- Title: EnerVerse-AC: Envisioning Embodied Environments with Action Condition
- Authors: Yuxin Jiang, Shengcong Chen, Siyuan Huang, Liliang Chen, Pengfei Zhou, Yue Liao, Xindong He, Chiming Liu, Hongsheng Li, Maoqing Yao, Guanghui Ren
- Abstract summary: EnerVerse-AC is an action-conditional world model that generates future visual observations based on an agent's predicted actions. EVAC augments human-collected trajectories into diverse datasets and generates realistic, action-conditioned video observations for policy testing.
- Score: 47.97500109323355
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Robotic imitation learning has advanced from solving static tasks to addressing dynamic interaction scenarios, but testing and evaluation remain costly and challenging due to the need for real-time interaction with dynamic environments. We propose EnerVerse-AC (EVAC), an action-conditional world model that generates future visual observations based on an agent's predicted actions, enabling realistic and controllable robotic inference. Building on prior architectures, EVAC introduces a multi-level action-conditioning mechanism and ray map encoding for dynamic multi-view image generation while expanding training data with diverse failure trajectories to improve generalization. As both a data engine and evaluator, EVAC augments human-collected trajectories into diverse datasets and generates realistic, action-conditioned video observations for policy testing, eliminating the need for physical robots or complex simulations. This approach significantly reduces costs while maintaining high fidelity in robotic manipulation evaluation. Extensive experiments validate the effectiveness of our method. Code, checkpoints, and datasets can be found at <https://annaj2178.github.io/EnerverseAC.github.io>.
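The evaluation loop the abstract describes, in which a policy proposes an action and the world model returns the resulting observation, can be sketched as follows. This is a minimal toy illustration, not EVAC's implementation: `toy_world_model`, `toy_policy`, `rollout`, and `reached_target` are hypothetical names, and a pair of numbers stands in for the video frames a real model would generate.

```python
from typing import Callable, List, Tuple

# Assumption: a toy observation (gripper_x, target_x) replaces real image frames.
Observation = Tuple[float, float]
# Assumption: a toy scalar action replaces a real end-effector command.
Action = float


def toy_world_model(obs: Observation, action: Action) -> Observation:
    """Hypothetical stand-in for an action-conditional world model.

    Predicts the next observation from the current one and the action.
    """
    gripper_x, target_x = obs
    return (gripper_x + action, target_x)


def toy_policy(obs: Observation) -> Action:
    """Hypothetical policy: move toward the target, at most 0.25 per step."""
    gripper_x, target_x = obs
    delta = target_x - gripper_x
    return max(-0.25, min(0.25, delta))


def rollout(
    policy: Callable[[Observation], Action],
    world_model: Callable[[Observation, Action], Observation],
    first_obs: Observation,
    horizon: int,
) -> List[Observation]:
    """Closed loop: the policy acts on observations the world model predicts."""
    obs = first_obs
    trajectory = [obs]
    for _ in range(horizon):
        action = policy(obs)            # policy predicts the next action
        obs = world_model(obs, action)  # world model predicts the next observation
        trajectory.append(obs)
    return trajectory


def reached_target(obs: Observation, tol: float = 1e-6) -> bool:
    """Toy success check applied to the final predicted observation."""
    return abs(obs[0] - obs[1]) <= tol


trajectory = rollout(toy_policy, toy_world_model, (0.0, 1.0), horizon=5)
success = reached_target(trajectory[-1])
```

In this sketch no physical robot or simulator is involved: the policy only ever sees observations produced by the model, which is the property that makes such a model usable as an evaluator.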
Related papers
- ORV: 4D Occupancy-centric Robot Video Generation [33.360345403049685]
Acquiring real-world robotic simulation data through teleoperation is notoriously time-consuming and labor-intensive. We propose ORV, an Occupancy-centric Robot Video generation framework, which utilizes 4D semantic occupancy sequences as a fine-grained representation. By leveraging occupancy-based representations, ORV enables seamless translation of simulation data into photorealistic robot videos, while ensuring high temporal consistency and precise controllability.
arXiv Detail & Related papers (2025-06-03T17:00:32Z)
- Evaluating Robot Policies in a World Model [54.874926065292904]
We investigate World-model-based Policy Evaluation (WPE). WPE achieves high fidelity in mimicking robot arm movements as seen in real videos. We show that WPE can serve as a starting point for evaluating robot policies before real-world deployment.
arXiv Detail & Related papers (2025-05-31T15:51:56Z)
- Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy [56.424032454461695]
We present Dita, a scalable framework that leverages Transformer architectures to directly denoise continuous action sequences. Dita employs in-context conditioning, enabling fine-grained alignment between denoised actions and raw visual tokens from historical observations. Dita effectively integrates cross-embodiment datasets across diverse camera perspectives, observation scenes, tasks, and action spaces.
arXiv Detail & Related papers (2025-03-25T15:19:56Z)
- ConditionNET: Learning Preconditions and Effects for Execution Monitoring [9.64001633229156]
ConditionNET is an approach for learning the preconditions and effects of actions in a fully data-driven manner. We show in experiments that ConditionNET outperforms all baselines on both anomaly detection and phase prediction tasks. Our results highlight the potential of ConditionNET for enhancing the reliability and adaptability of robots in real-world environments.
arXiv Detail & Related papers (2025-02-03T09:00:45Z)
- Bench2Drive-R: Turning Real World Data into Reactive Closed-Loop Autonomous Driving Benchmark by Generative Model [63.336123527432136]
We introduce Bench2Drive-R, a generative framework that enables reactive closed-loop evaluation. Unlike existing video generative models for autonomous driving, the proposed designs are tailored for interactive simulation. We compare the generation quality of Bench2Drive-R with existing generative models and achieve state-of-the-art performance.
arXiv Detail & Related papers (2024-12-11T06:35:18Z)
- Diffusion Transformer Policy [48.50988753948537]
We propose a large multi-modal diffusion transformer, dubbed Diffusion Transformer Policy, to model continuous end-effector actions. By leveraging the scaling capability of transformers, the proposed approach can effectively model continuous end-effector actions across large, diverse robot datasets.
arXiv Detail & Related papers (2024-10-21T12:43:54Z)
- IRASim: A Fine-Grained World Model for Robot Manipulation [24.591694756757278]
We present IRASim, a novel world model capable of generating videos with fine-grained robot-object interaction details. We train a diffusion transformer and introduce a novel frame-level action-conditioning module within each transformer block to explicitly model and strengthen the action-frame alignment.
arXiv Detail & Related papers (2024-06-20T17:50:16Z)
- Learning Manipulation by Predicting Interaction [85.57297574510507]
We propose a general pre-training pipeline that learns Manipulation by Predicting the Interaction (MPI).
The experimental results demonstrate that MPI achieves a remarkable improvement of 10% to 64% over the previous state of the art on real-world robot platforms.
arXiv Detail & Related papers (2024-06-01T13:28:31Z)
- Nonprehensile Riemannian Motion Predictive Control [57.295751294224765]
We introduce a novel Real-to-Sim reward analysis technique to reliably imagine and predict the outcome of taking possible actions for a real robotic platform.
We produce a closed-loop controller to reactively push objects in a continuous action space.
We observe that RMPC is robust in cluttered as well as occluded environments and outperforms the baselines.
arXiv Detail & Related papers (2021-11-15T18:50:04Z)
- CLOUD: Contrastive Learning of Unsupervised Dynamics [19.091886595825947]
We propose to learn forward and inverse dynamics in a fully unsupervised manner via contrastive estimation.
We demonstrate the efficacy of our approach across a variety of tasks including goal-directed planning and imitation from observations.
arXiv Detail & Related papers (2020-10-23T15:42:57Z)
- Hindsight for Foresight: Unsupervised Structured Dynamics Models from Physical Interaction [24.72947291987545]
A key challenge for an agent learning to interact with the world is to reason about the physical properties of objects.
We propose a novel approach for modeling the dynamics of a robot's interactions directly from unlabeled 3D point clouds and images.
arXiv Detail & Related papers (2020-08-02T11:04:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.