EBT-Policy: Energy Unlocks Emergent Physical Reasoning Capabilities
- URL: http://arxiv.org/abs/2510.27545v1
- Date: Fri, 31 Oct 2025 15:21:05 GMT
- Title: EBT-Policy: Energy Unlocks Emergent Physical Reasoning Capabilities
- Authors: Travis Davies, Yiqi Huang, Alexi Gladstone, Yunxin Liu, Xiang Chen, Heng Ji, Huxian Liu, Luhui Hu,
- Abstract summary: Implicit policies parameterized by generative models, such as Diffusion Policy, often suffer from high computational cost, exposure bias, and unstable inference dynamics. We introduce a new energy-based architecture, EBT-Policy, that solves core issues in robotic and real-world settings. EBT-Policy consistently outperforms diffusion-based policies, while requiring less training and inference computation.
- Score: 41.02333103120137
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Implicit policies parameterized by generative models, such as Diffusion Policy, have become the standard for policy learning and Vision-Language-Action (VLA) models in robotics. However, these approaches often suffer from high computational cost, exposure bias, and unstable inference dynamics, which lead to divergence under distribution shifts. Energy-Based Models (EBMs) address these issues by learning energy landscapes end-to-end and modeling equilibrium dynamics, offering improved robustness and reduced exposure bias. Yet, policies parameterized by EBMs have historically struggled to scale effectively. Recent work on Energy-Based Transformers (EBTs) demonstrates the scalability of EBMs to high-dimensional spaces, but their potential for solving core challenges in physically embodied models remains underexplored. We introduce a new energy-based architecture, EBT-Policy, that solves core issues in robotic and real-world settings. Across simulated and real-world tasks, EBT-Policy consistently outperforms diffusion-based policies, while requiring less training and inference computation. Remarkably, on some tasks it converges within just two inference steps, a 50x reduction compared to Diffusion Policy's 100. Moreover, EBT-Policy exhibits emergent capabilities not seen in prior models, such as zero-shot recovery from failed action sequences using only behavior cloning and without explicit retry training. By leveraging its scalar energy for uncertainty-aware inference and dynamic compute allocation, EBT-Policy offers a promising path toward robust, generalizable robot behavior under distribution shifts.
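The abstract describes inference as iterative minimization over an energy landscape, with the scalar energy doubling as a confidence signal for early stopping and dynamic compute allocation. A minimal sketch of that inference loop is below; the quadratic toy energy, the observation-to-action mapping, and all hyperparameters are illustrative assumptions, not the paper's learned model:

```python
import numpy as np

def energy(obs, action):
    # Toy quadratic energy with its minimum at the "expert" action
    # implied by the observation (a hypothetical stand-in for a
    # learned EBT energy head).
    target = 0.5 * obs  # assumed obs-to-action mapping, for illustration
    return float(np.sum((action - target) ** 2))

def grad_energy(obs, action):
    # Analytic gradient of the toy energy with respect to the action.
    return 2.0 * (action - 0.5 * obs)

def ebm_inference(obs, step_size=0.4, max_steps=100, energy_tol=1e-3):
    """Gradient-descent inference on the energy landscape.

    The scalar energy acts as an uncertainty signal: once it drops
    below `energy_tol`, inference stops early, allocating compute
    dynamically in the spirit of the abstract's description.
    """
    action = np.zeros_like(obs)          # initial action proposal
    for step in range(1, max_steps + 1):
        e = energy(obs, action)
        if e < energy_tol:               # low energy -> confident, stop
            return action, e, step
        action = action - step_size * grad_energy(obs, action)
    return action, energy(obs, action), max_steps
```

On easy inputs this loop terminates in a handful of steps rather than the full budget, which is the mechanism behind the "2 inference steps vs. Diffusion Policy's 100" comparison, though the real model's step counts depend on the learned landscape.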
Related papers
- Manifold-Constrained Energy-Based Transition Models for Offline Reinforcement Learning [13.92596311376194]
We train conditional energy-based transition models using a manifold projection-diffusion negative sampler. MC-ETM learns a latent manifold of next states and generates near-manifold hard negatives. We formalize MC-ETM through a hybrid pessimistic MDP formulation and derive a conservative performance bound separating in-support evaluation error from truncation risk.
arXiv Detail & Related papers (2026-02-02T23:15:43Z)
- Aligning Agentic World Models via Knowledgeable Experience Learning [68.85843641222186]
We introduce WorldMind, a framework that constructs a symbolic World Knowledge Repository by synthesizing environmental feedback. WorldMind achieves superior performance compared to baselines with remarkable cross-model and cross-environment transferability.
arXiv Detail & Related papers (2026-01-19T17:33:31Z)
- Policy-Based Reinforcement Learning with Action Masking for Dynamic Job Shop Scheduling under Uncertainty: Handling Random Arrivals and Machine Failures [3.2880869992413246]
We present a novel framework for solving Dynamic Job Shop Scheduling Problems under uncertainty. Our approach follows a model-based paradigm, using Coloured Timed Petri Nets to represent the scheduling environment. We conduct experiments on dynamic JSSP benchmarks, demonstrating that our method consistently outperforms traditional minimization and rule-based approaches in terms of makespan.
arXiv Detail & Related papers (2026-01-14T08:53:46Z)
- BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping [69.74252624161652]
We propose BAlanced Policy Optimization with Adaptive Clipping (BAPO). BAPO dynamically adjusts clipping bounds to adaptively re-balance positive and negative contributions, preserve entropy, and stabilize RL optimization. On AIME 2024 and AIME 2025 benchmarks, our 7B BAPO model surpasses open-source counterparts such as SkyWork-OR1-7B.
arXiv Detail & Related papers (2025-10-21T12:55:04Z)
- Offline Robotic World Model: Learning Robotic Policies without a Physics Simulator [50.191655141020505]
Reinforcement Learning (RL) has demonstrated impressive capabilities in robotic control but remains challenging due to high sample complexity, safety concerns, and the sim-to-real gap. We introduce Offline Robotic World Model (RWM-O), a model-based approach that explicitly estimates uncertainty to improve policy learning without reliance on a physics simulator.
arXiv Detail & Related papers (2025-04-23T12:58:15Z)
- Enhancing Cyber-Resilience in Integrated Energy System Scheduling with Demand Response Using Deep Reinforcement Learning [11.223780653355437]
This paper proposes an innovative model-free resilience scheduling method based on state-adversarial deep reinforcement learning (DRL).
The proposed method designs an IDR program to explore the interaction ability of electricity-gas-heat flexible loads.
The state-adversarial soft actor-critic (SA-SAC) algorithm is proposed to mitigate the impact of cyber-attacks on the scheduling strategy.
arXiv Detail & Related papers (2023-11-28T23:29:36Z)
- Revisiting Energy Based Models as Policies: Ranking Noise Contrastive Estimation and Interpolating Energy Models [18.949193683555237]
In this work, we revisit the choice of energy-based models (EBM) as a policy class.
We develop a training objective and algorithm for energy models which combines several key ingredients.
We show that the Implicit Behavior Cloning (IBC) objective is actually biased even at the population level.
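As a rough illustration of the ranking noise-contrastive estimation this entry refers to, the sketch below scores an expert action against sampled negative actions via an InfoNCE-style softmax over negated energies; the function and its toy inputs are assumptions for illustration, not the paper's exact objective:

```python
import numpy as np

def ranking_nce_loss(energy_pos, energies_neg):
    """InfoNCE-style ranking loss for an energy-based policy.

    energy_pos:   scalar energy of the expert (demonstrated) action
    energies_neg: array of energies for sampled counterfactual actions

    Minimizing this loss pushes the expert action's energy below the
    negatives', so the expert wins the softmax ranking.
    """
    logits = -np.concatenate(([energy_pos], energies_neg))
    logits = logits - logits.max()       # numerical stability
    log_z = np.log(np.exp(logits).sum()) # log partition over candidates
    return float(-(logits[0] - log_z))   # -log P(expert | candidates)
```

The loss shrinks as the expert's energy drops relative to the negatives', which is the basic mechanism such ranking objectives exploit; how negatives are sampled and interpolated is where the cited paper's contributions lie.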
arXiv Detail & Related papers (2023-09-11T20:13:47Z)
- Learning Robust Policy against Disturbance in Transition Dynamics via State-Conservative Policy Optimization [63.75188254377202]
Deep reinforcement learning algorithms can perform poorly in real-world tasks due to discrepancy between source and target environments.
We propose a novel model-free actor-critic algorithm to learn robust policies without modeling the disturbance in advance.
Experiments in several robot control tasks demonstrate that SCPO learns robust policies against the disturbance in transition dynamics.
arXiv Detail & Related papers (2021-12-20T13:13:05Z)
- Enforcing Policy Feasibility Constraints through Differentiable Projection for Energy Optimization [57.88118988775461]
We propose PROjected Feasibility (PROF) to enforce convex operational constraints within neural policies.
We demonstrate PROF on two applications: energy-efficient building operation and inverter control.
arXiv Detail & Related papers (2021-05-19T01:58:10Z)
- Strictly Batch Imitation Learning by Energy-based Distribution Matching [104.33286163090179]
Consider learning a policy purely on the basis of demonstrated behavior -- that is, with no access to reinforcement signals, no knowledge of transition dynamics, and no further interaction with the environment.
One solution is simply to retrofit existing algorithms for apprenticeship learning to work in the offline setting.
But such an approach leans heavily on off-policy evaluation or offline model estimation, and can be indirect and inefficient.
We argue that a good solution should be able to explicitly parameterize a policy, implicitly learn from rollout dynamics, and operate in an entirely offline fashion.
arXiv Detail & Related papers (2020-06-25T03:27:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.