Enter the Void - Planning to Seek Entropy When Reward is Scarce
- URL: http://arxiv.org/abs/2505.16787v2
- Date: Tue, 27 May 2025 12:28:55 GMT
- Title: Enter the Void - Planning to Seek Entropy When Reward is Scarce
- Authors: Ashish Sundar, Chunbo Luo, Xiaoyang Wang
- Abstract summary: We propose a novel approach that anticipates and actively seeks out high-entropy states using short-horizon latent predictions. We present a hierarchical planner that dynamically decides when to replan, the planning horizon length, and the weighting between reward and entropy. Our method finishes the Miniworld procedurally generated mazes 50% faster than base Dreamer at convergence, and the policy trained in imagination converges in only 60% of the environment steps that base Dreamer needs.
- Score: 6.208654695856247
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Model-based reinforcement learning (MBRL) offers an intuitive way to increase the sample efficiency of model-free RL methods by simultaneously training a world model that learns to predict the future. MBRL methods have progressed largely by prioritising the actor, while optimising the world model's learning has been comparatively neglected. Improving the fidelity of the world model and reducing its time to convergence can yield significant downstream benefits, one of which is improving the ensuing performance of any actor it may train. We propose a novel approach that anticipates and actively seeks out high-entropy states using short-horizon latent predictions generated by the world model, offering a principled alternative to traditional curiosity-driven methods that chase once-novel states well after they were stumbled into. While many model predictive control (MPC) based methods offer similar alternatives, they typically lack commitment, synthesising multi-step plans after every step. To mitigate this, we present a hierarchical planner that dynamically decides when to replan, the planning horizon length, and the weighting between reward and entropy. While our method can theoretically be applied to any model that trains its own actors with solely model-generated data, we apply it to Dreamer as a proof of concept. Our method finishes the Miniworld procedurally generated mazes 50% faster than base Dreamer at convergence, and the policy trained in imagination converges in only 60% of the environment steps that base Dreamer needs.
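Since no reference implementation accompanies this listing, the following is a minimal sketch of the idea described in the abstract: score short-horizon latent rollouts with a weighted sum of predicted reward and predicted entropy, commit to the selected plan, and replan only when a trigger fires. Every name and interface here (`LatentWorldModel`, its `step`/`reward`/`entropy` heads, the random-shooting search, and the fixed replan threshold) is an assumption for illustration, not the authors' hierarchical planner or Dreamer's API.

```python
import numpy as np

rng = np.random.default_rng(0)


class LatentWorldModel:
    """Stand-in for a learned latent world model (e.g. Dreamer's RSSM).
    Every method here is a hypothetical placeholder, not the paper's API."""

    def __init__(self, latent_dim=8, action_dim=2):
        self.latent_dim, self.action_dim = latent_dim, action_dim
        self.W = rng.normal(size=(latent_dim, latent_dim + action_dim))

    def step(self, z, a):
        # One-step latent transition (a random linear map, purely illustrative).
        return np.tanh(self.W @ np.concatenate([z, a]))

    def reward(self, z):
        return float(z.sum())        # placeholder reward head

    def entropy(self, z):
        return float(np.var(z))      # placeholder entropy/uncertainty proxy


def score(model, z0, actions, beta):
    """Weighted sum of predicted reward and predicted entropy along a latent rollout."""
    z, total = z0, 0.0
    for a in actions:
        z = model.step(z, a)
        total += model.reward(z) + beta * model.entropy(z)
    return total


def plan(model, z0, horizon=5, n_candidates=64, beta=1.0):
    """Random-shooting search over short action sequences in latent space."""
    candidates = rng.uniform(-1, 1, size=(n_candidates, horizon, model.action_dim))
    scores = [score(model, z0, c, beta) for c in candidates]
    return candidates[int(np.argmax(scores))]


# Illustrative control loop: commit to the current plan and replan only when it
# is exhausted or the model reports high uncertainty at the current latent.
model = LatentWorldModel()
z = rng.normal(size=model.latent_dim)
plan_actions, idx = plan(model, z), 0
for t in range(20):
    if idx >= len(plan_actions) or model.entropy(z) > 1.5:
        plan_actions, idx = plan(model, z), 0
    z = model.step(z, plan_actions[idx])   # a real loop would step the environment here
    idx += 1
```

In the paper the replan decision, the horizon length, and the reward/entropy weighting are all chosen dynamically by the hierarchical planner; the fixed `horizon`, `beta`, and threshold above only mark where those decisions would plug in.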
Related papers
- SimuRA: Towards General Goal-Oriented Agent via Simulative Reasoning Architecture with LLM-Based World Model [88.04128601981145]
We introduce SimuRA, a goal-oriented architecture for generalized agentic reasoning. SimuRA overcomes the limitations of autoregressive reasoning by introducing a world model for planning via simulation. World-model-based planning, in particular, shows a consistent advantage of up to 124% over autoregressive planning.
arXiv Detail & Related papers (2025-07-31T17:57:20Z) - Intention-Conditioned Flow Occupancy Models [69.79049994662591]
Large-scale pre-training has fundamentally changed how machine learning research is done today. Applying this same framework to reinforcement learning is appealing because it offers compelling avenues for addressing core challenges in RL. Recent advances in generative AI have provided new tools for modeling highly complex distributions.
arXiv Detail & Related papers (2025-06-10T15:27:46Z) - Deep Active Inference Agents for Delayed and Long-Horizon Environments [1.693200946453174]
Active inference (AIF) agents rely on accurate immediate predictions and exhaustive planning, a limitation that is exacerbated in delayed environments. We propose a generative-policy architecture featuring a multi-step latent transition that lets the generative model predict an entire horizon in a single look-ahead. We evaluate our agent in an environment that mimics a realistic industrial scenario with delayed and long-horizon settings.
arXiv Detail & Related papers (2025-05-26T11:50:22Z) - Multimodal Dreaming: A Global Workspace Approach to World Model-Based Reinforcement Learning [2.5749046466046903]
In Reinforcement Learning (RL), world models aim to capture how the environment evolves in response to the agent's actions. We show that performing the dreaming process inside the latent space allows for training with fewer environment steps. We conclude that the combination of the Global Workspace (GW) with world models holds great potential for improving decision-making in RL agents.
arXiv Detail & Related papers (2025-02-28T15:24:17Z) - Accelerating Model-Based Reinforcement Learning with State-Space World Models [18.71404724458449]
Reinforcement learning (RL) is a powerful approach for robot learning. However, model-free RL (MFRL) requires a large number of environment interactions to learn successful control policies. We propose a new method for accelerating model-based RL using state-space world models.
arXiv Detail & Related papers (2025-02-27T15:05:25Z) - Predictive Planner for Autonomous Driving with Consistency Models [5.966385886363771]
Trajectory prediction and planning are essential for autonomous vehicles to navigate safely and efficiently in dynamic environments. Recent diffusion-based generative models have shown promise in multi-agent trajectory generation, but their slow sampling is less suitable for high-frequency planning tasks. We leverage the consistency model to build a predictive planner that samples from a joint distribution of ego and surrounding agents, conditioned on the ego vehicle's navigational goal.
arXiv Detail & Related papers (2025-02-12T00:26:01Z) - Scaling Offline Model-Based RL via Jointly-Optimized World-Action Model Pretraining [49.730897226510095]
We introduce JOWA: Jointly-Optimized World-Action model, an offline model-based RL agent pretrained on Atari games with 6 billion tokens of data. Our largest agent, with 150 million parameters, achieves 78.9% human-level performance on pretrained games using only 10% subsampled offline data, outperforming existing state-of-the-art large-scale offline RL baselines by 31.6% on average.
arXiv Detail & Related papers (2024-10-01T10:25:03Z) - Getting More Juice Out of the SFT Data: Reward Learning from Human Demonstration Improves SFT for LLM Alignment [65.15914284008973]
We propose to leverage an Inverse Reinforcement Learning (IRL) technique to simultaneously build a reward model and a policy model.
We show that the proposed algorithms converge to the stationary solutions of the IRL problem.
Our results indicate that it is beneficial to leverage reward learning throughout the entire alignment process.
arXiv Detail & Related papers (2024-05-28T07:11:05Z) - World Models via Policy-Guided Trajectory Diffusion [21.89154719069519]
Existing world models are autoregressive in that they interleave predicting the next state with sampling the next action from the policy.
We propose a novel world modelling approach that is not autoregressive and generates entire on-policy trajectories in a single pass through a diffusion model.
arXiv Detail & Related papers (2023-12-13T21:46:09Z) - When to Update Your Model: Constrained Model-based Reinforcement Learning [50.74369835934703]
We propose a novel and general theoretical scheme for a non-decreasing performance guarantee of model-based RL (MBRL).
Our follow-up derived bounds reveal the relationship between model shifts and performance improvement.
A further example demonstrates that learning models from a dynamically-varying number of explorations benefits the eventual returns.
arXiv Detail & Related papers (2022-10-15T17:57:43Z) - Simplifying Model-based RL: Learning Representations, Latent-space Models, and Policies with One Objective [142.36200080384145]
We propose a single objective which jointly optimizes a latent-space model and policy to achieve high returns while remaining self-consistent.
We demonstrate that the resulting algorithm matches or improves the sample-efficiency of the best prior model-based and model-free RL methods.
arXiv Detail & Related papers (2022-09-18T03:51:58Z) - Evolutionary Planning in Latent Space [7.863826008567604]
Planning is a powerful approach to reinforcement learning with several desirable properties.
We learn a world model that enables Evolutionary Planning in Latent Space.
We show how to build a model of the world by bootstrapping it with rollouts from a random policy and iteratively refining it with rollouts from an increasingly accurate planning policy (a minimal sketch of this loop appears after this list).
arXiv Detail & Related papers (2020-11-23T09:21:30Z) - Ready Policy One: World Building Through Active Learning [35.358315617358976]
We introduce Ready Policy One (RP1), a framework that views Model-Based Reinforcement Learning as an active learning problem.
RP1 achieves this by utilizing a hybrid objective function, which crucially adapts during optimization.
We rigorously evaluate our method on a variety of continuous control tasks, and demonstrate statistically significant gains over existing approaches.
arXiv Detail & Related papers (2020-02-07T09:57:53Z)
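The bootstrapping loop referenced under "Evolutionary Planning in Latent Space" above can be made concrete with a minimal sketch. Everything here is assumed for illustration: the toy one-dimensional environment, the linear least-squares world model, and the simple evolutionary search over action sequences stand in for that paper's learned latent dynamics and rolling-horizon planner; none of the names or interfaces come from the original work.

```python
import numpy as np

rng = np.random.default_rng(1)


class ToyEnv:
    """Tiny stand-in environment: 1-D position control toward a goal at x = 1.
    Purely illustrative; the original paper uses learned latent dynamics of real tasks."""

    def reset(self):
        self.x = rng.uniform(-1, 1)
        return np.array([self.x])

    def step(self, a):
        self.x = np.clip(self.x + 0.1 * float(a), -2, 2)
        reward = -abs(self.x - 1.0)
        return np.array([self.x]), reward


class LinearModel:
    """Hypothetical world model: least-squares fit of next state and reward."""

    def fit(self, states, actions, next_states, rewards):
        X = np.concatenate([states, actions], axis=1)
        X = np.concatenate([X, np.ones((len(X), 1))], axis=1)        # bias column
        Y = np.concatenate([next_states, rewards[:, None]], axis=1)
        self.W, *_ = np.linalg.lstsq(X, Y, rcond=None)

    def rollout_return(self, s, plan):
        total = 0.0
        for a in plan:
            out = np.concatenate([s, [a], [1.0]]) @ self.W
            s, total = out[:-1], total + out[-1]
        return total


def evolutionary_plan(model, s, horizon=10, pop=32, gens=5, sigma=0.3):
    """Simple evolutionary search over action sequences inside the model."""
    plans = rng.uniform(-1, 1, size=(pop, horizon))
    for _ in range(gens):
        fitness = np.array([model.rollout_return(s, p) for p in plans])
        elite = plans[np.argsort(fitness)[-pop // 4:]]               # keep the best quarter
        plans = np.clip(np.repeat(elite, 4, axis=0)
                        + sigma * rng.normal(size=(pop, horizon)), -1, 1)
    return plans[np.argmax([model.rollout_return(s, p) for p in plans])]


def collect(env, model, episodes=5, steps=30, random_policy=True):
    """Gather transitions with either a random policy or the current planning policy."""
    data = []
    for _ in range(episodes):
        s = env.reset()
        for _ in range(steps):
            a = rng.uniform(-1, 1) if random_policy else evolutionary_plan(model, s)[0]
            s2, r = env.step(a)
            data.append((s, [a], s2, r))
            s = s2
    S, A, S2, R = (np.array(x) for x in zip(*data))
    return S, A, S2, R


# Bootstrap: random rollouts -> fit model -> plan with it -> refine with planner rollouts.
env, model = ToyEnv(), LinearModel()
model.fit(*collect(env, model, random_policy=True))
for _ in range(3):
    model.fit(*collect(env, model, random_policy=False))
```

The structure that matters is the alternation: rollouts from an increasingly competent planning policy are fed back to refit the model, which in turn improves the planner.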
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.