Off-Policy Maximum Entropy RL with Future State and Action Visitation Measures
- URL: http://arxiv.org/abs/2412.06655v1
- Date: Mon, 09 Dec 2024 16:56:06 GMT
- Title: Off-Policy Maximum Entropy RL with Future State and Action Visitation Measures
- Authors: Adrien Bolland, Gaspard Lambrechts, Damien Ernst
- Abstract summary: We introduce a new maximum entropy reinforcement learning framework based on the distribution of states and actions visited by a policy. For each state and action, this intrinsic reward is the relative entropy of the discounted distribution of states and actions visited during the next time steps.
- Score: 1.75493501156941
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We introduce a new maximum entropy reinforcement learning framework based on the distribution of states and actions visited by a policy. More precisely, an intrinsic reward function is added to the reward function of the Markov decision process to be controlled. For each state and action, this intrinsic reward is the relative entropy of the discounted distribution of states and actions (or features from these states and actions) visited during the next time steps. We first prove that an optimal exploration policy, which maximizes the expected discounted sum of intrinsic rewards, is also a policy that maximizes a lower bound on the state-action value function of the decision process under some assumptions. We also prove that the visitation distribution used in the intrinsic reward definition is the fixed point of a contraction operator. We then describe how to adapt existing algorithms to learn this fixed point and compute the intrinsic rewards to enhance exploration. Finally, we introduce a new practical off-policy maximum entropy reinforcement learning algorithm. Empirically, exploration policies have good state-action space coverage, and high-performing control policies are computed efficiently.
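The fixed-point claim in the abstract can be illustrated on a small tabular problem. The sketch below is not the paper's algorithm. It assumes a finite MDP with a known transition tensor `P[s, a, s']` and a fixed policy `pi[s, a]`, and it assumes the intrinsic reward is the negative KL divergence between the discounted future state-action visitation distribution and a uniform target, which is one possible reading of "relative entropy". The function names and the uniform target are hypothetical choices made for illustration.

```python
import numpy as np


def visitation_fixed_point(P, pi, gamma, tol=1e-10, max_iter=10_000):
    """Iterate d <- (1 - gamma) * P_pi + gamma * P_pi @ d until it stops changing.

    d[(s, a), (s', a')] approximates the discounted probability of visiting
    (s', a') at some future step when starting from (s, a) and following pi.
    The update is a gamma-contraction in the sup norm, so it converges.
    """
    S, A = pi.shape
    # One-step state-action kernel: P_pi[(s, a), (s', a')] = P[s, a, s'] * pi[s', a'].
    P_pi = (P[:, :, :, None] * pi[None, None, :, :]).reshape(S * A, S * A)
    d = np.full((S * A, S * A), 1.0 / (S * A))  # any row-stochastic start works
    for _ in range(max_iter):
        d_next = (1.0 - gamma) * P_pi + gamma * (P_pi @ d)
        if np.max(np.abs(d_next - d)) < tol:
            return d_next, P_pi
        d = d_next
    return d, P_pi


def intrinsic_reward(d, eps=1e-12):
    """Negative KL divergence of each row of d to a uniform target (assumed)."""
    n = d.shape[1]
    log_target = np.log(1.0 / n)
    kl = np.sum(d * (np.log(d + eps) - log_target), axis=1)
    return -kl


# Small random MDP, used only to exercise the functions above.
rng = np.random.default_rng(0)
S, A, gamma = 4, 2, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)    # each P[s, a, :] is a distribution
pi = rng.random((S, A))
pi /= pi.sum(axis=1, keepdims=True)  # each pi[s, :] is a distribution

d, P_pi = visitation_fixed_point(P, pi, gamma)
r_int = intrinsic_reward(d).reshape(S, A)
```

In the tabular case the same fixed point has the closed form `(1 - gamma) * (I - gamma * P_pi)^{-1} P_pi`, which gives a convenient check on the iteration. With function approximation no such closed form is available, which is presumably why the abstract speaks of adapting existing algorithms to learn the fixed point.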
Related papers
- SEMDICE: Off-policy State Entropy Maximization via Stationary Distribution Correction Estimation [54.537828696303286]
In unsupervised-training for reinforcement learning, the agent aims to learn a prior policy for downstream tasks without relying on task-specific reward functions. We focus on state entropy maximization (SEM), where the goal is to learn a policy that maximizes the entropy of the state stationary distribution. We introduce SEMDICE, a principled off-policy algorithm that computes an SEM policy from an arbitrary off-policy dataset.
arXiv Detail & Related papers (2025-12-10T19:50:21Z) - Achieve Performatively Optimal Policy for Performative Reinforcement Learning [55.983627302691424]
This work proposes a zeroth-order Frank-Wolfe (0FW) algorithm with a gradient of performative policy in the framework. Experimental results demonstrate that our 0FW is more effective than the existing approximation in finding the desired PO policy.
arXiv Detail & Related papers (2025-10-06T01:56:31Z) - Scalable Submodular Policy Optimization via Pruned Submodularity Graph [2.8672152503836]
In Reinforcement Learning (abbreviated as RL), an agent interacts with the environment via a set of possible actions, and a reward is generated from some unknown distribution. The task here is to find an optimal set of actions such that the reward after a certain time step gets maximized.
arXiv Detail & Related papers (2025-07-18T11:42:07Z) - Maximum Entropy On-Policy Actor-Critic via Entropy Advantage Estimation [0.276240219662896]
A notable form of entropy regularisation is augmenting the objective with an entropy term, thereby simultaneously optimising the expected return and the entropy.
This framework, known as maximum entropy reinforcement learning (MaxEnt RL), has shown theoretical and empirical successes.
This paper proposes a simple method of separating the entropy objective from the MaxEnt RL objective, which facilitates the implementation of MaxEnt RL in on-policy settings.
arXiv Detail & Related papers (2024-07-25T15:48:24Z) - The Limits of Pure Exploration in POMDPs: When the Observation Entropy is Enough [40.82741665804367]
We study a simple approach of maximizing the entropy over observations in place of the true latent states.
We show how knowledge of the latter can be exploited to compute a principled regularization of the observation entropy to improve performance.
arXiv Detail & Related papers (2024-06-18T17:00:13Z) - On the Global Convergence of Policy Gradient in Average Reward Markov Decision Processes [50.68789924454235]
We present the first finite-time global convergence analysis of policy gradient in the context of average reward Markov decision processes (MDPs).
Our analysis shows that the policy gradient iterates converge to the optimal policy at a sublinear rate of $O\left(\frac{1}{T}\right)$, which translates to $O\left(\log(T)\right)$ regret, where $T$ represents the number of iterations.
arXiv Detail & Related papers (2024-03-11T15:25:03Z) - A Novel Variational Lower Bound for Inverse Reinforcement Learning [5.370126167091961]
Inverse reinforcement learning (IRL) seeks to learn the reward function from expert trajectories.
We present a new Variational Lower Bound for IRL (VLB-IRL).
Our method simultaneously learns the reward function and policy under the learned reward function.
arXiv Detail & Related papers (2023-11-07T03:50:43Z) - Accelerating Reinforcement Learning with Value-Conditional State Entropy Exploration [97.19464604735802]
A promising technique for exploration is to maximize the entropy of visited state distribution.
It tends to struggle in a supervised setup with a task reward, where an agent prefers to visit high-value states.
We present a novel exploration technique that maximizes the value-conditional state entropy.
arXiv Detail & Related papers (2023-05-31T01:09:28Z) - Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z) - Nearly Optimal Latent State Decoding in Block MDPs [74.51224067640717]
In episodic Block MDPs, the decision maker has access to rich observations or contexts generated from a small number of latent states.
We are first interested in estimating the latent state decoding function based on data generated under a fixed behavior policy.
We then study the problem of learning near-optimal policies in the reward-free framework.
arXiv Detail & Related papers (2022-08-17T18:49:53Z) - Do You Need the Entropy Reward (in Practice)? [29.811723497181486]
It is believed that the regularization imposed by entropy, on both policy improvement and policy evaluation, together contributes to good exploration, training convergence, and robustness of learned policies.
This paper takes a closer look at entropy as an intrinsic reward by conducting various ablation studies on soft actor-critic (SAC).
Our findings reveal that in general, entropy rewards should be applied with caution to policy evaluation.
arXiv Detail & Related papers (2022-01-28T21:43:21Z) - Robust and Adaptive Temporal-Difference Learning Using An Ensemble of Gaussian Processes [70.80716221080118]
The paper takes a generative perspective on policy evaluation via temporal-difference (TD) learning.
The OS-GPTD approach is developed to estimate the value function for a given policy by observing a sequence of state-reward pairs.
To alleviate the limited expressiveness associated with a single fixed kernel, a weighted ensemble (E) of GP priors is employed to yield an alternative scheme.
arXiv Detail & Related papers (2021-12-01T23:15:09Z) - Task-Agnostic Exploration via Policy Gradient of a Non-Parametric State Entropy Estimate [40.97686031763918]
In a reward-free environment, what is a suitable intrinsic objective for an agent to pursue so that it can learn an optimal task-agnostic exploration policy?
We argue that the entropy of the state distribution induced by finite-horizon trajectories is a sensible target.
We present a novel and practical policy-search algorithm, Maximum Entropy POLicy optimization (MEPOL), to learn a policy that maximizes a non-parametric, $k$-nearest neighbors estimate of the state distribution entropy.
arXiv Detail & Related papers (2020-07-09T08:44:39Z) - Provably Efficient Safe Exploration via Primal-Dual Policy Optimization [105.7510838453122]
We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation.
We present a provably efficient online policy optimization algorithm for CMDP with safe exploration in the function approximation setting.
arXiv Detail & Related papers (2020-03-01T17:47:03Z) - Estimating Q(s,s') with Deep Deterministic Dynamics Gradients [25.200259376015744]
We introduce a novel form of value function, $Q(s, s')$, that expresses the utility of transitioning from a state $s$ to a neighboring state $s'$.
In order to derive an optimal policy, we develop a forward dynamics model that learns to make next-state predictions that maximize this value.
arXiv Detail & Related papers (2020-02-21T19:05:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.