Non-Markovian policies occupancy measures
- URL: http://arxiv.org/abs/2205.13950v1
- Date: Fri, 27 May 2022 12:49:33 GMT
- Title: Non-Markovian policies occupancy measures
- Authors: Romain Laroche, Remi Tachet des Combes, Jacob Buckman
- Abstract summary: A central object of study in Reinforcement Learning (RL) is the Markovian policy, in which an agent's actions are chosen from a memoryless probability distribution.
Our main contribution is to prove that the occupancy measure of any non-Markovian policy can be equivalently generated by a Markovian policy.
This result allows theorems about the Markovian policy class to be directly extended to its non-Markovian counterpart.
- Score: 23.855882145667767
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A central object of study in Reinforcement Learning (RL) is the Markovian
policy, in which an agent's actions are chosen from a memoryless probability
distribution, conditioned only on its current state. The family of Markovian
policies is broad enough to be interesting, yet simple enough to be amenable to
analysis. However, RL often involves more complex policies: ensembles of
policies, policies over options, policies updated online, etc. Our main
contribution is to prove that the occupancy measure of any non-Markovian
policy, i.e., the distribution of transition samples collected with it, can be
equivalently generated by a Markovian policy.
This result allows theorems about the Markovian policy class to be directly
extended to its non-Markovian counterpart, greatly simplifying proofs, in
particular those involving replay buffers and datasets. We provide various
examples of such applications to the field of Reinforcement Learning.
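To make the main result concrete, here is a minimal sketch in one common setting, an infinite-horizon MDP with discount factor $\gamma$ and a fixed initial state distribution; the paper's statement covers more general cases:

```latex
% Discounted state-action occupancy of a (possibly non-Markovian) policy \pi:
d^{\pi}(s, a) = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^{t} \, \Pr(s_t = s, \, a_t = a \mid \pi)

% Induced Markovian policy, defined wherever the state marginal is positive:
\bar{\pi}(a \mid s) = \frac{d^{\pi}(s, a)}{\sum_{a'} d^{\pi}(s, a')}
```

The theorem then says that $\bar{\pi}$ generates exactly the occupancy measure $d^{\pi}$. The short numerical sketch below illustrates this equivalence on a toy problem; the MDP, the history-dependent policy, and all constants are illustrative choices, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-state, 2-action MDP; P[s, a] is the next-state distribution.
n_states, n_actions = 2, 2
gamma, horizon, n_episodes = 0.9, 60, 5_000
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.7, 0.3]]])

def occupancy(policy):
    """Monte-Carlo estimate of the discounted state-action occupancy d(s, a)."""
    d = np.zeros((n_states, n_actions))
    for _ in range(n_episodes):
        s, history = 0, []
        for t in range(horizon):
            a = policy(s, history)
            d[s, a] += (1 - gamma) * gamma**t
            history.append((s, a))
            s = rng.choice(n_states, p=P[s, a])
    return d / n_episodes

# A non-Markovian policy: the action distribution depends on the length of the
# history (here, its parity), not only on the current state.
def non_markovian(s, history):
    p = 0.8 if len(history) % 2 == 0 else 0.3
    return int(rng.random() > p)  # action 0 with prob. p, action 1 otherwise

d_nm = occupancy(non_markovian)

# Induced Markovian policy: pi_bar(a | s) proportional to d_nm(s, a).
pi_bar = d_nm / d_nm.sum(axis=1, keepdims=True)

def markovian(s, history):
    return rng.choice(n_actions, p=pi_bar[s])

d_m = occupancy(markovian)
print(np.abs(d_nm - d_m).max())  # small, up to Monte-Carlo and truncation error
```

Since $d^{\bar{\pi}} = d^{\pi}$, the printed gap shrinks toward zero as the number of episodes and the horizon grow.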
Related papers
- Sample Complexity Reduction via Policy Difference Estimation in Tabular Reinforcement Learning [8.182196998385582]
Existing work in bandits has shown that it is possible to identify the best policy by estimating only the difference between the behaviors of individual policies.
However, the best-known sample complexity bounds in RL fail to take advantage of this and instead estimate the behavior of each policy directly.
We show that it almost suffices to estimate only the differences in RL: if we can estimate the behavior of a single reference policy, it suffices to only estimate how any other policy deviates from this reference policy.
arXiv Detail & Related papers (2024-06-11T00:02:19Z)
- Information Capacity Regret Bounds for Bandits with Mediator Feedback [55.269551124587224]
We introduce the policy set capacity as an information-theoretic measure for the complexity of the policy set.
Adopting the classical EXP4 algorithm, we provide new regret bounds depending on the policy set capacity.
For a selection of policy set families, we prove nearly-matching lower bounds, scaling similarly with the capacity.
arXiv Detail & Related papers (2024-02-15T19:18:47Z)
- Off-Policy Evaluation for Large Action Spaces via Policy Convolution [60.6953713877886]
The Policy Convolution (PC) family of estimators uses latent structure within actions to strategically convolve the logging and target policies.
Experiments on synthetic and benchmark datasets demonstrate remarkable mean squared error (MSE) improvements when using PC.
arXiv Detail & Related papers (2023-10-24T01:00:01Z)
- Policy Dispersion in Non-Markovian Environment [53.05904889617441]
This paper learns diverse policies from histories of state-action pairs in a non-Markovian environment.
We first adopt a transformer-based method to learn policy embeddings.
Then, we stack the policy embeddings to construct a dispersion matrix to induce a set of diverse policies.
arXiv Detail & Related papers (2023-02-28T11:58:39Z)
- Mutual Information Regularized Offline Reinforcement Learning [76.05299071490913]
We propose MISA, a novel framework that approaches offline RL from the perspective of the Mutual Information between States and Actions in the dataset.
We show that optimizing a lower bound on this mutual information is equivalent to maximizing the likelihood of a one-step improved policy on the offline dataset.
We introduce 3 different variants of MISA, and empirically demonstrate that a tighter mutual information lower bound gives better offline RL performance.
arXiv Detail & Related papers (2022-10-14T03:22:43Z)
- Robust Batch Policy Learning in Markov Decision Processes [0.0]
We study the offline, data-driven sequential decision-making problem in the framework of the Markov decision process (MDP).
We propose to evaluate each policy by a set of average rewards with respect to distributions centered at the policy-induced stationary distribution.
arXiv Detail & Related papers (2020-11-09T04:41:21Z)
- Variational Policy Propagation for Multi-agent Reinforcement Learning [68.26579560607597]
We propose a collaborative multi-agent reinforcement learning algorithm named variational policy propagation (VPP) to learn a joint policy through interactions among agents.
We prove that the joint policy is a Markov Random Field under some mild conditions, which in turn reduces the policy space effectively.
We integrate variational inference as special differentiable layers in the policy, such that actions can be efficiently sampled from the Markov Random Field and the overall policy is differentiable.
arXiv Detail & Related papers (2020-04-19T15:42:55Z)
- BRPO: Batch Residual Policy Optimization [79.53696635382592]
In batch reinforcement learning, one often constrains a learned policy to be close to the behavior (data-generating) policy.
We propose residual policies, where the allowable deviation of the learned policy is state-action-dependent.
We derive a new RL method, BRPO, which learns both the policy and the allowable deviation that jointly maximize a lower bound on policy performance.
arXiv Detail & Related papers (2020-02-08T01:59:33Z)