BRPO: Batch Residual Policy Optimization
- URL: http://arxiv.org/abs/2002.05522v2
- Date: Sun, 29 Mar 2020 00:45:13 GMT
- Title: BRPO: Batch Residual Policy Optimization
- Authors: Sungryull Sohn and Yinlam Chow and Jayden Ooi and Ofir Nachum and
Honglak Lee and Ed Chi and Craig Boutilier
- Abstract summary: In batch reinforcement learning, one often constrains a learned policy to be close to the behavior (data-generating) policy.
We propose residual policies, where the allowable deviation of the learned policy is state-action-dependent.
We derive a new RL method, BRPO, which learns both the policy and the allowable deviation that jointly maximize a lower bound on policy performance.
- Score: 79.53696635382592
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In batch reinforcement learning (RL), one often constrains a learned policy
to be close to the behavior (data-generating) policy, e.g., by constraining the
learned action distribution to differ from the behavior policy by some maximum
degree that is the same at each state. This can cause batch RL to be overly
conservative, unable to exploit large policy changes at frequently-visited,
high-confidence states without risking poor performance at sparsely-visited
states. To remedy this, we propose residual policies, where the allowable
deviation of the learned policy is state-action-dependent. We derive a new
RL method, BRPO, which learns both the policy and the allowable deviation that
jointly maximize a lower bound on policy performance. We show that BRPO
achieves state-of-the-art performance in a number of tasks.
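As a rough, illustrative sketch of the residual-policy idea (not the paper's exact architecture or objective), one way to make the allowable deviation state-action-dependent is to mix a fixed behavior policy with a learned candidate policy through a learned per-(state, action) weight: where the weight is near zero the agent stays close to the behavior policy, and where it is near one it can deviate freely. All class and variable names below are assumptions for illustration.

```python
# Sketch of a state-action-dependent residual policy for discrete actions.
# Illustrative only: BRPO's actual parameterization and lower-bound objective
# are not reproduced here.
import torch
import torch.nn as nn


class ResidualPolicy(nn.Module):
    def __init__(self, state_dim: int, num_actions: int, hidden: int = 64):
        super().__init__()
        # Candidate policy that proposes deviations from the behavior policy.
        self.candidate = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, num_actions)
        )
        # Per-(state, action) allowable deviation in [0, 1].
        self.deviation = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, num_actions)
        )

    def forward(self, state: torch.Tensor, behavior_probs: torch.Tensor) -> torch.Tensor:
        """Action distribution that stays near `behavior_probs` where the learned
        deviation is small and moves toward the candidate where it is large."""
        candidate_probs = torch.softmax(self.candidate(state), dim=-1)
        lam = torch.sigmoid(self.deviation(state))       # state-action-dependent weight
        mixed = (1.0 - lam) * behavior_probs + lam * candidate_probs
        return mixed / mixed.sum(dim=-1, keepdim=True)   # renormalize per state
```

In BRPO the candidate policy and the deviation are trained jointly to maximize a lower bound on policy performance; that objective is omitted here, and the mixture above is only meant to show how the allowed deviation can vary with both state and action.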
Related papers
- IOB: Integrating Optimization Transfer and Behavior Transfer for Multi-Policy Reuse [50.90781542323258]
Reinforcement learning (RL) agents can transfer knowledge from source policies to a related target task.
Previous methods introduce additional components, such as hierarchical policies or estimations of source policies' value functions.
We propose a novel transfer RL method that selects the source policy without training extra components.
arXiv Detail & Related papers (2023-08-14T09:22:35Z)
- Counterfactual Explanation Policies in RL [3.674863913115432]
COUNTERPOL is the first framework to analyze Reinforcement Learning policies using counterfactual explanations.
We establish a theoretical connection between Counterpol and widely used trust region-based policy optimization methods in RL.
arXiv Detail & Related papers (2023-07-25T01:14:56Z)
- Policy Regularization with Dataset Constraint for Offline Reinforcement Learning [27.868687398300658]
We consider the problem of learning the best possible policy from a fixed dataset, known as offline Reinforcement Learning (RL).
In this paper, we find that regularizing the policy towards the nearest state-action pair can be more effective, and thus propose Policy Regularization with dataset Constraint (PRDC).
PRDC can guide the policy with proper behaviors from the dataset while still allowing it to choose actions that do not appear together with the given state in the dataset (a rough sketch of this regularization idea appears after this list).
arXiv Detail & Related papers (2023-06-11T03:02:10Z)
- Policy Dispersion in Non-Markovian Environment [53.05904889617441]
This paper aims to learn diverse policies from the history of state-action pairs in a non-Markovian environment.
We first adopt a transformer-based method to learn policy embeddings.
Then, we stack the policy embeddings to construct a dispersion matrix to induce a set of diverse policies.
arXiv Detail & Related papers (2023-02-28T11:58:39Z)
- Offline RL With Realistic Datasets: Heteroskedasticity and Support Constraints [82.43359506154117]
We show that typical offline reinforcement learning methods fail to learn from data with non-uniform variability.
Our method is simple, theoretically motivated, and improves performance across a wide range of offline RL problems in Atari games, navigation, and pixel-based manipulation.
arXiv Detail & Related papers (2022-11-02T11:36:06Z)
- General Policy Evaluation and Improvement by Learning to Identify Few But Crucial States [12.059140532198064]
Learning to evaluate and improve policies is a core problem of Reinforcement Learning.
A recently explored competitive alternative is to learn a single value function for many policies.
We show that our value function trained to evaluate NN policies is also invariant to changes of the policy architecture.
arXiv Detail & Related papers (2022-07-04T16:34:53Z)
- Towards an Understanding of Default Policies in Multitask Policy Optimization [29.806071693039655]
Much of the recent success of deep reinforcement learning has been driven by regularized policy optimization (RPO) algorithms.
We take a first step towards filling this gap by formally linking the quality of the default policy to its effect on optimization.
We then derive a principled RPO algorithm for multitask learning with strong performance guarantees.
arXiv Detail & Related papers (2021-11-04T16:45:15Z)
- Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset.
This method enables the learned policy to improve substantially over the best behavior in the data through generalization.
arXiv Detail & Related papers (2021-10-12T17:05:05Z)
- Provably Good Batch Reinforcement Learning Without Great Exploration [51.51462608429621]
Batch reinforcement learning (RL) is important for applying RL algorithms to many high-stakes tasks.
Recent algorithms have shown promise but can still be overly optimistic in their expected outcomes.
We show that a small modification to the Bellman optimality and evaluation back-ups, yielding a more conservative update, can have much stronger guarantees.
arXiv Detail & Related papers (2020-07-16T09:25:54Z)
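To make the dataset-constraint regularization mentioned in the PRDC entry above a little more concrete (see the note in that entry), the sketch below penalizes the distance between the policy's proposed action and the action of the nearest (state, action) pair in the dataset. The brute-force nearest-neighbour search, the state scaling factor `beta`, and the function name are illustrative assumptions rather than the paper's implementation.

```python
# Sketch of a dataset-constraint penalty in the spirit of PRDC: the policy's
# action is pulled toward the action of the nearest (state, action) pair in
# the offline dataset. Illustrative assumption, not the published code.
import numpy as np


def dataset_constraint_penalty(state, policy_action, dataset_states, dataset_actions, beta=2.0):
    """Squared distance from `policy_action` to the action of the dataset pair
    nearest to (beta * state, policy_action) in concatenated space."""
    query = np.concatenate([beta * state, policy_action])
    keys = np.concatenate([beta * dataset_states, dataset_actions], axis=1)
    nearest_idx = int(np.argmin(np.linalg.norm(keys - query, axis=1)))
    nearest_action = dataset_actions[nearest_idx]
    return float(np.sum((policy_action - nearest_action) ** 2))
```

A practical implementation would replace the brute-force scan with a KD-tree or similar index; the point is only that the constraint is defined relative to the nearest dataset pair rather than by a fixed, state-independent divergence from the behavior policy.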