POLTER: Policy Trajectory Ensemble Regularization for Unsupervised
Reinforcement Learning
- URL: http://arxiv.org/abs/2205.11357v3
- Date: Fri, 15 Dec 2023 19:57:01 GMT
- Title: POLTER: Policy Trajectory Ensemble Regularization for Unsupervised
Reinforcement Learning
- Authors: Frederik Schubert, Carolin Benjamins, Sebastian Döhler, Bodo Rosenhahn, Marius Lindauer
- Abstract summary: We present POLTER, a method to regularize pretraining that can be applied to any URL algorithm.
We evaluate POLTER on the Unsupervised Reinforcement Learning Benchmark (URLB), which consists of 12 tasks in 3 domains.
We demonstrate the generality of our approach by improving the performance of a diverse set of data- and knowledge-based URL algorithms by 19% on average and up to 40% in the best case.
- Score: 30.834631947104498
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The goal of Unsupervised Reinforcement Learning (URL) is to find a
reward-agnostic prior policy on a task domain, such that the sample-efficiency
on supervised downstream tasks is improved. Although agents initialized with
such a prior policy can achieve a significantly higher reward with fewer
samples when finetuned on the downstream task, it is still an open question how
an optimal pretrained prior policy can be achieved in practice. In this work,
we present POLTER (Policy Trajectory Ensemble Regularization) - a general
method to regularize the pretraining that can be applied to any URL algorithm
and is especially useful on data- and knowledge-based URL algorithms. It
utilizes an ensemble of policies that are discovered during pretraining and
moves the policy of the URL algorithm closer to its optimal prior. Our method
is based on a theoretical framework, and we analyze its practical effects on a
white-box benchmark, allowing us to study POLTER with full control. In our main
experiments, we evaluate POLTER on the Unsupervised Reinforcement Learning
Benchmark (URLB), which consists of 12 tasks in 3 domains. We demonstrate the
generality of our approach by improving the performance of a diverse set of
data- and knowledge-based URL algorithms by 19% on average and up to 40% in the
best case. Under a fair comparison with tuned baselines and tuned POLTER, we
establish a new state-of-the-art for model-free methods on the URLB.
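To make the regularization idea described in the abstract concrete, the following Python sketch shows a pretraining loop that penalizes the KL divergence between the current policy and an ensemble of its own earlier snapshots, which acts as a stand-in for the prior the abstract refers to. All names here (PolicyNet, kl_to_ensemble, polter_coef, the placeholder intrinsic objective and data) are hypothetical; this is a minimal illustration of the general mechanism under those assumptions, not the authors' implementation.

```python
# Hypothetical sketch: keep an ensemble of policy snapshots collected during
# URL pretraining and pull the current policy toward that ensemble via a KL
# penalty. Names and the intrinsic objective are placeholders, not from the paper.

import copy
import torch
import torch.nn as nn
from torch.distributions import Normal, kl_divergence


class PolicyNet(nn.Module):
    """Small Gaussian policy over continuous actions (hypothetical stand-in)."""

    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh())
        self.mu = nn.Linear(64, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs: torch.Tensor) -> Normal:
        h = self.body(obs)
        return Normal(self.mu(h), self.log_std.exp())


def kl_to_ensemble(policy: PolicyNet, snapshots: list, obs: torch.Tensor) -> torch.Tensor:
    """Mean KL between the current policy and the frozen snapshot policies."""
    dist = policy(obs)
    with torch.no_grad():
        ref_dists = [snap(obs) for snap in snapshots]
    kls = [kl_divergence(dist, ref).sum(-1).mean() for ref in ref_dists]
    return torch.stack(kls).mean()


# Pretraining loop skeleton; the entropy term stands in for any URL objective.
obs_dim, act_dim, polter_coef = 8, 2, 0.1
policy = PolicyNet(obs_dim, act_dim)
snapshots = [copy.deepcopy(policy).eval()]           # ensemble of earlier policies
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

for step in range(1, 101):
    obs = torch.randn(32, obs_dim)                    # placeholder observations
    intrinsic_loss = -policy(obs).entropy().mean()    # placeholder URL objective
    loss = intrinsic_loss + polter_coef * kl_to_ensemble(policy, snapshots, obs)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 25 == 0:                                # periodically grow the ensemble
        snapshots.append(copy.deepcopy(policy).eval())
```

The design choice illustrated here is that the snapshot ensemble approximates a prior over the behaviors discovered so far, so the penalty keeps the pretrained policy from drifting far from that prior; how POLTER actually constructs the ensemble and weights the regularizer is described in the paper itself.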
Related papers
- Offline Imitation Learning from Multiple Baselines with Applications to Compiler Optimization [17.729842629392742]
We study a Reinforcement Learning problem in which we are given a set of trajectories collected with K baseline policies.
The goal is to learn a policy which performs as well as the best combination of baselines on the entire state space.
arXiv Detail & Related papers (2024-03-28T14:34:02Z) - Iteratively Refined Behavior Regularization for Offline Reinforcement
Learning [57.10922880400715]
In this paper, we propose a new algorithm that substantially enhances behavior-regularization based on conservative policy iteration.
By iteratively refining the reference policy used for behavior regularization, the conservative policy update guarantees gradual improvement.
Experimental results on the D4RL benchmark indicate that our method outperforms previous state-of-the-art baselines in most tasks.
arXiv Detail & Related papers (2023-06-09T07:46:24Z) - Value Enhancement of Reinforcement Learning via Efficient and Robust
Trust Region Optimization [14.028916306297928]
Reinforcement learning (RL) is a powerful machine learning technique that enables an intelligent agent to learn an optimal policy.
We propose a novel value enhancement method to improve the performance of a given initial policy computed by existing state-of-the-art RL algorithms.
arXiv Detail & Related papers (2023-01-05T18:43:40Z) - Jump-Start Reinforcement Learning [68.82380421479675]
We present a meta algorithm that can use offline data, demonstrations, or a pre-existing policy to initialize an RL policy.
In particular, we propose Jump-Start Reinforcement Learning (JSRL), an algorithm that employs two policies to solve tasks.
We show via experiments that JSRL is able to significantly outperform existing imitation and reinforcement learning algorithms.
arXiv Detail & Related papers (2022-04-05T17:25:22Z) - Constructing a Good Behavior Basis for Transfer using Generalized Policy
Updates [63.58053355357644]
We study the problem of learning a good set of policies, so that when combined together, they can solve a wide variety of unseen reinforcement learning tasks.
We show theoretically that having access to a specific set of diverse policies, which we call a set of independent policies, can allow for instantaneously achieving high-level performance.
arXiv Detail & Related papers (2021-12-30T12:20:46Z) - Towards an Understanding of Default Policies in Multitask Policy
Optimization [29.806071693039655]
Much of the recent success of deep reinforcement learning has been driven by regularized policy optimization (RPO) algorithms.
We take a first step towards filling this gap by formally linking the quality of the default policy to its effect on optimization.
We then derive a principled RPO algorithm for multitask learning with strong performance guarantees.
arXiv Detail & Related papers (2021-11-04T16:45:15Z) - Direct Random Search for Fine Tuning of Deep Reinforcement Learning
Policies [5.543220407902113]
We show that a direct random search is very effective at fine-tuning DRL policies by directly optimizing them using deterministic rollouts.
Our results show that this method yields more consistent and higher performing agents on the environments we tested.
arXiv Detail & Related papers (2021-09-12T20:12:46Z) - Privacy-Constrained Policies via Mutual Information Regularized Policy Gradients [54.98496284653234]
We consider the task of training a policy that maximizes reward while minimizing disclosure of certain sensitive state variables through the actions.
We solve this problem by introducing a regularizer based on the mutual information between the sensitive state and the actions.
We develop a model-based estimator for optimization of privacy-constrained policies.
arXiv Detail & Related papers (2020-12-30T03:22:35Z) - Zeroth-Order Supervised Policy Improvement [94.0748002906652]
Policy gradient (PG) algorithms have been widely used in reinforcement learning (RL).
We propose Zeroth-Order Supervised Policy Improvement (ZOSPI).
ZOSPI exploits the estimated value function $Q$ globally while preserving the local exploitation of the PG methods.
arXiv Detail & Related papers (2020-06-11T16:49:23Z) - Population-Guided Parallel Policy Search for Reinforcement Learning [17.360163137926]
A new population-guided parallel learning scheme is proposed to enhance the performance of off-policy reinforcement learning (RL).
In the proposed scheme, multiple identical learners with their own value functions and policies share a common experience replay buffer and search for a good policy collaboratively, guided by information from the best policy.
arXiv Detail & Related papers (2020-01-09T10:13:57Z) - Reward-Conditioned Policies [100.64167842905069]
Imitation learning requires near-optimal expert data.
Can we learn effective policies via supervised learning without demonstrations?
We show how such an approach can be derived as a principled method for policy search.
arXiv Detail & Related papers (2019-12-31T18:07:43Z)