Discovered Policy Optimisation
- URL: http://arxiv.org/abs/2210.05639v2
- Date: Thu, 13 Oct 2022 02:57:26 GMT
- Title: Discovered Policy Optimisation
- Authors: Chris Lu, Jakub Grudzien Kuba, Alistair Letcher, Luke Metz, Christian
Schroeder de Witt, Jakob Foerster
- Abstract summary: We explore the Mirror Learning space by meta-learning a "drift" function.
We refer to the immediate result as Learnt Policy Optimisation (LPO).
By analysing LPO, we gain original insights into policy optimisation, which we use to formulate a novel, closed-form RL algorithm, Discovered Policy Optimisation (DPO).
- Score: 17.458523575470384
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Tremendous progress has been made in reinforcement learning (RL) over the
past decade. Most of these advancements came through the continual development
of new algorithms, which were designed using a combination of mathematical
derivations, intuitions, and experimentation. Such an approach of creating
algorithms manually is limited by human understanding and ingenuity. In
contrast, meta-learning provides a toolkit for automatic machine learning
method optimisation, potentially addressing this flaw. However, black-box
approaches which attempt to discover RL algorithms with minimal prior structure
have thus far not outperformed existing hand-crafted algorithms. Mirror
Learning, which includes RL algorithms, such as PPO, offers a potential
middle-ground starting point: while every method in this framework comes with
theoretical guarantees, components that differentiate them are subject to
design. In this paper we explore the Mirror Learning space by meta-learning a
"drift" function. We refer to the immediate result as Learnt Policy
Optimisation (LPO). By analysing LPO we gain original insights into policy
optimisation which we use to formulate a novel, closed-form RL algorithm,
Discovered Policy Optimisation (DPO). Our experiments in Brax environments
confirm state-of-the-art performance of LPO and DPO, as well as their transfer
to unseen settings.
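To make the objective concrete: in Mirror Learning, a policy is updated by maximising the advantage-weighted probability ratio minus a drift penalty. The sketch below implements that surrogate with a two-branch closed-form drift in the spirit of DPO; the branch structure (separate handling of positive and negative advantages) follows the paper, but the exact forms and the coefficients `alpha` and `beta` here are assumptions, not a verbatim reproduction.

```python
import numpy as np

def dpo_style_drift(ratio, adv, alpha=2.0, beta=0.6):
    """Two-branch closed-form drift in the spirit of DPO.

    NOTE: the exact branch definitions and the coefficients alpha and
    beta are assumptions here, based on a reading of the paper.
    """
    pos = np.maximum(
        (ratio - 1.0) * adv - alpha * np.tanh((ratio - 1.0) * adv / alpha), 0.0)
    neg = np.maximum(
        np.log(ratio) * adv - beta * np.tanh(np.log(ratio) * adv / beta), 0.0)
    return np.where(adv >= 0.0, pos, neg)

def mirror_surrogate(ratio, adv, drift=dpo_style_drift):
    # Mirror Learning surrogate: advantage-weighted ratio minus a drift
    # penalty; a valid drift is non-negative and vanishes at ratio == 1.
    return ratio * adv - drift(ratio, adv)

ratio = np.array([0.8, 1.0, 1.2])   # pi_new / pi_old for sampled actions
adv = np.array([1.0, -0.5, 2.0])    # advantage estimates
print(mirror_surrogate(ratio, adv))
```

Note that both drift branches are zero at ratio = 1, so the penalty only bites as the new policy moves away from the old one.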
Related papers
- DPO: Differential reinforcement learning with application to optimal configuration search [3.2857981869020327]
Reinforcement learning with continuous state and action spaces remains one of the most challenging problems within the field.
We propose the first differential RL framework that can handle settings with limited training samples and short-length episodes.
arXiv Detail & Related papers (2024-04-24T03:11:12Z)
- How Can LLM Guide RL? A Value-Based Approach [68.55316627400683]
Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback.
Recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities.
We develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning.
arXiv Detail & Related papers (2024-02-25T20:07:13Z)
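The entry above describes folding LLM guidance into value-based RL as a regulariser. Purely as a hedged illustration (LINVIT's actual construction is more involved; the form below is an assumption), one generic way to do this is to penalise a TD target by the divergence between the learned policy and an LLM-suggested action distribution:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) for discrete action distributions."""
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return float(np.sum(p * np.log(p / q)))

def guided_td_target(reward, gamma, next_value, policy_probs, llm_probs, lam=0.1):
    # Illustrative only: a TD target regularised toward an LLM-suggested
    # action distribution; lam trades off reward against guidance.
    return reward + gamma * next_value - lam * kl_divergence(policy_probs, llm_probs)
```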
- Efficient Reinforcement Learning via Decoupling Exploration and Utilization [6.305976803910899]
Reinforcement Learning (RL) has achieved remarkable success across multiple fields and applications, including gaming, robotics, and autonomous vehicles.
In this work, we aim to train agents efficiently by decoupling exploration and utilization, so that the agent can escape suboptimal solutions.
The above idea is implemented in the proposed OPARL (Optimistic and Pessimistic Actor Reinforcement Learning) algorithm.
arXiv Detail & Related papers (2023-12-26T09:03:23Z)
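The OPARL summary above names the mechanism but not its details. A common way to realise decoupled optimism and pessimism, sketched here only as an assumption about the general pattern, is to act optimistically (max over a Q-ensemble) when exploring and pessimistically (min over the ensemble) when updating the exploitation policy:

```python
def optimistic_value(q_ensemble, obs, action):
    """Exploration signal: the most optimistic critic in the ensemble."""
    return max(q(obs, action) for q in q_ensemble)

def pessimistic_value(q_ensemble, obs, action):
    """Exploitation signal: the most pessimistic critic, guarding
    against value overestimation."""
    return min(q(obs, action) for q in q_ensemble)
```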
- Discovering General Reinforcement Learning Algorithms with Adversarial Environment Design [54.39859618450935]
We show that it is possible to meta-learn update rules, with the hope of discovering algorithms that can perform well on a wide range of RL tasks.
Despite impressive initial results from algorithms such as Learned Policy Gradient (LPG), there remains a gap when these algorithms are applied to unseen environments.
In this work, we examine how characteristics of the meta-supervised-training distribution impact the performance of these algorithms.
arXiv Detail & Related papers (2023-10-04T12:52:56Z)
- Jump-Start Reinforcement Learning [68.82380421479675]
We present a meta algorithm that can use offline data, demonstrations, or a pre-existing policy to initialize an RL policy.
In particular, we propose Jump-Start Reinforcement Learning (JSRL), an algorithm that employs two policies to solve tasks.
We show via experiments that JSRL is able to significantly outperform existing imitation and reinforcement learning algorithms.
arXiv Detail & Related papers (2022-04-05T17:25:22Z)
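JSRL's two-policy scheme is straightforward to sketch: a guide policy (e.g. from offline data or demonstrations) controls the first h steps of each episode and an exploration policy takes over from there, with h annealed toward zero over training. The environment API and names below are illustrative assumptions.

```python
def jsrl_episode(env, guide_policy, explore_policy, h):
    """Roll out one episode: guide_policy acts for the first h steps,
    then explore_policy takes over (JSRL-style jump start).

    env is assumed to expose reset() -> obs and
    step(action) -> (obs, reward, done); this API is an assumption.
    """
    obs, done, t, transitions = env.reset(), False, 0, []
    while not done:
        policy = guide_policy if t < h else explore_policy
        action = policy(obs)
        next_obs, reward, done = env.step(action)
        transitions.append((obs, action, reward, next_obs, done))
        obs, t = next_obs, t + 1
    return transitions
```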
- Mirror Learning: A Unifying Framework of Policy Optimisation [1.6114012813668934]
General policy improvement (GPI) and trust-region learning (TRL) are the predominant frameworks within contemporary reinforcement learning (RL).
Many state-of-the-art (SOTA) algorithms, such as TRPO and PPO, are not proven to converge.
We show that virtually all SOTA algorithms for RL are instances of mirror learning.
arXiv Detail & Related papers (2022-01-07T09:16:03Z)
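One detail worth making explicit, since Discovered Policy Optimisation builds directly on this framework: PPO's clipped surrogate can be rewritten pointwise as the Mirror Learning form "ratio times advantage minus a drift", using the algebraic identity min(x, y) = x - max(x - y, 0). The full Mirror Learning definition involves expectations and conditions on the drift (non-negativity, vanishing at the old policy), which this sketch glosses over.

```python
import numpy as np

def ppo_drift(ratio, adv, eps=0.2):
    """Drift whose mirror objective recovers PPO's clipped surrogate:
    min(r*A, clip(r)*A) == r*A - ppo_drift(r, A)."""
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.maximum((ratio - clipped) * adv, 0.0)

# Check the identity against the usual clipped-surrogate form.
ratio = np.array([0.7, 1.0, 1.5])
adv = np.array([1.0, 2.0, -1.0])
ppo = np.minimum(ratio * adv, np.clip(ratio, 0.8, 1.2) * adv)
assert np.allclose(ppo, ratio * adv - ppo_drift(ratio, adv))
```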
- On Multi-objective Policy Optimization as a Tool for Reinforcement Learning: Case Studies in Offline RL and Finetuning [24.264618706734012]
We show how to develop novel and more effective deep reinforcement learning algorithms.
We focus on offline RL and finetuning as case studies.
We introduce Distillation of a Mixture of Experts (DiME).
We demonstrate that for offline RL, DiME leads to a simple new algorithm that outperforms the state of the art.
arXiv Detail & Related papers (2021-06-15T14:59:14Z)
- Learning Sampling Policy for Faster Derivative Free Optimization [100.27518340593284]
We propose a new reinforcement-learning-based ZO algorithm (ZO-RL) that learns the sampling policy for generating the perturbations in ZO optimization, instead of using random sampling.
Our results show that ZO-RL can effectively reduce the variance of the ZO gradient by learning a sampling policy, and converges faster than existing ZO algorithms in different scenarios.
arXiv Detail & Related papers (2021-04-09T14:50:59Z)
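For context on the entry above: a standard two-point zeroth-order gradient estimator perturbs the parameters along sampled directions, and ZO-RL's idea, as summarised, is to learn where those directions come from. The sketch below is the generic estimator with the direction sampler left pluggable; the learned sampling policy itself is only indicated, not implemented.

```python
import numpy as np

def zo_gradient(f, x, sample_direction, mu=1e-3, n_samples=16):
    """Two-point zeroth-order gradient estimate of f at x.

    sample_direction(shape) supplies perturbation directions; ZO-RL
    would replace a Gaussian sampler with a learned sampling policy
    (not implemented here).
    """
    g = np.zeros_like(x)
    for _ in range(n_samples):
        u = sample_direction(x.shape)
        g += (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu) * u
    return g / n_samples

# Example with a plain Gaussian sampler on a quadratic objective.
gauss = lambda shape: np.random.standard_normal(shape)
x = np.array([1.0, -2.0])
print(zo_gradient(lambda v: float(np.sum(v ** 2)), x, gauss))  # approx 2*x
```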
- Evolving Reinforcement Learning Algorithms [186.62294652057062]
We propose a method for meta-learning reinforcement learning algorithms.
The learned algorithms are domain-agnostic and can generalize to new environments not seen during training.
We highlight two learned algorithms which obtain good generalization performance over other classical control tasks, gridworld type tasks, and Atari games.
arXiv Detail & Related papers (2021-01-08T18:55:07Z)
- Discovering Reinforcement Learning Algorithms [53.72358280495428]
Reinforcement learning algorithms update an agent's parameters according to one of several possible rules.
This paper introduces a new meta-learning approach that discovers an entire update rule.
It includes both 'what to predict' (e.g. value functions) and 'how to learn from it' by interacting with a set of environments.
arXiv Detail & Related papers (2020-07-17T07:38:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.