Mirror Learning: A Unifying Framework of Policy Optimisation
- URL: http://arxiv.org/abs/2201.02373v2
- Date: Tue, 11 Jan 2022 15:14:09 GMT
- Title: Mirror Learning: A Unifying Framework of Policy Optimisation
- Authors: Jakub Grudzien Kuba, Christian Schroeder de Witt, Jakob Foerster
- Abstract summary: General policy improvement (GPI) and trust-region learning (TRL) are the predominant frameworks within contemporary reinforcement learning (RL).
Many state-of-the-art (SOTA) algorithms, such as TRPO and PPO, are not proven to converge.
We show that virtually all SOTA algorithms for RL are instances of mirror learning.
- Score: 1.6114012813668934
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: General policy improvement (GPI) and trust-region learning (TRL) are the
predominant frameworks within contemporary reinforcement learning (RL), which
serve as the core models for solving Markov decision processes (MDPs).
Unfortunately, in their mathematical form, they are sensitive to modifications,
and thus, the practical instantiations that implement them do not automatically
inherit their improvement guarantees. As a result, the spectrum of available
rigorous MDP-solvers is narrow. Indeed, many state-of-the-art (SOTA)
algorithms, such as TRPO and PPO, are not proven to converge. In this paper, we
propose mirror learning -- a general solution to the RL problem. We
reveal GPI and TRL to be but small points within this far greater space of
algorithms which boasts the monotonic improvement property and converges to the
optimal policy. We show that virtually all SOTA algorithms for RL are instances
of mirror learning, and thus suggest that their empirical performance is a
consequence of their theoretical properties, rather than of approximate
analogies. Excitingly, we show that mirror learning opens up a whole new space
of policy learning methods with convergence guarantees.
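The abstract gives the framework's guarantees but not its update rule. As a rough illustration only, the sketch below runs a mirror-learning-style update on a small random tabular MDP, using a KL penalty as the drift (one valid choice) and its closed-form maximiser; the MDP, the drift coefficient, and all names are assumptions made for this example rather than the paper's notation.

```python
import numpy as np

# A tiny random tabular MDP (an illustrative assumption, not from the paper).
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] = next-state distribution
R = rng.uniform(size=(n_states, n_actions))                        # immediate rewards

def q_values(pi):
    """Evaluate Q^pi exactly by solving the linear Bellman equations."""
    P_pi = np.einsum("sat,sa->st", P, pi)   # state-to-state transitions under pi
    r_pi = np.einsum("sa,sa->s", R, pi)     # expected one-step reward under pi
    v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    return R + gamma * P @ v                # Q[s, a]

def mirror_step(pi_old, drift_coef=1.0):
    """One update: maximise E_{a~pi}[Q_old(s, a)] - drift_coef * KL(pi || pi_old) per state.
    With a KL drift the maximiser has the closed form pi_new ∝ pi_old * exp(Q_old / drift_coef)."""
    q = q_values(pi_old)
    unnorm = pi_old * np.exp(q / drift_coef)
    return unnorm / unnorm.sum(axis=1, keepdims=True)

pi = np.full((n_states, n_actions), 1.0 / n_actions)  # start from the uniform policy
for _ in range(50):
    pi = mirror_step(pi)  # each exact step improves the return (a mirror-learning instance)
print(np.round(pi, 3))
```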
Related papers
- REBEL: Reinforcement Learning via Regressing Relative Rewards [59.68420022466047]
We propose REBEL, a minimalist RL algorithm for the era of generative models.
In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL.
We find that REBEL provides a unified approach to language modeling and image generation with stronger or similar performance as PPO and DPO.
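The summary names the idea but not the update. Under one common reading of REBEL, each step regresses the reward difference between two sampled completions onto the scaled difference of their policy log-probability ratios; the helper below is a minimal NumPy sketch of that per-pair loss, where the function name, the eta scale, and the toy numbers are assumptions rather than the paper's code.

```python
import numpy as np

def rebel_pair_loss(logp_new, logp_old, rewards, eta=1.0):
    """Squared error between the scaled log-ratio difference and the reward difference
    for a pair of completions (y, y') sampled from the previous policy.

    logp_new, logp_old: log-probabilities of the two completions under the new and
    previous policies, arrays of shape (2,); rewards: their scalar rewards, shape (2,).
    """
    log_ratio = logp_new - logp_old              # log pi_theta(y|x) - log pi_prev(y|x)
    pred = (log_ratio[0] - log_ratio[1]) / eta   # the model's implied relative reward
    target = rewards[0] - rewards[1]             # the observed relative reward
    return (pred - target) ** 2

# Toy usage with made-up numbers:
print(rebel_pair_loss(np.array([-1.2, -0.8]), np.array([-1.0, -1.0]), np.array([0.7, 0.3])))
```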
arXiv Detail & Related papers (2024-04-25T17:20:45Z) - How Can LLM Guide RL? A Value-Based Approach [68.55316627400683]
Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback.
Recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities.
We develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning.
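LINVIT's exact construction is not described in the summary above. As a loose illustration of "LLM guidance as a regularization factor" in value-based RL, the sketch below tilts an LLM-suggested action distribution by learned Q-values through a KL penalty; the closed form, names, and numbers are assumptions and may differ from the actual algorithm.

```python
import numpy as np

def kl_regularised_policy(q_values, llm_prior, temperature=1.0):
    """Maximise E_pi[Q] - temperature * KL(pi || llm_prior) at one state; the maximiser
    tilts the LLM's suggested action distribution by the learned Q-values.
    llm_prior: the LLM's action distribution for the current state (assumed given)."""
    unnorm = llm_prior * np.exp(q_values / temperature)
    return unnorm / unnorm.sum()

# Toy usage: the LLM prefers action 0, the Q-function slightly prefers action 1.
print(kl_regularised_policy(np.array([0.2, 0.3]), np.array([0.8, 0.2])))
```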
arXiv Detail & Related papers (2024-02-25T20:07:13Z) - Is Inverse Reinforcement Learning Harder than Standard Reinforcement Learning? A Theoretical Perspective [55.36819597141271]
Inverse Reinforcement Learning (IRL) -- the problem of learning reward functions from demonstrations of an expert policy -- plays a critical role in developing intelligent systems.
This paper provides the first line of results for efficient IRL in vanilla offline and online settings using polynomial samples and runtime.
As an application, we show that the learned rewards can transfer to another target MDP with suitable guarantees.
arXiv Detail & Related papers (2023-11-29T00:09:01Z) - Discovered Policy Optimisation [17.458523575470384]
We explore the Mirror Learning space by meta-learning a "drift" function.
We refer to the immediate result as Learnt Policy Optimisation (LPO).
By analysing LPO we gain original insights into policy optimisation, which we use to formulate a novel, closed-form RL algorithm, Discovered Policy Optimisation (DPO).
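As a rough picture of what is being meta-learned, the sketch below pairs a mirror-learning-style surrogate with a small parametric drift family; the family, its meta-training, and DPO's actual closed form are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def parametric_drift(ratio, adv, phi):
    """A tiny parametric drift family (an assumed stand-in for a learned drift):
    non-negative, and zero when the new policy equals the old one (ratio == 1)."""
    log_r = np.log(ratio)
    return phi[0] * log_r ** 2 + phi[1] * (log_r * adv) ** 2

def surrogate_objective(ratio, adv, phi):
    """Mirror-learning-style surrogate: expected advantage minus the drift penalty.
    Meta-learning would tune phi (e.g. by evolutionary search) so that maximising this
    surrogate yields a well-performing policy-optimisation algorithm."""
    return np.mean(ratio * adv - parametric_drift(ratio, adv, phi))

# Toy usage with made-up probability ratios and advantage estimates:
ratio = np.array([0.9, 1.1, 1.3])
adv = np.array([0.5, -0.2, 1.0])
print(surrogate_objective(ratio, adv, phi=np.array([0.1, 0.05])))
```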
arXiv Detail & Related papers (2022-10-11T17:32:11Z) - Contrastive UCB: Provably Efficient Contrastive Self-Supervised Learning in Online Reinforcement Learning [92.18524491615548]
Contrastive self-supervised learning has been successfully integrated into the practice of (deep) reinforcement learning (RL).
We study how RL can be empowered by contrastive learning in a class of Markov decision processes (MDPs) and Markov games (MGs) with low-rank transitions.
Under the online setting, we propose novel upper confidence bound (UCB)-type algorithms that incorporate such a contrastive loss with online RL algorithms for MDPs or MGs.
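The summary leaves out the exploration mechanism. A standard ingredient of UCB-type algorithms for low-rank MDPs is an elliptical bonus computed from state-action features, which in this setting would come from a contrastively trained encoder; the sketch below only computes such a bonus from given feature vectors, and the constants and names are assumptions.

```python
import numpy as np

def ucb_bonus(phi_new, phi_history, beta=1.0, ridge=1.0):
    """Elliptical bonus beta * sqrt(phi^T Lambda^{-1} phi), where Lambda is the
    ridge-regularised covariance of previously observed feature vectors.
    phi_new: feature vector of the candidate (s, a); phi_history: past features, shape (n, d)."""
    d = phi_history.shape[1]
    cov = phi_history.T @ phi_history + ridge * np.eye(d)
    return beta * np.sqrt(phi_new @ np.linalg.solve(cov, phi_new))

# Toy usage: a rarely visited feature direction receives a larger exploration bonus.
hist = np.array([[1.0, 0.0]] * 20)
print(ucb_bonus(np.array([1.0, 0.0]), hist), ucb_bonus(np.array([0.0, 1.0]), hist))
```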
arXiv Detail & Related papers (2022-07-29T17:29:08Z) - Making Linear MDPs Practical via Contrastive Representation Learning [101.75885788118131]
It is common to address the curse of dimensionality in Markov decision processes (MDPs) by exploiting low-rank representations.
We consider an alternative definition of linear MDPs that automatically ensures normalization while allowing efficient representation learning.
We demonstrate superior performance over existing state-of-the-art model-based and model-free algorithms on several benchmarks.
arXiv Detail & Related papers (2022-07-14T18:18:02Z) - Policy Mirror Descent for Regularized Reinforcement Learning: A Generalized Framework with Linear Convergence [60.20076757208645]
This paper proposes a general policy mirror descent (GPMD) algorithm for solving regularized RL.
We demonstrate that our algorithm converges linearly over an entire range of learning rates, in a dimension-free fashion, to the global solution.
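For intuition, the sketch below implements the tabular, entropy-regularised special case of a policy mirror descent step, which has a closed-form solution; GPMD itself handles general convex regularisers through Bregman divergences, so this is a simplified reading rather than the paper's algorithm, and the step sizes are arbitrary.

```python
import numpy as np

def pmd_step(q_s, pi_s, eta=0.5, tau=0.1):
    """One entropy-regularised policy mirror descent step at a single state:
    maximise <q, p> - tau * sum(p log p) - (1 / eta) * KL(p || pi_old),
    whose maximiser is p ∝ pi_old^(1 / (1 + eta * tau)) * exp(eta * q / (1 + eta * tau))."""
    unnorm = pi_s ** (1.0 / (1.0 + eta * tau)) * np.exp(eta * q_s / (1.0 + eta * tau))
    return unnorm / unnorm.sum()

# Toy usage at one state with two actions and a uniform previous policy:
print(pmd_step(np.array([1.0, 0.5]), np.array([0.5, 0.5])))
```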
arXiv Detail & Related papers (2021-05-24T02:21:34Z) - Mirror Descent Policy Optimization [41.46894905097985]
We propose an efficient RL algorithm called mirror descent policy optimization (MDPO).
MDPO iteratively updates the policy by approximately solving a trust-region problem.
We highlight the connections between on-policy MDPO and two popular trust-region RL algorithms: TRPO and PPO, and show that explicitly enforcing the trust-region constraint is in fact not a necessity for high performance gains in TRPO.
arXiv Detail & Related papers (2020-05-20T01:30:43Z)
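To make "approximately solving a trust-region problem" concrete, the sketch below writes down an MDPO-style surrogate: importance-weighted advantages minus a KL penalty scaled by an inverse step size, to be increased by a few gradient steps rather than by enforcing a hard constraint. The KL direction, names, and toy numbers are assumptions.

```python
import numpy as np

def mdpo_surrogate(ratio, adv, pi_new, pi_old, step_size=1.0):
    """Importance-weighted advantage minus a KL penalty between new and old policies
    (here KL(new || old), one common convention), scaled by 1 / step_size.
    ratio, adv: per-sample probability ratios and advantage estimates;
    pi_new, pi_old: action distributions at the sampled states (toy tabular shapes)."""
    kl = np.sum(pi_new * np.log(pi_new / pi_old), axis=-1)  # per-state KL(new || old)
    return np.mean(ratio * adv) - np.mean(kl) / step_size

# Toy usage with made-up numbers for two sampled states and two actions:
pi_old = np.array([[0.5, 0.5], [0.6, 0.4]])
pi_new = np.array([[0.55, 0.45], [0.5, 0.5]])
print(mdpo_surrogate(np.array([1.1, 0.83]), np.array([0.4, -0.1]), pi_new, pi_old))
```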