Mirror Learning: A Unifying Framework of Policy Optimisation
- URL: http://arxiv.org/abs/2201.02373v2
- Date: Tue, 11 Jan 2022 15:14:09 GMT
- Title: Mirror Learning: A Unifying Framework of Policy Optimisation
- Authors: Jakub Grudzien Kuba, Christian Schroeder de Witt, Jakob Foerster
- Abstract summary: General policy improvement (GPI) and trust-region learning (TRL) are the predominant frameworks within contemporary reinforcement learning (RL).
Many state-of-the-art (SOTA) algorithms, such as TRPO and PPO, are not proven to converge.
We show that virtually all SOTA algorithms for RL are instances of mirror learning.
- Score: 1.6114012813668934
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: General policy improvement (GPI) and trust-region learning (TRL) are the
predominant frameworks within contemporary reinforcement learning (RL), which
serve as the core models for solving Markov decision processes (MDPs).
Unfortunately, in their mathematical form, they are sensitive to modifications,
and thus, the practical instantiations that implement them do not automatically
inherit their improvement guarantees. As a result, the spectrum of available
rigorous MDP-solvers is narrow. Indeed, many state-of-the-art (SOTA)
algorithms, such as TRPO and PPO, are not proven to converge. In this paper, we
propose mirror learning -- a general solution to the RL problem. We
reveal GPI and TRL to be but small points within this far greater space of
algorithms which boasts the monotonic improvement property and converges to the
optimal policy. We show that virtually all SOTA algorithms for RL are instances
of mirror learning, and thus suggest that their empirical performance is a
consequence of their theoretical properties, rather than of approximate
analogies. Excitingly, we show that mirror learning opens up a whole new space
of policy learning methods with convergence guarantees.
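The abstract gives the framework's guarantees but not its update rule. As a rough illustration only, the sketch below runs a mirror-learning-style update on a small random tabular MDP, using a KL penalty as the drift (one valid choice) and its closed-form maximiser; the MDP, the drift coefficient, and all names are assumptions made for this example rather than the paper's notation.

```python
import numpy as np

# A tiny random tabular MDP (an illustrative assumption, not from the paper).
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] = next-state distribution
R = rng.uniform(size=(n_states, n_actions))                        # immediate rewards

def q_values(pi):
    """Evaluate Q^pi exactly by solving the linear Bellman equations."""
    P_pi = np.einsum("sat,sa->st", P, pi)   # state-to-state transitions under pi
    r_pi = np.einsum("sa,sa->s", R, pi)     # expected one-step reward under pi
    v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    return R + gamma * P @ v                # Q[s, a]

def mirror_step(pi_old, drift_coef=1.0):
    """One update: maximise E_{a~pi}[Q_old(s, a)] - drift_coef * KL(pi || pi_old) per state.
    With a KL drift the maximiser has the closed form pi_new ∝ pi_old * exp(Q_old / drift_coef)."""
    q = q_values(pi_old)
    unnorm = pi_old * np.exp(q / drift_coef)
    return unnorm / unnorm.sum(axis=1, keepdims=True)

pi = np.full((n_states, n_actions), 1.0 / n_actions)  # start from the uniform policy
for _ in range(50):
    pi = mirror_step(pi)  # each exact step improves the return (a mirror-learning instance)
print(np.round(pi, 3))
```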
Related papers
- REBEL: Reinforcement Learning via Regressing Relative Rewards [59.68420022466047]
We propose REBEL, a minimalist RL algorithm for the era of generative models.
In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL.
We find that REBEL provides a unified approach to language modeling and image generation with stronger or similar performance as PPO and DPO.
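The summary names the idea but not the update. Under one common reading of REBEL, each step regresses the reward difference between two sampled completions onto the scaled difference of their policy log-probability ratios; the helper below is a minimal NumPy sketch of that per-pair loss, where the function name, the eta scale, and the toy numbers are assumptions rather than the paper's code.

```python
import numpy as np

def rebel_pair_loss(logp_new, logp_old, rewards, eta=1.0):
    """Squared error between the scaled log-ratio difference and the reward difference
    for a pair of completions (y, y') sampled from the previous policy.

    logp_new, logp_old: log-probabilities of the two completions under the new and
    previous policies, arrays of shape (2,); rewards: their scalar rewards, shape (2,).
    """
    log_ratio = logp_new - logp_old              # log pi_theta(y|x) - log pi_prev(y|x)
    pred = (log_ratio[0] - log_ratio[1]) / eta   # the model's implied relative reward
    target = rewards[0] - rewards[1]             # the observed relative reward
    return (pred - target) ** 2

# Toy usage with made-up numbers:
print(rebel_pair_loss(np.array([-1.2, -0.8]), np.array([-1.0, -1.0]), np.array([0.7, 0.3])))
```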
arXiv Detail & Related papers (2024-04-25T17:20:45Z) - How Can LLM Guide RL? A Value-Based Approach [68.55316627400683]
Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback.
Recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities.
We develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning.
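LINVIT's exact construction is not described in the summary above. As a loose illustration of "LLM guidance as a regularization factor" in value-based RL, the sketch below tilts an LLM-suggested action distribution by learned Q-values through a KL penalty; the closed form, names, and numbers are assumptions and may differ from the actual algorithm.

```python
import numpy as np

def kl_regularised_policy(q_values, llm_prior, temperature=1.0):
    """Maximise E_pi[Q] - temperature * KL(pi || llm_prior) at one state; the maximiser
    tilts the LLM's suggested action distribution by the learned Q-values.
    llm_prior: the LLM's action distribution for the current state (assumed given)."""
    unnorm = llm_prior * np.exp(q_values / temperature)
    return unnorm / unnorm.sum()

# Toy usage: the LLM prefers action 0, the Q-function slightly prefers action 1.
print(kl_regularised_policy(np.array([0.2, 0.3]), np.array([0.8, 0.2])))
```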
arXiv Detail & Related papers (2024-02-25T20:07:13Z) - Is Inverse Reinforcement Learning Harder than Standard Reinforcement Learning? A Theoretical Perspective [55.36819597141271]
Inverse Reinforcement Learning (IRL) -- the problem of learning reward functions from demonstrations of an expert policy -- plays a critical role in developing intelligent systems.
This paper provides the first line of results for efficient IRL in vanilla offline and online settings using polynomial samples and runtime.
As an application, we show that the learned rewards can transfer to another target MDP with suitable guarantees.
arXiv Detail & Related papers (2023-11-29T00:09:01Z) - Discovered Policy Optimisation [17.458523575470384]
We explore the Mirror Learning space by meta-learning a "drift" function.
We refer to the immediate result as Learnt Policy Optimisation (LPO).
By analysing LPO we gain original insights into policy optimisation, which we use to formulate a novel, closed-form RL algorithm, Discovered Policy Optimisation (DPO).
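As a rough picture of what is being meta-learned, the sketch below pairs a mirror-learning-style surrogate with a small parametric drift family; the family, its meta-training, and DPO's actual closed form are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def parametric_drift(ratio, adv, phi):
    """A tiny parametric drift family (an assumed stand-in for a learned drift):
    non-negative, and zero when the new policy equals the old one (ratio == 1)."""
    log_r = np.log(ratio)
    return phi[0] * log_r ** 2 + phi[1] * (log_r * adv) ** 2

def surrogate_objective(ratio, adv, phi):
    """Mirror-learning-style surrogate: expected advantage minus the drift penalty.
    Meta-learning would tune phi (e.g. by evolutionary search) so that maximising this
    surrogate yields a well-performing policy-optimisation algorithm."""
    return np.mean(ratio * adv - parametric_drift(ratio, adv, phi))

# Toy usage with made-up probability ratios and advantage estimates:
ratio = np.array([0.9, 1.1, 1.3])
adv = np.array([0.5, -0.2, 1.0])
print(surrogate_objective(ratio, adv, phi=np.array([0.1, 0.05])))
```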
arXiv Detail & Related papers (2022-10-11T17:32:11Z) - Contrastive UCB: Provably Efficient Contrastive Self-Supervised Learning in Online Reinforcement Learning [92.18524491615548]
Contrastive self-supervised learning has been successfully integrated into the practice of (deep) reinforcement learning (RL).
We study how RL can be empowered by contrastive learning in a class of Markov decision processes (MDPs) and Markov games (MGs) with low-rank transitions.
Under the online setting, we propose novel upper confidence bound (UCB)-type algorithms that incorporate such a contrastive loss with online RL algorithms for MDPs or MGs.
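The summary leaves out the exploration mechanism. A standard ingredient of UCB-type algorithms for low-rank MDPs is an elliptical bonus computed from state-action features, which in this setting would come from a contrastively trained encoder; the sketch below only computes such a bonus from given feature vectors, and the constants and names are assumptions.

```python
import numpy as np

def ucb_bonus(phi_new, phi_history, beta=1.0, ridge=1.0):
    """Elliptical bonus beta * sqrt(phi^T Lambda^{-1} phi), where Lambda is the
    ridge-regularised covariance of previously observed feature vectors.
    phi_new: feature vector of the candidate (s, a); phi_history: past features, shape (n, d)."""
    d = phi_history.shape[1]
    cov = phi_history.T @ phi_history + ridge * np.eye(d)
    return beta * np.sqrt(phi_new @ np.linalg.solve(cov, phi_new))

# Toy usage: a rarely visited feature direction receives a larger exploration bonus.
hist = np.array([[1.0, 0.0]] * 20)
print(ucb_bonus(np.array([1.0, 0.0]), hist), ucb_bonus(np.array([0.0, 1.0]), hist))
```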
arXiv Detail & Related papers (2022-07-29T17:29:08Z) - Making Linear MDPs Practical via Contrastive Representation Learning [101.75885788118131]
It is common to address the curse of dimensionality in Markov decision processes (MDPs) by exploiting low-rank representations.
We consider an alternative definition of linear MDPs that automatically ensures normalization while allowing efficient representation learning.
We demonstrate superior performance over existing state-of-the-art model-based and model-free algorithms on several benchmarks.
arXiv Detail & Related papers (2022-07-14T18:18:02Z) - Policy Mirror Descent for Regularized Reinforcement Learning: A Generalized Framework with Linear Convergence [60.20076757208645]
This paper proposes a general policy mirror descent (GPMD) algorithm for solving regularized RL.
We demonstrate that our algorithm converges linearly over an entire range of learning rates, in a dimension-free fashion, to the global solution.
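For intuition, the sketch below implements the tabular, entropy-regularised special case of a policy mirror descent step, which has a closed-form solution; GPMD itself handles general convex regularisers through Bregman divergences, so this is a simplified reading rather than the paper's algorithm, and the step sizes are arbitrary.

```python
import numpy as np

def pmd_step(q_s, pi_s, eta=0.5, tau=0.1):
    """One entropy-regularised policy mirror descent step at a single state:
    maximise <q, p> - tau * sum(p log p) - (1 / eta) * KL(p || pi_old),
    whose maximiser is p ∝ pi_old^(1 / (1 + eta * tau)) * exp(eta * q / (1 + eta * tau))."""
    unnorm = pi_s ** (1.0 / (1.0 + eta * tau)) * np.exp(eta * q_s / (1.0 + eta * tau))
    return unnorm / unnorm.sum()

# Toy usage at one state with two actions and a uniform previous policy:
print(pmd_step(np.array([1.0, 0.5]), np.array([0.5, 0.5])))
```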
arXiv Detail & Related papers (2021-05-24T02:21:34Z) - Mirror Descent Policy Optimization [41.46894905097985]
We propose an efficient RL algorithm called mirror descent policy optimization (MDPO).
MDPO iteratively updates the policy by approximately solving a trust-region problem.
We highlight the connections between on-policy MDPO and two popular trust-region RL algorithms: TRPO and PPO, and show that explicitly enforcing the trust-region constraint is in fact not a necessity for high performance gains in TRPO.
arXiv Detail & Related papers (2020-05-20T01:30:43Z)
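To make "approximately solving a trust-region problem" concrete, the sketch below writes down an MDPO-style surrogate: importance-weighted advantages minus a KL penalty scaled by an inverse step size, to be increased by a few gradient steps rather than by enforcing a hard constraint. The KL direction, names, and toy numbers are assumptions.

```python
import numpy as np

def mdpo_surrogate(ratio, adv, pi_new, pi_old, step_size=1.0):
    """Importance-weighted advantage minus a KL penalty between new and old policies
    (here KL(new || old), one common convention), scaled by 1 / step_size.
    ratio, adv: per-sample probability ratios and advantage estimates;
    pi_new, pi_old: action distributions at the sampled states (toy tabular shapes)."""
    kl = np.sum(pi_new * np.log(pi_new / pi_old), axis=-1)  # per-state KL(new || old)
    return np.mean(ratio * adv) - np.mean(kl) / step_size

# Toy usage with made-up numbers for two sampled states and two actions:
pi_old = np.array([[0.5, 0.5], [0.6, 0.4]])
pi_new = np.array([[0.55, 0.45], [0.5, 0.5]])
print(mdpo_surrogate(np.array([1.1, 0.83]), np.array([0.4, -0.1]), pi_new, pi_old))
```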