On Multi-objective Policy Optimization as a Tool for Reinforcement
Learning: Case Studies in Offline RL and Finetuning
- URL: http://arxiv.org/abs/2106.08199v2
- Date: Tue, 1 Aug 2023 12:02:58 GMT
- Title: On Multi-objective Policy Optimization as a Tool for Reinforcement
Learning: Case Studies in Offline RL and Finetuning
- Authors: Abbas Abdolmaleki, Sandy H. Huang, Giulia Vezzani, Bobak Shahriari,
Jost Tobias Springenberg, Shruti Mishra, Dhruva TB, Arunkumar Byravan,
Konstantinos Bousmalis, Andras Gyorgy, Csaba Szepesvari, Raia Hadsell,
Nicolas Heess, Martin Riedmiller
- Abstract summary: We show how to develop novel and more effective deep reinforcement learning algorithms.
We focus on offline RL and finetuning as case studies.
We introduce Distillation of a Mixture of Experts (DiME).
We demonstrate that for offline RL, DiME leads to a simple new algorithm that outperforms the state of the art.
- Score: 24.264618706734012
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many advances that have improved the robustness and efficiency of deep
reinforcement learning (RL) algorithms can, in one way or another, be
understood as introducing additional objectives or constraints in the policy
optimization step. This includes ideas as far ranging as exploration bonuses,
entropy regularization, and regularization toward teachers or data priors.
Often, the task reward and auxiliary objectives are in conflict, and in this
paper we argue that this makes it natural to treat these cases as instances of
multi-objective (MO) optimization problems. We demonstrate how this perspective
allows us to develop novel and more effective RL algorithms. In particular, we
focus on offline RL and finetuning as case studies, and show that existing
approaches can be understood as MO algorithms relying on linear scalarization.
We hypothesize that replacing linear scalarization with a better algorithm can
improve performance. We introduce Distillation of a Mixture of Experts (DiME),
a new MORL algorithm that outperforms linear scalarization and can be applied
to these non-standard MO problems. We demonstrate that for offline RL, DiME
leads to a simple new algorithm that outperforms the state of the art. For
finetuning, we derive new algorithms that learn to outperform the teacher
policy.
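To make the linear-scalarization baseline concrete, below is a minimal sketch assuming a generic actor-critic setup with a task-reward objective and a behavior-regularization objective; the function, tensor, and weight names are illustrative and are not taken from the paper.

# Minimal sketch (illustrative, not the paper's implementation): a linearly
# scalarized policy loss that mixes a task-reward objective with a
# behavior-regularization objective using one fixed trade-off weight.
# DiME replaces this fixed weighted sum with a multi-objective update.
import torch

def scalarized_policy_loss(log_probs_pi, q_values, log_probs_data, alpha=0.5):
    # Task-reward objective: Q-weighted log-likelihood of the policy's actions.
    rl_objective = (q_values.detach() * log_probs_pi).mean()
    # Auxiliary objective: log-likelihood of the dataset (behavior) actions.
    bc_objective = log_probs_data.mean()
    # Linear scalarization: a single weight alpha trades off the two objectives.
    return -(alpha * rl_objective + (1.0 - alpha) * bc_objective)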
Related papers
- Offline reinforcement learning for job-shop scheduling problems [1.3927943269211593]
This paper introduces a novel offline RL method designed for optimization problems with complex constraints.
Our approach encodes actions in edge attributes and balances expected rewards with the imitation of expert solutions.
We demonstrate the effectiveness of this method on job-shop scheduling and flexible job-shop scheduling benchmarks.
arXiv Detail & Related papers (2024-10-21T07:33:42Z)
- Reinforcement Learning as an Improvement Heuristic for Real-World Production Scheduling [0.0]
One promising approach is to train an RL agent as an improvement heuristic: starting from a suboptimal solution, the agent iteratively improves it by applying small changes.
We apply this approach to a real-world multiobjective production scheduling problem.
We benchmark our method against alternative approaches using real data from our industry partner, demonstrating its superior performance.
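As a rough illustration of the improvement-heuristic setup (an assumed simplification, not the paper's method), the loop below starts from a suboptimal schedule and lets a policy propose small local changes, keeping the best schedule found; the move type (swapping two positions) and the policy interface are hypothetical.

# Minimal sketch (assumptions only): an RL policy used as an improvement
# heuristic. The policy proposes a local move (here: swapping two positions in
# a job sequence) and the incumbent is updated whenever the objective improves.
import random

def improve(schedule, objective, policy, steps=1000):
    best, best_cost = list(schedule), objective(schedule)
    current = list(best)
    for _ in range(steps):
        i, j = policy(current)                      # policy picks a local move
        current[i], current[j] = current[j], current[i]
        cost = objective(current)
        if cost < best_cost:                        # keep the best solution so far
            best, best_cost = list(current), cost
    return best, best_cost

# A random move selector can stand in for the learned agent in a smoke test.
random_policy = lambda s: random.sample(range(len(s)), 2)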
arXiv Detail & Related papers (2024-09-18T12:48:56Z)
- How Can LLM Guide RL? A Value-Based Approach [68.55316627400683]
Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback.
Recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities.
We develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning.
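A generic way to use a prior policy as a regularizer in value-based RL is a KL-regularized (soft) value update; the sketch below assumes this standard form rather than reproducing LINVIT itself, and the temperature lam and array shapes are illustrative.

# Minimal sketch (a generic KL-regularized update, assumed rather than taken
# from LINVIT): the LLM supplies a prior over actions, and the resulting policy
# trades off maximizing Q against staying close to that prior.
import numpy as np

def kl_regularized_step(q_values, llm_prior, lam=1.0):
    # q_values: shape (A,) array of Q(s, a); llm_prior: shape (A,) probabilities.
    shifted = (q_values - q_values.max()) / lam          # numerical stability
    policy = llm_prior * np.exp(shifted)
    policy /= policy.sum()                               # pi(a|s) proportional to prior * exp(Q/lam)
    value = q_values.max() + lam * np.log(np.sum(llm_prior * np.exp(shifted)))
    return value, policy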
arXiv Detail & Related papers (2024-02-25T20:07:13Z)
- Iteratively Refined Behavior Regularization for Offline Reinforcement Learning [57.10922880400715]
In this paper, we propose a new algorithm that substantially enhances behavior-regularization based on conservative policy iteration.
By iteratively refining the reference policy used for behavior regularization, the conservative policy update guarantees gradual improvement.
Experimental results on the D4RL benchmark indicate that our method outperforms previous state-of-the-art baselines in most tasks.
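The core structural idea, refining the reference policy between regularized updates, can be sketched as follows; the inner optimizer is left abstract and all names are hypothetical, so this shows the shape of the procedure rather than the paper's algorithm.

# Minimal sketch (structure only, not the paper's algorithm): each outer
# iteration runs behavior-regularized offline RL toward the current reference
# policy, then promotes the improved policy to be the next reference.
def iteratively_refined_training(behavior_policy, optimize_regularized, n_iters=5):
    # optimize_regularized(reference) is assumed to maximize return while
    # penalizing divergence from `reference`, returning the improved policy.
    reference = behavior_policy
    for _ in range(n_iters):
        policy = optimize_regularized(reference)
        reference = policy            # refine the reference toward the new policy
    return reference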
arXiv Detail & Related papers (2023-06-09T07:46:24Z)
- Efficient Online Reinforcement Learning with Offline Data [78.92501185886569]
We show that we can simply apply existing off-policy methods to leverage offline data when learning online.
We extensively ablate these design choices, demonstrating the key factors that most affect performance.
We see that correct application of these simple recommendations can provide a 2.5x improvement over existing approaches.
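One simple design choice in this spirit, assumed here for illustration and not necessarily the paper's exact recipe, is to fill each training batch half from the offline dataset and half from the online replay buffer:

# Minimal sketch (an assumed design choice, not a verbatim recommendation):
# "symmetric" sampling that mixes offline and online transitions in each batch
# so an off-policy learner sees both data sources from the start of training.
import random

def sample_mixed_batch(offline_data, online_buffer, batch_size=256):
    half = batch_size // 2
    batch = random.sample(offline_data, min(half, len(offline_data)))
    batch += random.sample(online_buffer, min(batch_size - len(batch), len(online_buffer)))
    return batch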
arXiv Detail & Related papers (2023-02-06T17:30:22Z)
- Offline Policy Optimization in RL with Variance Regularization [142.87345258222942]
We propose variance regularization for offline RL algorithms, using stationary distribution corrections.
We show that by using Fenchel duality, we can avoid double sampling issues for computing the gradient of the variance regularizer.
The proposed algorithm for offline variance regularization (OVAR) can be used to augment any existing offline policy optimization algorithms.
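In sketch form (a standard identity, stated here as an assumption about the general approach rather than the paper's exact derivation), the squared-mean term in the variance is what forces double sampling; applying Fenchel duality to that square, $(\mathbb{E}[R])^2 = \max_{\nu}(2\nu\,\mathbb{E}[R] - \nu^2)$, turns the regularized objective into a single-expectation problem with one extra scalar variable:

\operatorname{Var}_\pi[R] = \mathbb{E}_\pi[R^2] - (\mathbb{E}_\pi[R])^2
                          = \min_{\nu}\ \mathbb{E}_\pi\big[(R - \nu)^2\big],
\qquad\text{so}\qquad
\max_\pi \Big(\mathbb{E}_\pi[R] - \lambda\,\operatorname{Var}_\pi[R]\Big)
  = \max_{\pi,\,\nu} \Big(\mathbb{E}_\pi[R] - \lambda\,\mathbb{E}_\pi\big[(R-\nu)^2\big]\Big).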
arXiv Detail & Related papers (2022-12-29T18:25:01Z)
- Jump-Start Reinforcement Learning [68.82380421479675]
We present a meta algorithm that can use offline data, demonstrations, or a pre-existing policy to initialize an RL policy.
In particular, we propose Jump-Start Reinforcement Learning (JSRL), an algorithm that employs two policies to solve tasks.
We show via experiments that JSRL is able to significantly outperform existing imitation and reinforcement learning algorithms.
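The two-policy rollout can be sketched as below, assuming a classic gym-style env.step interface; shrinking the switch point h over training would form a curriculum. Everything here is an assumed simplification rather than the JSRL implementation.

# Minimal sketch (assumed simplification of the jump-start idea): a guide
# policy controls the first h steps of each episode, then the learning
# (exploration) policy takes over; gradually reducing h forms a curriculum.
def jumpstart_episode(env, guide_policy, explore_policy, h):
    obs = env.reset()
    transitions, t, done = [], 0, False
    while not done:
        policy = guide_policy if t < h else explore_policy
        action = policy(obs)
        next_obs, reward, done, info = env.step(action)   # gym-style step
        transitions.append((obs, action, reward, next_obs, done))
        obs, t = next_obs, t + 1
    return transitions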
arXiv Detail & Related papers (2022-04-05T17:25:22Z)
- Behavioral Priors and Dynamics Models: Improving Performance and Domain Transfer in Offline RL [82.93243616342275]
We introduce Offline Model-based RL with Adaptive Behavioral Priors (MABE).
MABE is based on the finding that dynamics models, which support within-domain generalization, and behavioral priors, which support cross-domain generalization, are complementary.
In experiments that require cross-domain generalization, we find that MABE outperforms prior methods.
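To illustrate the dynamics-model half of this combination (a hypothetical sketch, not MABE's implementation), short synthetic rollouts from a learned model can supplement the offline dataset while a behavioral prior separately regularizes the policy update:

# Minimal sketch (illustrative only): branching short rollouts from a learned
# dynamics model to generate synthetic transitions that supplement offline data.
def model_rollouts(dynamics_model, policy, start_states, horizon=5):
    # dynamics_model(states, actions) is assumed to return (next_states, rewards).
    synthetic, states = [], start_states
    for _ in range(horizon):
        actions = policy(states)
        next_states, rewards = dynamics_model(states, actions)
        synthetic.append((states, actions, rewards, next_states))
        states = next_states
    return synthetic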
arXiv Detail & Related papers (2021-06-16T20:48:49Z)
- A Minimalist Approach to Offline Reinforcement Learning [10.904148149681932]
Offline reinforcement learning defines the task of learning from a fixed batch of data.
In this paper we aim to make a deep RL algorithm work while making minimal changes.
We find that we can match the performance of state-of-the-art offline RL algorithms by simply adding a behavior cloning term to the policy update of an online RL algorithm.
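In the spirit of that one-line change (a sketch assuming a deterministic actor-critic learner; the weighting scheme and names are illustrative rather than quoted from the paper), the policy update gains a behavior-cloning penalty toward the dataset actions:

# Minimal sketch (illustrative): an online actor loss augmented with a behavior
# cloning term so the policy stays close to actions seen in the offline batch.
import torch

def policy_loss(q_pi, pi_actions, data_actions, alpha=2.5):
    # q_pi: Q(s, pi(s)); pi_actions: policy outputs; data_actions: batch actions.
    lam = alpha / q_pi.abs().mean().detach()              # scale-invariant weight
    bc_term = ((pi_actions - data_actions) ** 2).mean()   # behavior cloning penalty
    return -(lam * q_pi).mean() + bc_term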
arXiv Detail & Related papers (2021-06-12T20:38:59Z)
- FOCAL: Efficient Fully-Offline Meta-Reinforcement Learning via Distance Metric Learning and Behavior Regularization [10.243908145832394]
We study the offline meta-reinforcement learning (OMRL) problem, a paradigm which enables reinforcement learning (RL) algorithms to quickly adapt to unseen tasks.
This problem is still not fully understood, and two major challenges need to be addressed.
We provide analysis and insight showing that some simple design choices can yield substantial improvements over recent approaches.
arXiv Detail & Related papers (2020-10-02T17:13:39Z)