Scaling Pareto-Efficient Decision Making Via Offline Multi-Objective RL
- URL: http://arxiv.org/abs/2305.00567v1
- Date: Sun, 30 Apr 2023 20:15:26 GMT
- Title: Scaling Pareto-Efficient Decision Making Via Offline Multi-Objective RL
- Authors: Baiting Zhu, Meihua Dang, Aditya Grover
- Abstract summary: The goal of multi-objective reinforcement learning (MORL) is to learn policies that simultaneously optimize multiple competing objectives.
We propose a new data-driven setup for offline MORL, where we wish to learn a preference-agnostic policy agent.
PEDA is a family of offline MORL algorithms that builds and extends Decision Transformers via a novel preference-and-return-conditioned policy.
- Score: 22.468486569700236
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The goal of multi-objective reinforcement learning (MORL) is to learn
policies that simultaneously optimize multiple competing objectives. In
practice, an agent's preferences over the objectives may not be known a priori,
and hence, we require policies that can generalize to arbitrary preferences at
test time. In this work, we propose a new data-driven setup for offline MORL,
where we wish to learn a preference-agnostic policy agent using only a finite
dataset of offline demonstrations of other agents and their preferences. The
key contributions of this work are two-fold. First, we introduce D4MORL,
(D)atasets for MORL that are specifically designed for offline settings. It
contains 1.8 million annotated demonstrations obtained by rolling out reference
policies that optimize for randomly sampled preferences on 6 MuJoCo
environments with 2-3 objectives each. Second, we propose Pareto-Efficient
Decision Agents (PEDA), a family of offline MORL algorithms that builds and
extends Decision Transformers via a novel preference-and-return-conditioned
policy. Empirically, we show that PEDA closely approximates the behavioral
policy on the D4MORL benchmark and provides an excellent approximation of the
Pareto-front with appropriate conditioning, as measured by the hypervolume and
sparsity metrics.
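To make the proposed conditioning concrete, the sketch below shows the quantities a preference-and-return-conditioned policy is fed at each timestep: a preference vector over the k objectives and the vector-valued return-to-go. This is a minimal illustration under assumed names (`vector_returns_to_go` and `build_conditioned_inputs` are hypothetical helpers, not the authors' code), and it does not claim to reproduce the exact token layout PEDA uses.

```python
import numpy as np

def vector_returns_to_go(rewards):
    """Per-objective returns-to-go for one trajectory.

    rewards: (T, k) array of per-step rewards for k objectives.
    Row t of the result is the sum of rewards from step t to the end.
    """
    return np.cumsum(rewards[::-1], axis=0)[::-1]

def build_conditioned_inputs(states, actions, rewards, preference):
    """Assemble per-timestep conditioning for a Decision-Transformer-style
    sequence policy: concatenate the preference weights with the
    vector-valued return-to-go at every step.

    states:     (T, d_s) observations
    actions:    (T, d_a) actions
    rewards:    (T, k)   per-objective rewards
    preference: (k,)     non-negative weights over the k objectives
    """
    rtg = vector_returns_to_go(rewards)                  # (T, k)
    pref = np.broadcast_to(preference, rtg.shape)        # repeat w at every step
    conditioning = np.concatenate([pref, rtg], axis=-1)  # (T, 2k)
    # The policy is trained to predict a_t from the history of
    # (conditioning_t, s_t, ...) tokens; at test time one sets the desired
    # preference and target returns and rolls the model forward.
    return conditioning, states, actions
```

The hypervolume and sparsity metrics mentioned above can also be sketched. Hypervolume is the volume dominated by the approximate Pareto front with respect to a reference point (only the 2-objective case is handled below); sparsity, as it is commonly defined in the MORL literature, is the average squared gap between consecutive solutions along each objective, so lower values indicate a denser front. These are standard definitions, not code from the paper.

```python
import numpy as np

def hypervolume_2d(front, ref):
    """Hypervolume of a 2-objective front (objectives to be maximized).

    front: iterable of (f1, f2) points; ref: reference point dominated by
    every point in the front.
    """
    pts = sorted(front, key=lambda p: p[0], reverse=True)  # sweep in decreasing f1
    hv, prev_f2 = 0.0, ref[1]
    for f1, f2 in pts:
        if f2 > prev_f2:                        # point adds new area above prev_f2
            hv += (f1 - ref[0]) * (f2 - prev_f2)
            prev_f2 = f2
    return hv

def sparsity(front):
    """Average squared distance between consecutive solutions, summed over
    objectives (lower means the front is covered more densely)."""
    P = np.asarray(front, dtype=float)
    if len(P) < 2:
        return 0.0
    gaps = sum(np.sum(np.diff(np.sort(P[:, j])) ** 2) for j in range(P.shape[1]))
    return gaps / (len(P) - 1)
```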
Related papers
- C-MORL: Multi-Objective Reinforcement Learning through Efficient Discovery of Pareto Front [9.04360155372014]
Constrained MORL is a seamless bridge between constrained policy optimization and MORL.
Our algorithm achieves more consistent and superior performances in terms of hypervolume, expected utility, and sparsity on both discrete and continuous control tasks.
arXiv Detail & Related papers (2024-10-03T06:13:56Z)
- Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization [75.1240295759264]
We propose an effective framework for Bridging and Modeling Correlations in pairwise data, named BMC.
We increase the consistency and informativeness of the pairwise preference signals through targeted modifications.
We identify that DPO alone is insufficient to model these correlations and capture nuanced variations.
arXiv Detail & Related papers (2024-08-14T11:29:47Z)
- Generalized Multi-Objective Reinforcement Learning with Envelope Updates in URLLC-enabled Vehicular Networks [12.323383132739195]
We develop a novel multi-objective reinforcement learning framework to jointly optimize wireless network selection and autonomous driving policies.
The proposed framework is designed to maximize the traffic flow and minimize collisions by controlling the vehicle's motion dynamics.
The proposed policies enable autonomous vehicles to adopt safe driving behaviors with improved connectivity.
arXiv Detail & Related papers (2024-05-18T16:31:32Z)
- Policy-regularized Offline Multi-objective Reinforcement Learning [11.58560880898882]
We extend the offline policy-regularized method, a widely-adopted approach for single-objective offline RL problems, into the multi-objective setting.
We propose two solutions to the problem of preference-inconsistent demonstrations in the dataset: 1) filtering them out via approximating behavior preferences, and 2) adopting regularization techniques with high policy expressiveness.
arXiv Detail & Related papers (2024-01-04T12:54:10Z)
- Human-in-the-Loop Policy Optimization for Preference-Based Multi-Objective Reinforcement Learning [13.627087954965695]
We propose a human-in-the-loop policy optimization framework for preference-based MORL.
Our method proactively learns the decision maker's (DM's) implicit preference information without requiring any prior knowledge.
We evaluate our approach against three conventional MORL algorithms and four state-of-the-art preference-based MORL algorithms.
arXiv Detail & Related papers (2024-01-04T09:17:53Z)
- Hundreds Guide Millions: Adaptive Offline Reinforcement Learning with Expert Guidance [74.31779732754697]
We propose a novel plug-in approach named Guided Offline RL (GORL).
GORL employs a guiding network, along with only a few expert demonstrations, to adaptively determine the relative importance of the policy improvement and policy constraint for every sample.
Experiments on various environments suggest that GORL can be easily installed on most offline RL algorithms with statistically significant performance improvements.
arXiv Detail & Related papers (2023-09-04T08:59:04Z)
- Reparameterized Policy Learning for Multimodal Trajectory Optimization [61.13228961771765]
We investigate the challenge of parametrizing policies for reinforcement learning in high-dimensional continuous action spaces.
We propose a principled framework that models the continuous RL policy as a generative model of optimal trajectories.
We present a practical model-based RL method, which leverages the multimodal policy parameterization and learned world model.
arXiv Detail & Related papers (2023-07-20T09:05:46Z)
- Provable Offline Preference-Based Reinforcement Learning [95.00042541409901]
We investigate the problem of offline Preference-based Reinforcement Learning (PbRL) with human feedback.
We consider the general reward setting where the reward can be defined over the whole trajectory.
We introduce a new single-policy concentrability coefficient, which can be upper bounded by the per-trajectory concentrability.
arXiv Detail & Related papers (2023-05-24T07:11:26Z)
- Mutual Information Regularized Offline Reinforcement Learning [76.05299071490913]
We propose a novel MISA framework to approach offline RL from the perspective of Mutual Information between States and Actions in the dataset.
We show that optimizing a lower bound on this mutual information is equivalent to maximizing the likelihood of a one-step improved policy on the offline dataset.
We introduce three variants of MISA and empirically demonstrate that a tighter mutual information lower bound gives better offline RL performance.
arXiv Detail & Related papers (2022-10-14T03:22:43Z)
- PD-MORL: Preference-Driven Multi-Objective Reinforcement Learning Algorithm [0.18416014644193063]
We propose a novel MORL algorithm that trains a single universal network to cover the entire preference space and scales to continuous robotic tasks.
PD-MORL achieves up to 25% larger hypervolume for challenging continuous control tasks and uses an order of magnitude fewer trainable parameters compared to prior approaches.
arXiv Detail & Related papers (2022-08-16T19:23:02Z)
- MOPO: Model-based Offline Policy Optimization [183.6449600580806]
Offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data.
We show that an existing model-based RL algorithm already produces significant gains in the offline setting.
We propose to modify the existing model-based RL methods by applying them with rewards artificially penalized by the uncertainty of the dynamics.
arXiv Detail & Related papers (2020-05-27T08:46:41Z)
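The MOPO entry above hinges on a single reward-shaping step: penalize the model-predicted reward by an estimate of the dynamics model's uncertainty so the learned policy avoids regions the offline data cannot support. A minimal sketch follows; the ensemble-disagreement proxy for uncertainty is an assumption for illustration, not MOPO's exact estimator (the paper derives the penalty from the learned model's predicted variance).

```python
import numpy as np

def penalized_reward(reward_pred, uncertainty, lam=1.0):
    """MOPO-style shaping: subtract a scaled uncertainty estimate from the
    model-predicted reward before using it for policy optimization."""
    return reward_pred - lam * uncertainty

def ensemble_disagreement(next_state_preds):
    """One simple uncertainty proxy (an assumption, not the paper's choice):
    the largest deviation of any ensemble member's next-state prediction
    from the ensemble mean, for a single (s, a) pair.

    next_state_preds: (n_models, d_state) array of predictions.
    """
    mean = next_state_preds.mean(axis=0)
    return np.linalg.norm(next_state_preds - mean, axis=1).max()
```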