Policy-regularized Offline Multi-objective Reinforcement Learning
- URL: http://arxiv.org/abs/2401.02244v1
- Date: Thu, 4 Jan 2024 12:54:10 GMT
- Title: Policy-regularized Offline Multi-objective Reinforcement Learning
- Authors: Qian Lin, Chao Yu, Zongkai Liu, Zifan Wu
- Abstract summary: We extend the offline policy-regularized method, a widely-adopted approach for single-objective offline RL problems, into the multi-objective setting.
We propose two solutions to this problem: 1) filtering out preference-inconsistent demonstrations via approximating behavior preferences, and 2) adopting regularization techniques with high policy expressiveness.
- Score: 11.58560880898882
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we aim to utilize only offline trajectory data to train a
policy for multi-objective RL. We extend the offline policy-regularized method,
a widely-adopted approach for single-objective offline RL problems, into the
multi-objective setting in order to achieve the above goal. However, such
methods face a new challenge in offline MORL settings, namely the
preference-inconsistent demonstration problem. We propose two solutions to this
problem: 1) filtering out preference-inconsistent demonstrations via
approximating behavior preferences, and 2) adopting regularization techniques
with high policy expressiveness. Moreover, we integrate the
preference-conditioned scalarized update method into policy-regularized offline
RL, in order to simultaneously learn a set of policies using a single policy
network, thus reducing the computational cost induced by the training of a
large number of individual policies for various preferences. Finally, we
introduce Regularization Weight Adaptation to dynamically determine appropriate
regularization weights for arbitrary target preferences during deployment.
Empirical results on various multi-objective datasets demonstrate the
capability of our approach in solving offline MORL problems.
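As a way to picture the preference-conditioned scalarized update described above, the sketch below writes it as a TD3+BC-style actor loss: the vector-valued Q-estimate is scalarized by the sampled preference, and a behavior-regularization term is weighted by a preference-dependent coefficient (the role Regularization Weight Adaptation plays at deployment). This is a minimal illustration under those assumptions, not the authors' implementation; `PrefPolicy`, `reg_weight_net`, and the squared-error regularizer are hypothetical stand-ins.

```python
# Minimal sketch of a preference-conditioned, policy-regularized actor update.
# All module names and the exact regularizer are assumptions for illustration.
import torch
import torch.nn as nn

class PrefPolicy(nn.Module):
    """Deterministic policy conditioned on state and preference vector."""
    def __init__(self, state_dim, pref_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + pref_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),
        )

    def forward(self, state, pref):
        return self.net(torch.cat([state, pref], dim=-1))

def policy_loss(policy, critic, reg_weight_net, state, action_b, pref):
    """One scalarized, policy-regularized actor update.

    critic(state, action, pref) -> vector Q-values, one per objective.
    reg_weight_net(pref)        -> regularization weight for this preference
                                   (stand-in for the adaptive weighting idea).
    """
    action = policy(state, pref)
    q_vec = critic(state, action, pref)           # (batch, n_objectives)
    q_scal = (q_vec * pref).sum(dim=-1)           # linear scalarization by preference
    bc = ((action - action_b) ** 2).sum(dim=-1)   # stay close to behavior actions
    lam = reg_weight_net(pref).squeeze(-1)        # preference-dependent weight
    return (-q_scal + lam * bc).mean()
```

A full method would also train the vector-valued critic and the weight module; the snippet only shows the shape of the actor update for a single sampled preference.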
Related papers
- Hundreds Guide Millions: Adaptive Offline Reinforcement Learning with Expert Guidance [74.31779732754697]
We propose a novel plug-in approach named Guided Offline RL (GORL).
GORL employs a guiding network, along with only a few expert demonstrations, to adaptively determine the relative importance of the policy improvement and policy constraint for every sample.
Experiments on various environments suggest that GORL can be easily installed on most offline RL algorithms with statistically significant performance improvements.
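As a rough illustration of the idea in this summary (not the GORL implementation), a guiding network can emit a per-sample weight that trades off the policy-improvement term against the policy-constraint term; all names and dimensions below are hypothetical.

```python
# Illustrative sketch only: per-sample balance between improvement and constraint.
import torch
import torch.nn as nn

state_dim, action_dim = 17, 6          # example sizes, not from the paper
guide = nn.Sequential(                 # per-sample weight w(s, a) in (0, 1)
    nn.Linear(state_dim + action_dim, 128), nn.ReLU(),
    nn.Linear(128, 1), nn.Sigmoid(),
)

def adaptive_actor_loss(policy, critic, state, action_b):
    action = policy(state)
    improvement = -critic(state, action).squeeze(-1)       # maximize Q
    constraint = ((action - action_b) ** 2).sum(dim=-1)    # stay near the data
    w = guide(torch.cat([state, action_b], dim=-1)).squeeze(-1)
    return (w * improvement + (1.0 - w) * constraint).mean()
```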
arXiv Detail & Related papers (2023-09-04T08:59:04Z)
- Statistically Efficient Variance Reduction with Double Policy Estimation for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning [53.97273491846883]
We propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation.
We validate our method in multiple tasks of OpenAI Gym with D4RL benchmarks.
arXiv Detail & Related papers (2023-08-28T20:46:07Z)
- Learning Control Policies for Variable Objectives from Offline Data [2.7174376960271154]
We introduce a conceptual extension for model-based policy search methods, called variable objective policy (VOP).
We demonstrate that by altering the objectives passed as input to the policy, users gain the freedom to adjust its behavior or re-balance optimization targets at runtime.
arXiv Detail & Related papers (2023-08-11T13:33:59Z)
- Scaling Pareto-Efficient Decision Making Via Offline Multi-Objective RL [22.468486569700236]
The goal of multi-objective reinforcement learning (MORL) is to learn policies that simultaneously optimize multiple competing objectives.
We propose a new data-driven setup for offline MORL, where we wish to learn a preference-agnostic policy agent.
PEDA is a family of offline MORL algorithms that builds and extends Decision Transformers via a novel preference-and-return-conditioned policy.
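As a sketch of what "preference-and-return-conditioned" can look like in a sequence model (an assumption about the general idea, not PEDA's code), the preference vector and the vector-valued return-to-go can be embedded together and fed to the transformer alongside state tokens; all sizes and names below are illustrative.

```python
# Illustrative conditioning for a preference- and return-conditioned sequence model.
import torch
import torch.nn as nn

n_obj, state_dim, embed_dim = 2, 17, 128            # illustrative sizes
cond_embed = nn.Linear(2 * n_obj, embed_dim)        # preference + vector return-to-go
state_embed = nn.Linear(state_dim, embed_dim)

def conditioning_tokens(pref, rtg_vec, states):
    # pref: (B, n_obj), rtg_vec: (B, T, n_obj), states: (B, T, state_dim)
    pref_seq = pref.unsqueeze(1).expand(-1, rtg_vec.size(1), -1)
    cond_tok = cond_embed(torch.cat([pref_seq, rtg_vec], dim=-1))   # (B, T, D)
    state_tok = state_embed(states)                                 # (B, T, D)
    return cond_tok, state_tok   # to be interleaved as transformer input tokens
```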
arXiv Detail & Related papers (2023-04-30T20:15:26Z)
- Offline Policy Optimization in RL with Variance Regularization [142.87345258222942]
We propose variance regularization for offline RL algorithms, using stationary distribution corrections.
We show that by using Fenchel duality, we can avoid double sampling issues for computing the gradient of the variance regularizer.
The proposed algorithm for offline variance regularization (OVAR) can be used to augment any existing offline policy optimization algorithms.
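The double-sampling issue mentioned above comes from the squared expectation inside the variance; the standard way Fenchel duality removes it (shown here as a generic sketch, which may differ in detail from the paper's exact derivation) is to replace the square with its conjugate form:

```latex
\mathrm{Var}[X] \;=\; \mathbb{E}[X^2] - \bigl(\mathbb{E}[X]\bigr)^2,
\qquad
\bigl(\mathbb{E}[X]\bigr)^2 \;=\; \max_{\nu \in \mathbb{R}} \bigl( 2\nu\,\mathbb{E}[X] - \nu^2 \bigr),
```

so the squared expectation becomes an inner optimization over \nu that is linear in \mathbb{E}[X] and can therefore be estimated from single samples.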
arXiv Detail & Related papers (2022-12-29T18:25:01Z)
- PD-MORL: Preference-Driven Multi-Objective Reinforcement Learning Algorithm [0.18416014644193063]
We propose a novel MORL algorithm that trains a single universal network to cover the entire preference space scalable to continuous robotic tasks.
PD-MORL achieves up to 25% larger hypervolume for challenging continuous control tasks and uses an order of magnitude fewer trainable parameters compared to prior approaches.
arXiv Detail & Related papers (2022-08-16T19:23:02Z)
- gTLO: A Generalized and Non-linear Multi-Objective Deep Reinforcement Learning Approach [2.0305676256390934]
Generalized Thresholded Lexicographic Ordering (gTLO) is a novel method that aims to combine non-linear MORL with the advantages of generalized MORL.
We present promising results on a standard benchmark for non-linear MORL and a real-world application from the domain of manufacturing process control.
arXiv Detail & Related papers (2022-04-11T10:06:49Z)
- Supported Policy Optimization for Offline Reinforcement Learning [74.1011309005488]
Policy constraint methods for offline reinforcement learning (RL) typically utilize parameterization or regularization.
Regularization methods reduce the divergence between the learned policy and the behavior policy.
This paper presents Supported Policy OpTimization (SPOT), which is directly derived from the theoretical formalization of the density-based support constraint.
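As a rough sketch of what a density-based support constraint can look like in practice (an illustration of the general idea, not SPOT's implementation), the actor can be penalized whenever the estimated behavior log-density of its action falls below a threshold; `behavior_log_prob`, `lam`, and `log_eps` are hypothetical stand-ins for whatever density estimator and hyperparameters a concrete method uses.

```python
# Illustrative support-style constraint: penalize low-density (off-support) actions.
import torch

def support_constrained_loss(policy, critic, behavior_log_prob, state,
                             lam=0.1, log_eps=-4.0):
    """Actor loss with a hinge penalty when log pi_beta(a|s) < log_eps."""
    action = policy(state)
    q = critic(state, action).squeeze(-1)
    log_beta = behavior_log_prob(state, action)   # estimated behavior log-density
    penalty = torch.relu(log_eps - log_beta)      # active only off-support
    return (-q + lam * penalty).mean()
```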
arXiv Detail & Related papers (2022-02-13T07:38:36Z)
- Model-Based Offline Meta-Reinforcement Learning with Regularization [63.35040401948943]
Offline meta-RL is emerging as a promising approach to address these challenges.
MerPO learns a meta-model for efficient task structure inference and an informative meta-policy.
We show that MerPO offers guaranteed improvement over both the behavior policy and the meta-policy.
arXiv Detail & Related papers (2022-02-07T04:15:20Z)
- OptiDICE: Offline Policy Optimization via Stationary Distribution Correction Estimation [59.469401906712555]
We present an offline reinforcement learning algorithm that prevents overestimation in a more principled way.
Our algorithm, OptiDICE, directly estimates the stationary distribution corrections of the optimal policy.
We show that OptiDICE performs competitively with the state-of-the-art methods.
arXiv Detail & Related papers (2021-06-21T00:43:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.