Distributionally-Constrained Policy Optimization via Unbalanced Optimal
Transport
- URL: http://arxiv.org/abs/2102.07889v1
- Date: Mon, 15 Feb 2021 23:04:37 GMT
- Title: Distributionally-Constrained Policy Optimization via Unbalanced Optimal
Transport
- Authors: Arash Givchi, Pei Wang, Junqi Wang, Patrick Shafto
- Abstract summary: We formulate policy optimization as unbalanced optimal transport over the space of occupancy measures.
We propose a general purpose RL objective based on Bregman divergence and optimize it using Dykstra's algorithm.
- Score: 15.294456568539148
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider constrained policy optimization in Reinforcement Learning, where
the constraints are in form of marginals on state visitations and global action
executions. Given these distributions, we formulate policy optimization as
unbalanced optimal transport over the space of occupancy measures. We propose a
general purpose RL objective based on Bregman divergence and optimize it using
Dykstra's algorithm. The approach admits an actor-critic algorithm for settings where the
state or action space is large and only samples from the marginals are available.
available. We discuss applications of our approach and provide demonstrations
to show the effectiveness of our algorithm.
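For illustration, the sketch below solves a toy instance of the kind of objective described in the abstract: an entropic problem over a state-by-action "occupancy" matrix with soft (KL-penalized) state and action marginals, solved by Sinkhorn-style scaling iterations, an alternating Bregman/KL scheme closely related to the Dykstra iterations the paper uses. It is a minimal sketch under simplifying assumptions, not the paper's algorithm: it ignores the Bellman-flow constraints a true occupancy measure must satisfy, and the names (R, mu, nu, eps, lam) are illustrative.

```python
# Minimal sketch, not the paper's algorithm: entropic unbalanced OT over a
# state x action matrix with soft (KL-penalized) marginals, solved by
# Sinkhorn-style scaling, i.e. alternating Bregman/KL updates.
import numpy as np

def unbalanced_sinkhorn(R, mu, nu, eps=0.1, lam=1.0, iters=500):
    """Approximately solve
        max_q  <q, R> - eps*KL(q || 1) - lam*KL(q @ 1 || mu) - lam*KL(q.T @ 1 || nu)
    over nonnegative matrices q (a surrogate occupancy over states x actions);
    mu and nu are the target state and action marginals."""
    K = np.exp(R / eps)                    # Gibbs kernel built from the reward
    u, v = np.ones(R.shape[0]), np.ones(R.shape[1])
    rho = lam / (lam + eps)                # exponent induced by the soft KL marginals
    for _ in range(iters):
        u = (mu / (K @ v)) ** rho          # Bregman/KL update for the state marginal
        v = (nu / (K.T @ u)) ** rho        # Bregman/KL update for the action marginal
    return u[:, None] * K * v[None, :]

# Toy usage: 3 states, 2 actions, uniform target marginals.
R = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
q = unbalanced_sinkhorn(R, mu=np.ones(3) / 3, nu=np.ones(2) / 2)
pi = q / q.sum(axis=1, keepdims=True)      # per-state policy read off the occupancy
```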
Related papers
- Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
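As a rough sketch of the combined objective this summary describes, the snippet below pairs a DPO-style preference loss with an SFT negative log-likelihood term on the chosen responses; the exact form and the weight sft_coeff are assumptions for illustration, not the paper's loss.

```python
# Hedged sketch: preference loss + SFT regularizer (assumed form, illustrative only).
import torch
import torch.nn.functional as F

def regularized_preference_loss(logp_chosen, logp_rejected,
                                ref_logp_chosen, ref_logp_rejected,
                                beta=0.1, sft_coeff=1.0):
    # DPO-style term on log-probability ratios against a fixed reference model.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    pref_loss = -F.logsigmoid(margin).mean()
    # Supervised (SFT) term: negative log-likelihood of the chosen responses.
    sft_loss = -logp_chosen.mean()
    return pref_loss + sft_coeff * sft_loss

# Toy usage with fabricated sequence log-probabilities for four preference pairs.
lp_c = torch.tensor([-5.0, -6.0, -4.5, -7.0])
lp_r = torch.tensor([-6.5, -6.2, -5.0, -7.5])
ref_c = torch.tensor([-5.5, -6.1, -4.8, -7.2])
ref_r = torch.tensor([-6.0, -6.0, -5.1, -7.1])
loss = regularized_preference_loss(lp_c, lp_r, ref_c, ref_r)
```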
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
- Towards Efficient Exact Optimization of Language Model Alignment [93.39181634597877]
Direct preference optimization (DPO) was proposed to directly optimize the policy from preference data.
We show that DPO, which is derived from the optimal solution of the problem, leads in practice to a compromised mean-seeking approximation of that optimal solution.
We propose efficient exact optimization (EXO) of the alignment objective.
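The "mean-seeking" claim contrasts the two directions of the KL divergence. The toy sketch below (an illustration, not the paper's EXO algorithm) fits a single Gaussian to a bimodal target: minimizing forward KL(p||q) spreads mass across both modes, while minimizing reverse KL(q||p) locks onto one mode.

```python
# Toy numerical illustration (assumptions, not the paper's setup): forward KL
# is "mean-seeking", reverse KL is "mode-seeking", when the approximating
# family (a single Gaussian) cannot represent both modes of the target.
import numpy as np

x = np.linspace(-8, 8, 2001)
dx = x[1] - x[0]

def gauss(mean, std):
    d = np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))
    return d / (d.sum() * dx)            # renormalize on the grid

def kl(p, q):
    return float(np.sum(p * np.log((p + 1e-300) / (q + 1e-300))) * dx)

p = 0.5 * gauss(-3, 0.5) + 0.5 * gauss(3, 0.5)   # bimodal target
candidates = {(m, s): gauss(m, s) for m in (-3.0, 0.0, 3.0) for s in (0.5, 3.0)}

best_forward = min(candidates, key=lambda k: kl(p, candidates[k]))
best_reverse = min(candidates, key=lambda k: kl(candidates[k], p))
print("forward KL picks mean/std:", best_forward)  # broad Gaussian centered between the modes
print("reverse KL picks mean/std:", best_reverse)  # narrow Gaussian locked onto one mode
```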
arXiv Detail & Related papers (2024-02-01T18:51:54Z)
- Acceleration in Policy Optimization [50.323182853069184]
We work towards a unifying paradigm for accelerating policy optimization methods in reinforcement learning (RL) by integrating foresight in the policy improvement step via optimistic and adaptive updates.
We define optimism as predictive modelling of the future behavior of a policy, and adaptivity as taking immediate and anticipatory corrective actions to mitigate errors from overshooting predictions or delayed responses to change.
We design an optimistic policy gradient algorithm, adaptive via meta-gradient learning, and empirically highlight several design choices pertaining to acceleration, in an illustrative task.
arXiv Detail & Related papers (2023-06-18T15:50:57Z)
- Offline Policy Optimization in RL with Variance Regularization [142.87345258222942]
We propose variance regularization for offline RL algorithms, using stationary distribution corrections.
We show that by using Fenchel duality, we can avoid double sampling issues for computing the gradient of the variance regularizer.
The proposed algorithm for offline variance regularization (OVAR) can be used to augment any existing offline policy optimization algorithms.
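The double-sampling issue mentioned above comes from the squared expectation inside the variance. A standard way to see the Fenchel-duality fix, sketched below as an illustration rather than the paper's OVAR algorithm, uses Var[x] = min_nu E[(x - nu)^2]: the square of an expectation becomes a minimization over a scalar dual variable, so a single mini-batch yields an unbiased gradient estimate for fixed nu.

```python
# Illustrative sketch (not OVAR itself): the variance contains (E[x])^2, whose
# naive gradient needs two independent samples of x.  With the conjugate
# identity Var[x] = min_nu E[(x - nu)^2], a single mini-batch gives an
# unbiased estimate of E[(x - nu)^2] and of its gradient for fixed nu.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=100_000)   # stand-in for sampled returns

nu, lr = 0.0, 0.1
for _ in range(2000):
    batch = rng.choice(x, size=64)
    nu -= lr * np.mean(2 * (nu - batch))           # stochastic gradient step on E[(x - nu)^2]

print(np.mean((x - nu) ** 2), x.var())             # both approach Var[x] = 1.5 ** 2
```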
arXiv Detail & Related papers (2022-12-29T18:25:01Z)
- Offline Reinforcement Learning with Closed-Form Policy Improvement Operators [88.54210578912554]
Behavior-constrained policy optimization has been demonstrated to be a successful paradigm for tackling offline reinforcement learning.
In this paper, we propose closed-form policy improvement operators.
We empirically demonstrate their effectiveness over state-of-the-art algorithms on the standard D4RL benchmark.
arXiv Detail & Related papers (2022-11-29T06:29:26Z)
- Trust Region Policy Optimization with Optimal Transport Discrepancies: Duality and Algorithm for Continuous Actions [5.820284464296154]
Trust Region Policy Optimization is a popular approach to stabilizing policy updates.
We propose a novel algorithm - Optimal Transport Trust Region Policy Optimization (OT-TRPO) - for continuous state-action spaces.
Our results show that optimal transport discrepancies can offer an advantage over state-of-the-art approaches.
arXiv Detail & Related papers (2022-10-20T10:04:35Z)
- Near Optimal Policy Optimization via REPS [33.992374484681704]
Relative entropy policy search (REPS) has demonstrated successful policy learning on a number of simulated and real-world robotic domains.
There exist no guarantees on REPS's performance when using gradient-based solvers.
We introduce a technique that uses generative access to the underlying decision process to compute parameter updates that maintain favorable convergence to the optimal regularized policy.
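For context, the snippet below sketches the standard REPS-style update the summary refers to: samples are reweighted by exp(advantage / eta), with the temperature eta obtained from the REPS dual for a KL budget epsilon. The grid search and variable names are illustrative assumptions; the paper's contribution concerns convergence guarantees when such updates are computed with generative access, not this recipe itself.

```python
# Hedged sketch of a REPS-style update: exponential reweighting of sampled
# advantages, with the temperature chosen by (grid-)minimizing the REPS dual.
import numpy as np

def reps_weights(advantages, epsilon=0.1, eta_grid=np.logspace(-2, 2, 200)):
    a = advantages - advantages.max()                 # shift for numerical stability
    # REPS dual (up to a constant): g(eta) = eta*epsilon + eta*log E[exp(A/eta)]
    dual = eta_grid * epsilon + eta_grid * np.log(
        np.mean(np.exp(a[None, :] / eta_grid[:, None]), axis=1))
    eta = eta_grid[np.argmin(dual)]                   # temperature from the KL budget
    w = np.exp(a / eta)
    return w / w.sum()                                # normalized sample weights

# Usage: weights for a weighted maximum-likelihood policy fit.
adv = np.random.randn(1000)
w = reps_weights(adv, epsilon=0.1)
```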
arXiv Detail & Related papers (2021-03-17T16:22:59Z)
- Iterative Amortized Policy Optimization [147.63129234446197]
Policy networks are a central feature of deep reinforcement learning (RL) algorithms for continuous control.
From the variational inference perspective, policy networks are a form of amortized optimization, optimizing network parameters rather than the policy distributions directly.
We demonstrate that iterative amortized policy optimization yields performance improvements over direct amortization on benchmark continuous control tasks.
arXiv Detail & Related papers (2020-10-20T23:25:42Z)
- Optimistic Distributionally Robust Policy Optimization [2.345728642535161]
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are prone to converging to a sub-optimal solution because they limit the policy representation to a particular parametric distribution class.
We develop an innovative Optimistic Distributionally Robust Policy Optimization (ODRO) algorithm to solve the trust region constrained optimization problem without parameterizing the policies.
Our algorithm improves on TRPO and PPO, with higher sample efficiency and better final-policy performance, while maintaining learning stability.
arXiv Detail & Related papers (2020-06-14T06:36:18Z)
- A Kernel Mean Embedding Approach to Reducing Conservativeness in Stochastic Programming and Control [13.739881592455044]
We apply kernel mean embedding methods to sample-based optimization and control.
The effect of such constraint removal is improved optimality and decreased conservativeness.
arXiv Detail & Related papers (2020-01-28T15:11:50Z)