Revisiting Design Choices in Proximal Policy Optimization
- URL: http://arxiv.org/abs/2009.10897v1
- Date: Wed, 23 Sep 2020 02:00:34 GMT
- Title: Revisiting Design Choices in Proximal Policy Optimization
- Authors: Chloe Ching-Yun Hsu, Celestine Mendler-Dünner, Moritz Hardt
- Abstract summary: Proximal Policy Optimization (PPO) is a popular deep policy gradient algorithm.
These design choices are widely accepted, and motivated by empirical performance comparisons on MuJoCo and Atari benchmarks.
We revisit these practices outside the regime of current benchmarks, and expose three failure modes of standard PPO.
- Score: 21.721075405670916
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Proximal Policy Optimization (PPO) is a popular deep policy gradient
algorithm. In standard implementations, PPO regularizes policy updates with
clipped probability ratios, and parameterizes policies with either continuous
Gaussian distributions or discrete Softmax distributions. These design choices
are widely accepted, and motivated by empirical performance comparisons on
MuJoCo and Atari benchmarks.
We revisit these practices outside the regime of current benchmarks, and
expose three failure modes of standard PPO. We explain why standard design
choices are problematic in these cases, and show that alternative choices of
surrogate objectives and policy parameterizations can prevent the failure
modes. We hope that our work serves as a reminder that many algorithmic design
choices in reinforcement learning are tied to specific simulation environments.
We should not implicitly accept these choices as a standard part of a more
general algorithm.
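For concreteness, the design choices the abstract discusses can be written down in a few lines. The following is a minimal PyTorch sketch, not code from the paper: the clipped-ratio surrogate objective and the two standard policy parameterizations (Gaussian for continuous control, Softmax for discrete control). The clip constant 0.2 is the common default and all names are illustrative.

```python
import torch
from torch.distributions import Categorical, Normal

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate, returned as a loss to minimize."""
    ratio = torch.exp(new_log_probs - old_log_probs)   # pi_theta(a|s) / pi_theta_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

def gaussian_log_prob(mean, log_std, action):
    """Continuous actions: factorized Gaussian policy head."""
    return Normal(mean, log_std.exp()).log_prob(action).sum(-1)

def softmax_log_prob(logits, action):
    """Discrete actions: Softmax (categorical) policy head."""
    return Categorical(logits=logits).log_prob(action)
```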
Related papers
- SOAP-RL: Sequential Option Advantage Propagation for Reinforcement Learning in POMDP Environments [18.081732498034047]
This work compares ways of extending Reinforcement Learning algorithms to Partially Observed Markov Decision Processes (POMDPs) with options.
Two algorithms, PPOEM and SOAP, are proposed and studied in depth to address this problem.
arXiv Detail & Related papers (2024-07-26T17:59:55Z)
- Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
- Optimal Baseline Corrections for Off-Policy Contextual Bandits [61.740094604552475]
We aim to learn decision policies that optimize an unbiased offline estimate of an online reward metric.
We propose a single framework built on their equivalence in learning scenarios.
Our framework enables us to characterize the variance-optimal unbiased estimator and provide a closed-form solution for it.
arXiv Detail & Related papers (2024-05-09T12:52:22Z)
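To illustrate the baseline-correction idea in the entry above, here is a minimal NumPy sketch of a generic baseline-corrected inverse propensity scoring (IPS) estimator. It is not the paper's variance-optimal estimator (that paper derives a closed-form optimal baseline, which is not reproduced here); the constant baseline is left as a free parameter, and the estimator stays unbiased for any constant choice as long as the importance weights have expectation one.

```python
import numpy as np

def baseline_corrected_ips(rewards, target_probs, logging_probs, baseline=0.0):
    """Generic control-variate form of IPS: unbiased for any constant baseline."""
    weights = target_probs / logging_probs   # importance weights pi(a|x) / pi0(a|x)
    return np.mean(weights * (rewards - baseline)) + baseline
```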
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
- Towards Efficient Exact Optimization of Language Model Alignment [93.39181634597877]
Direct preference optimization (DPO) was proposed to directly optimize the policy from preference data.
We show that DPO, derived from the optimal solution of the problem, leads to a compromised mean-seeking approximation of the optimal solution in practice.
We propose efficient exact optimization (EXO) of the alignment objective.
arXiv Detail & Related papers (2024-02-01T18:51:54Z)
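As background for the entry above, here is a minimal PyTorch sketch of the standard DPO objective it refers to: a logistic loss on the margin between policy-versus-reference log-ratios of the chosen and rejected responses. This is not the EXO method proposed in that paper; the input log-probabilities and the temperature beta are placeholders.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss on a batch of preference pairs."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```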
- $K$-Nearest-Neighbor Resampling for Off-Policy Evaluation in Stochastic Control [0.6906005491572401]
We propose a novel $K$-nearest neighbor reparametric procedure for estimating the performance of a policy from historical data.
Our analysis allows for the sampling of entire episodes, as is common practice in most applications.
Compared to other OPE methods, our algorithm does not require optimization, can be efficiently implemented via tree-based nearest neighbor search and parallelization, and does not explicitly assume a parametric model for the environment's dynamics.
arXiv Detail & Related papers (2023-06-07T23:55:12Z)
- Truly Deterministic Policy Optimization [3.07015565161719]
We present a policy gradient method that avoids exploratory noise injection and performs policy search over the deterministic landscape.
We show that it is possible to compute exact advantage estimates if both the state transition model and the policy are deterministic.
arXiv Detail & Related papers (2022-05-30T18:49:33Z)
- Off-Policy Evaluation with Policy-Dependent Optimization Response [90.28758112893054]
We develop a new framework for off-policy evaluation with a policy-dependent linear optimization response.
We construct unbiased estimators for the policy-dependent estimand by a perturbation method.
We provide a general algorithm for optimizing causal interventions.
arXiv Detail & Related papers (2022-02-25T20:25:37Z)
- On the Optimality of Batch Policy Optimization Algorithms [106.89498352537682]
Batch policy optimization considers leveraging existing data for policy construction before interacting with an environment.
We show that any confidence-adjusted index algorithm is minimax optimal, whether it be optimistic, pessimistic or neutral.
We introduce a new weighted-minimax criterion that considers the inherent difficulty of optimal value prediction.
arXiv Detail & Related papers (2021-04-06T05:23:20Z)
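To make the notion of a confidence-adjusted index in the entry above concrete, here is a toy NumPy sketch: the index is an empirical value estimate plus a signed multiple of a confidence width, so the sign of alpha selects optimism, pessimism, or neutrality. The Hoeffding-style width and all names are illustrative assumptions, not the paper's construction.

```python
import numpy as np

def confidence_adjusted_index(mean_rewards, counts, alpha, delta=0.05):
    """Index = empirical value + alpha * confidence width.
    alpha > 0: optimistic, alpha < 0: pessimistic, alpha = 0: neutral (greedy)."""
    width = np.sqrt(np.log(2.0 / delta) / (2.0 * np.maximum(counts, 1)))  # Hoeffding-style width
    return mean_rewards + alpha * width

def select_action(mean_rewards, counts, alpha):
    # Pick the action with the largest confidence-adjusted index.
    return int(np.argmax(confidence_adjusted_index(mean_rewards, counts, alpha)))
```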
- PC-PG: Policy Cover Directed Exploration for Provable Policy Gradient Learning [35.044047991893365]
This work introduces the Policy Cover-Policy Gradient (PC-PG) algorithm, which balances the exploration vs. exploitation tradeoff using an ensemble of policies (the policy cover).
We show that PC-PG has strong guarantees under model misspecification that go beyond the standard worst case $\ell_\infty$ assumptions.
We also complement the theory with empirical evaluation across a variety of domains in both reward-free and reward-driven settings.
arXiv Detail & Related papers (2020-07-16T16:57:41Z)
- Strengthening Deterministic Policies for POMDPs [5.092711491848192]
We provide a novel MILP encoding that supports sophisticated specifications in the form of temporal logic constraints.
We employ a preprocessing of the POMDP to encompass memory-based decisions.
The advantages of our approach lie (1) in the flexibility to strengthen simple deterministic policies without losing computational tractability and (2) in the ability to enforce the provable satisfaction of arbitrarily many specifications.
arXiv Detail & Related papers (2020-07-16T14:22:55Z)