Revisiting Design Choices in Proximal Policy Optimization
- URL: http://arxiv.org/abs/2009.10897v1
- Date: Wed, 23 Sep 2020 02:00:34 GMT
- Title: Revisiting Design Choices in Proximal Policy Optimization
- Authors: Chloe Ching-Yun Hsu, Celestine Mendler-Dünner, Moritz Hardt
- Abstract summary: Proximal Policy Optimization (PPO) is a popular deep policy gradient algorithm.
These design choices are widely accepted, and motivated by empirical performance comparisons on MuJoCo and Atari benchmarks.
We revisit these practices outside the regime of current benchmarks, and expose three failure modes of standard PPO.
- Score: 21.721075405670916
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Proximal Policy Optimization (PPO) is a popular deep policy gradient
algorithm. In standard implementations, PPO regularizes policy updates with
clipped probability ratios, and parameterizes policies with either continuous
Gaussian distributions or discrete Softmax distributions. These design choices
are widely accepted, and motivated by empirical performance comparisons on
MuJoCo and Atari benchmarks.
We revisit these practices outside the regime of current benchmarks, and
expose three failure modes of standard PPO. We explain why standard design
choices are problematic in these cases, and show that alternative choices of
surrogate objectives and policy parameterizations can prevent the failure
modes. We hope that our work serves as a reminder that many algorithmic design
choices in reinforcement learning are tied to specific simulation environments.
We should not implicitly accept these choices as a standard part of a more
general algorithm.
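For concreteness, the design choices the abstract discusses can be written down in a few lines. The following is a minimal PyTorch sketch, not code from the paper: the clipped-ratio surrogate objective and the two standard policy parameterizations (Gaussian for continuous control, Softmax for discrete control). The clip constant 0.2 is the common default and all names are illustrative.

```python
import torch
from torch.distributions import Categorical, Normal

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate, returned as a loss to minimize."""
    ratio = torch.exp(new_log_probs - old_log_probs)   # pi_theta(a|s) / pi_theta_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

def gaussian_log_prob(mean, log_std, action):
    """Continuous actions: factorized Gaussian policy head."""
    return Normal(mean, log_std.exp()).log_prob(action).sum(-1)

def softmax_log_prob(logits, action):
    """Discrete actions: Softmax (categorical) policy head."""
    return Categorical(logits=logits).log_prob(action)
```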
Related papers
- SOAP-RL: Sequential Option Advantage Propagation for Reinforcement Learning in POMDP Environments [18.081732498034047]
This work compares ways of extending Reinforcement Learning algorithms to Partially Observed Markov Decision Processes (POMDPs) with options.
Two algorithms, PPOEM and SOAP, are proposed and studied in depth to address this problem.
arXiv Detail & Related papers (2024-07-26T17:59:55Z)
- Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
- Optimal Baseline Corrections for Off-Policy Contextual Bandits [61.740094604552475]
We aim to learn decision policies that optimize an unbiased offline estimate of an online reward metric.
We propose a single framework built on their equivalence in learning scenarios.
Our framework enables us to characterize the variance-optimal unbiased estimator and provide a closed-form solution for it.
arXiv Detail & Related papers (2024-05-09T12:52:22Z)
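To illustrate the baseline-correction idea in the entry above, here is a minimal NumPy sketch of a generic baseline-corrected inverse propensity scoring (IPS) estimator. It is not the paper's variance-optimal estimator (that paper derives a closed-form optimal baseline, which is not reproduced here); the constant baseline is left as a free parameter, and the estimator stays unbiased for any constant choice as long as the importance weights have expectation one.

```python
import numpy as np

def baseline_corrected_ips(rewards, target_probs, logging_probs, baseline=0.0):
    """Generic control-variate form of IPS: unbiased for any constant baseline."""
    weights = target_probs / logging_probs   # importance weights pi(a|x) / pi0(a|x)
    return np.mean(weights * (rewards - baseline)) + baseline
```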
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
- Towards Efficient Exact Optimization of Language Model Alignment [93.39181634597877]
Direct preference optimization (DPO) was proposed to directly optimize the policy from preference data.
We show that DPO, derived from the optimal solution of the problem, leads to a compromised mean-seeking approximation of the optimal solution in practice.
We propose efficient exact optimization (EXO) of the alignment objective.
arXiv Detail & Related papers (2024-02-01T18:51:54Z)
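As background for the entry above, here is a minimal PyTorch sketch of the standard DPO objective it refers to: a logistic loss on the margin between policy-versus-reference log-ratios of the chosen and rejected responses. This is not the EXO method proposed in that paper; the input log-probabilities and the temperature beta are placeholders.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss on a batch of preference pairs."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```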
- $K$-Nearest-Neighbor Resampling for Off-Policy Evaluation in Stochastic Control [0.6906005491572401]
We propose a novel $K$-nearest neighbor reparametric procedure for estimating the performance of a policy from historical data.
Our analysis allows for the sampling of entire episodes, as is common practice in most applications.
Compared to other OPE methods, our algorithm does not require optimization, can be efficiently implemented via tree-based nearest neighbor search and parallelization, and does not explicitly assume a parametric model for the environment's dynamics.
arXiv Detail & Related papers (2023-06-07T23:55:12Z)
- Truly Deterministic Policy Optimization [3.07015565161719]
We present a policy gradient method that avoids exploratory noise injection and performs policy search over the deterministic landscape.
We show that it is possible to compute exact advantage estimates if both the state transition model and the policy are deterministic.
arXiv Detail & Related papers (2022-05-30T18:49:33Z)
- Off-Policy Evaluation with Policy-Dependent Optimization Response [90.28758112893054]
We develop a new framework for off-policy evaluation with a policy-dependent linear optimization response.
We construct unbiased estimators for the policy-dependent estimand by a perturbation method.
We provide a general algorithm for optimizing causal interventions.
arXiv Detail & Related papers (2022-02-25T20:25:37Z)
- On the Optimality of Batch Policy Optimization Algorithms [106.89498352537682]
Batch policy optimization considers leveraging existing data for policy construction before interacting with an environment.
We show that any confidence-adjusted index algorithm is minimax optimal, whether it be optimistic, pessimistic or neutral.
We introduce a new weighted-minimax criterion that considers the inherent difficulty of optimal value prediction.
arXiv Detail & Related papers (2021-04-06T05:23:20Z)
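To make the notion of a confidence-adjusted index in the entry above concrete, here is a toy NumPy sketch: the index is an empirical value estimate plus a signed multiple of a confidence width, so the sign of alpha selects optimism, pessimism, or neutrality. The Hoeffding-style width and all names are illustrative assumptions, not the paper's construction.

```python
import numpy as np

def confidence_adjusted_index(mean_rewards, counts, alpha, delta=0.05):
    """Index = empirical value + alpha * confidence width.
    alpha > 0: optimistic, alpha < 0: pessimistic, alpha = 0: neutral (greedy)."""
    width = np.sqrt(np.log(2.0 / delta) / (2.0 * np.maximum(counts, 1)))  # Hoeffding-style width
    return mean_rewards + alpha * width

def select_action(mean_rewards, counts, alpha):
    # Pick the action with the largest confidence-adjusted index.
    return int(np.argmax(confidence_adjusted_index(mean_rewards, counts, alpha)))
```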
- PC-PG: Policy Cover Directed Exploration for Provable Policy Gradient Learning [35.044047991893365]
This work introduces the Policy Cover-Policy Gradient (PC-PG) algorithm, which balances the exploration vs. exploitation tradeoff using an ensemble of policies (the policy cover).
We show that PC-PG has strong guarantees under model misspecification that go beyond the standard worst case $\ell_\infty$ assumptions.
We also complement the theory with empirical evaluation across a variety of domains in both reward-free and reward-driven settings.
arXiv Detail & Related papers (2020-07-16T16:57:41Z)
- Strengthening Deterministic Policies for POMDPs [5.092711491848192]
We provide a novel MILP encoding that supports sophisticated specifications in the form of temporal logic constraints.
We employ a preprocessing of the POMDP to encompass memory-based decisions.
The advantages of our approach lie (1) in the flexibility to strengthen simple deterministic policies without losing computational tractability and (2) in the ability to enforce the provable satisfaction of arbitrarily many specifications.
arXiv Detail & Related papers (2020-07-16T14:22:55Z)