Novel Policy Seeking with Constrained Optimization
- URL: http://arxiv.org/abs/2005.10696v3
- Date: Sat, 29 Oct 2022 16:49:53 GMT
- Title: Novel Policy Seeking with Constrained Optimization
- Authors: Hao Sun, Zhenghao Peng, Bo Dai, Jian Guo, Dahua Lin, Bolei Zhou
- Abstract summary: We propose to rethink the problem of generating novel policies in reinforcement learning tasks.
We first introduce a new metric to evaluate the difference between policies and then design two practical novel policy generation methods.
The two proposed methods, namely the Constrained Task Novel Bisector (CTNB) and the Interior Policy Differentiation (IPD), are derived from the feasible direction method and the interior point method commonly known in the constrained optimization literature.
- Score: 131.67409598529287
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In problem-solving, we humans can come up with multiple novel solutions to
the same problem. However, reinforcement learning algorithms can only produce a
set of monotonous policies that maximize the cumulative reward but lack
diversity and novelty. In this work, we address the problem of generating novel
policies in reinforcement learning tasks. Instead of following the
multi-objective framework used in existing methods, we propose to rethink the
problem under a novel perspective of constrained optimization. We first
introduce a new metric to evaluate the difference between policies and then
design two practical novel policy generation methods following the new
perspective. The two proposed methods, namely the Constrained Task Novel
Bisector (CTNB) and the Interior Policy Differentiation (IPD), are derived from
the feasible direction method and the interior point method commonly known in
the constrained optimization literature. Experimental comparisons on the MuJoCo
control suite show our methods can achieve substantial improvement over
previous novelty-seeking methods in terms of both the novelty of the policies and
their performance on the primal task.
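Below is a minimal sketch of the constrained-optimization view described above, assuming a simple parameter-space novelty metric and a bisector-style feasible-direction update; both are illustrative stand-ins rather than the paper's actual metric or the exact CTNB/IPD procedures.

```python
import numpy as np

def novelty(theta, reference_policies):
    """Hypothetical novelty measure: mean parameter-space distance to previously
    found policies (a crude stand-in for the paper's policy-difference metric)."""
    return float(np.mean([np.linalg.norm(theta - ref) for ref in reference_policies]))

def feasible_direction(g_task, g_novelty, novelty_value, threshold):
    """Feasible-direction (bisector-style) rule: if the novelty constraint is
    satisfied, follow the task-reward gradient; otherwise follow the bisector of
    the normalized task and novelty gradients so the update keeps improving the
    task while pushing back toward the feasible (sufficiently novel) region."""
    if novelty_value >= threshold:
        return g_task
    u = g_task / (np.linalg.norm(g_task) + 1e-8)
    v = g_novelty / (np.linalg.norm(g_novelty) + 1e-8)
    bisector = u + v
    return bisector / (np.linalg.norm(bisector) + 1e-8) * np.linalg.norm(g_task)

# Toy usage with random gradients and reference policies.
rng = np.random.default_rng(0)
theta = rng.normal(size=8)
refs = [rng.normal(size=8) for _ in range(3)]
g_task, g_nov = rng.normal(size=8), rng.normal(size=8)
step = feasible_direction(g_task, g_nov, novelty(theta, refs), threshold=5.0)
theta = theta + 1e-2 * step  # gradient-ascent step along the chosen direction
```

An interior-point-flavored variant in the spirit of IPD would instead keep the optimization strictly inside the sufficiently-novel region, for example by filtering out experience that violates the novelty constraint, rather than bending the gradient direction.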
Related papers
- Control in Stochastic Environment with Delays: A Model-based
Reinforcement Learning Approach [3.130722489512822]
We introduce a new reinforcement learning method for control problems in environments with delayed feedback.
Specifically, our method employs stochastic planning, versus previous methods that used deterministic planning.
We show that this formulation can recover the optimal policy for problems with deterministic transitions.
arXiv Detail & Related papers (2024-02-01T03:53:56Z) - Last-Iterate Convergent Policy Gradient Primal-Dual Methods for
Constrained MDPs [107.28031292946774]
We study the problem of computing an optimal policy of an infinite-horizon discounted constrained Markov decision process (constrained MDP).
We develop two single-time-scale policy-based primal-dual algorithms with non-asymptotic convergence of their policy iterates to an optimal constrained policy.
To the best of our knowledge, this work appears to be the first non-asymptotic policy last-iterate convergence result for single-time-scale algorithms in constrained MDPs.
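As a generic illustration of the primal-dual idea (not the specific single-time-scale algorithms of this paper), here is a minimal Lagrangian sketch in which the policy parameters ascend the reward objective minus a λ-weighted cost while the multiplier ascends the constraint violation:

```python
import numpy as np

def primal_dual_step(theta, lam, grad_reward, grad_cost, cost_value, budget,
                     lr_theta=1e-2, lr_lam=1e-2):
    """One step of Lagrangian primal-dual policy optimization (generic sketch).
    Primal: ascend L = J_reward - lam * (J_cost - budget) in theta.
    Dual:   ascend the violation (J_cost - budget) in lam, projected to lam >= 0."""
    theta = theta + lr_theta * (grad_reward - lam * grad_cost)
    lam = max(0.0, lam + lr_lam * (cost_value - budget))
    return theta, lam

# Toy usage with synthetic gradient and cost estimates for a parameter vector.
rng = np.random.default_rng(1)
theta, lam = rng.normal(size=4), 0.0
for _ in range(100):
    grad_r, grad_c = rng.normal(size=4), rng.normal(size=4)  # placeholders for policy-gradient estimates
    cost = float(rng.uniform(0.0, 2.0))                      # placeholder for estimated cumulative cost
    theta, lam = primal_dual_step(theta, lam, grad_r, grad_c, cost, budget=1.0)
```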
arXiv Detail & Related papers (2023-06-20T17:27:31Z) - Acceleration in Policy Optimization [50.323182853069184]
We work towards a unifying paradigm for accelerating policy optimization methods in reinforcement learning (RL) by integrating foresight in the policy improvement step via optimistic and adaptive updates.
We define optimism as predictive modelling of the future behavior of a policy, and adaptivity as taking immediate and anticipatory corrective actions to mitigate errors from overshooting predictions or delayed responses to change.
We design an optimistic policy gradient algorithm, adaptive via meta-gradient learning, and empirically highlight several design choices pertaining to acceleration, in an illustrative task.
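One common way to encode such optimism is gradient extrapolation; the toy sketch below steps along 2·g_t − g_{t−1} as a prediction of the upcoming gradient, and leaves out the paper's meta-gradient adaptivity entirely:

```python
import numpy as np

def optimistic_step(theta, grad_now, grad_prev, lr=1e-2):
    """Optimistic (extrapolated) gradient ascent: step along 2*g_t - g_{t-1},
    i.e. the current gradient plus a prediction of how it is changing."""
    return theta + lr * (2.0 * grad_now - grad_prev)

theta = np.zeros(4)
grad_prev = np.zeros(4)
for t in range(10):
    grad_now = np.ones(4) * 0.1  # placeholder for a policy-gradient estimate at step t
    theta = optimistic_step(theta, grad_now, grad_prev)
    grad_prev = grad_now
```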
arXiv Detail & Related papers (2023-06-18T15:50:57Z) - Sample-Efficient Multi-Objective Learning via Generalized Policy
Improvement Prioritization [8.836422771217084]
Multi-objective reinforcement learning (MORL) algorithms tackle sequential decision problems where agents may have different preferences over the objectives.
We introduce a novel algorithm that uses Generalized Policy Improvement (GPI) to define principled, formally-derived prioritization schemes.
We empirically show that our method outperforms state-of-the-art MORL algorithms in challenging multi-objective tasks.
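As a minimal illustration of the Generalized Policy Improvement step the method builds on (the prioritization scheme itself is not reproduced here), assuming vector-valued action-value estimates and a linear preference over objectives:

```python
import numpy as np

def gpi_action(q_values, w):
    """Generalized Policy Improvement: given per-policy multi-objective action
    values q_values[i][a] (a vector of per-objective returns) and a preference
    weight vector w, act greedily w.r.t. the best policy's scalarized value,
    i.e. argmax_a max_i w . q_values[i][a]."""
    scalarized = np.einsum('iao,o->ia', q_values, w)  # shape: (num_policies, num_actions)
    return int(np.unravel_index(np.argmax(scalarized), scalarized.shape)[1])

# Toy usage: 3 policies, 4 actions, 2 objectives.
rng = np.random.default_rng(2)
q = rng.normal(size=(3, 4, 2))
action = gpi_action(q, w=np.array([0.7, 0.3]))
```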
arXiv Detail & Related papers (2023-01-18T20:54:40Z) - Optimistic Linear Support and Successor Features as a Basis for Optimal
Policy Transfer [7.970144204429356]
We introduce an SF-based extension of the Optimistic Linear Support algorithm to learn a set of policies whose SFs form a convex coverage set.
We prove that policies in this set can be combined via generalized policy improvement to construct optimal behaviors for any new linearly-expressible tasks.
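A small sketch of how successor features support generalized policy improvement on a new linearly-expressible task; the convex-coverage-set construction from the Optimistic Linear Support extension is omitted:

```python
import numpy as np

def sf_gpi_action(psi, w_new):
    """Successor-feature GPI: with Q^pi_i(s, a) = psi_i(s, a) . w for any linear
    reward r = phi . w, evaluate every stored policy on the new task weights
    w_new and act greedily w.r.t. the best of them."""
    q = psi @ w_new                   # shape: (num_policies, num_actions)
    return int(np.argmax(q.max(axis=0)))

# Toy usage: successor features of 2 policies, 3 actions, 4 reward features at one state.
rng = np.random.default_rng(3)
psi = rng.normal(size=(2, 3, 4))
a = sf_gpi_action(psi, w_new=rng.normal(size=4))
```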
arXiv Detail & Related papers (2022-06-22T19:00:08Z) - Memory-Constrained Policy Optimization [59.63021433336966]
We introduce a new constrained optimization method for policy gradient reinforcement learning.
We form a second trust region by constructing a virtual policy that represents a wide range of past policies.
We then constrain the new policy to stay close to this virtual policy, which is beneficial when the old policy performs badly.
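To picture the idea, the sketch below treats the virtual policy as a simple average of past action distributions and enforces both trust regions as KL penalties; both choices are illustrative assumptions rather than the paper's construction:

```python
import numpy as np

def kl(p, q, eps=1e-8):
    """KL divergence between two discrete action distributions."""
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return float(np.sum(p * np.log(p / q)))

def penalized_objective(surrogate, pi_new, pi_old, pi_virtual,
                        beta_old=1.0, beta_virtual=1.0):
    """Policy-gradient surrogate with two trust-region penalties: one toward the
    most recent policy and one toward a 'virtual' policy summarizing past policies."""
    return surrogate - beta_old * kl(pi_old, pi_new) - beta_virtual * kl(pi_virtual, pi_new)

# Toy usage: the virtual policy here is just a running average of past action
# distributions at a given state.
past = [np.array([0.6, 0.3, 0.1]), np.array([0.5, 0.4, 0.1]), np.array([0.2, 0.5, 0.3])]
pi_virtual = np.mean(past, axis=0)
obj = penalized_objective(surrogate=1.2, pi_new=np.array([0.4, 0.4, 0.2]),
                          pi_old=past[-1], pi_virtual=pi_virtual)
```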
arXiv Detail & Related papers (2022-04-20T08:50:23Z) - State Augmented Constrained Reinforcement Learning: Overcoming the
Limitations of Learning with Rewards [88.30521204048551]
A common formulation of constrained reinforcement learning involves multiple rewards that must individually accumulate to given thresholds.
We show a simple example in which the desired optimal policy cannot be induced by any weighted linear combination of rewards.
This work addresses this shortcoming by augmenting the state with Lagrange multipliers and reinterpreting primal-dual methods.
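A rough sketch of the state-augmentation idea: the Lagrange multipliers become part of the observation and are updated by dual ascent during interaction (the paper's reinterpretation of primal-dual methods is more involved than this):

```python
import numpy as np

def augment_observation(obs, lam):
    """Append the current Lagrange multipliers to the raw observation so the
    policy can condition its behavior on how binding each constraint is."""
    return np.concatenate([obs, np.atleast_1d(lam)])

def dual_update(lam, constraint_return, threshold, lr=0.05):
    """Dual ascent on the multiplier: grow lam while the accumulated constraint
    reward falls short of its threshold, shrink it otherwise (kept nonnegative)."""
    return np.maximum(0.0, lam + lr * (threshold - constraint_return))

# Toy interaction loop with a single constraint.
rng = np.random.default_rng(4)
lam = np.array([0.0])
for episode in range(5):
    constraint_return = float(rng.uniform(0.0, 2.0))  # placeholder accumulated constraint reward
    obs = rng.normal(size=3)                          # placeholder environment observation
    aug_obs = augment_observation(obs, lam)           # what the lam-conditioned policy would see
    lam = dual_update(lam, constraint_return, threshold=1.0)
```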
arXiv Detail & Related papers (2021-02-23T21:07:35Z) - SOAC: The Soft Option Actor-Critic Architecture [25.198302636265286]
Methods have been proposed for concurrently learning low-level intra-option policies and a high-level option selection policy.
Existing methods typically suffer from two major challenges: ineffective exploration and unstable updates.
We present a novel and stable off-policy approach that builds on the maximum entropy model to address these challenges.
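A tiny sketch of the maximum-entropy ingredient applied to option selection, i.e. sampling options from a Boltzmann distribution over option values rather than picking them greedily; the full off-policy actor-critic machinery is not shown:

```python
import numpy as np

def soft_option_distribution(option_values, temperature=1.0):
    """Entropy-regularized (soft) option selection: a Boltzmann distribution over
    option-value estimates, which keeps exploration over options alive."""
    z = option_values / temperature
    z = z - z.max()                       # numerical stability
    probs = np.exp(z)
    return probs / probs.sum()

rng = np.random.default_rng(5)
q_options = np.array([1.0, 0.5, 0.2])     # placeholder option-value estimates at some state
probs = soft_option_distribution(q_options, temperature=0.5)
option = int(rng.choice(len(probs), p=probs))
```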
arXiv Detail & Related papers (2020-06-25T13:06:59Z) - Variational Policy Propagation for Multi-agent Reinforcement Learning [68.26579560607597]
We propose a collaborative multi-agent reinforcement learning algorithm named variational policy propagation (VPP) to learn a joint policy through interactions over agents.
We prove that the joint policy is a Markov Random Field under some mild conditions, which in turn reduces the policy space effectively.
We integrate variational inference as special differentiable layers into the policy, such that actions can be efficiently sampled from the Markov Random Field and the overall policy is differentiable.
arXiv Detail & Related papers (2020-04-19T15:42:55Z)