Novel Policy Seeking with Constrained Optimization
- URL: http://arxiv.org/abs/2005.10696v3
- Date: Sat, 29 Oct 2022 16:49:53 GMT
- Title: Novel Policy Seeking with Constrained Optimization
- Authors: Hao Sun, Zhenghao Peng, Bo Dai, Jian Guo, Dahua Lin, Bolei Zhou
- Abstract summary: We propose to rethink the problem of generating novel policies in reinforcement learning tasks.
We first introduce a new metric to evaluate the difference between policies and then design two practical novel policy generation methods.
The two proposed methods, namely the Constrained Task Novel Bisector (CTNB) and the Interior Policy Differentiation (IPD), are derived from the feasible direction method and the interior point method commonly known in the constrained optimization literature.
- Score: 131.67409598529287
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In problem-solving, we humans can come up with multiple novel solutions to
the same problem. However, reinforcement learning algorithms can only produce a
set of monotonous policies that maximize the cumulative reward but lack
diversity and novelty. In this work, we address the problem of generating novel
policies in reinforcement learning tasks. Instead of following the
multi-objective framework used in existing methods, we propose to rethink the
problem under a novel perspective of constrained optimization. We first
introduce a new metric to evaluate the difference between policies and then
design two practical novel policy generation methods following the new
perspective. The two proposed methods, namely the Constrained Task Novel
Bisector (CTNB) and the Interior Policy Differentiation (IPD), are derived from
the feasible direction method and the interior point method commonly known in
the constrained optimization literature. Experimental comparisons on the MuJoCo
control suite show our methods can achieve substantial improvement over
previous novelty-seeking methods in terms of both the novelty of the policies and
their performance on the primal task.
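Below is a minimal sketch of the constrained-optimization view described above, assuming a simple parameter-space novelty metric and a bisector-style feasible-direction update; both are illustrative stand-ins rather than the paper's actual metric or the exact CTNB/IPD procedures.

```python
import numpy as np

def novelty(theta, reference_policies):
    """Hypothetical novelty measure: mean parameter-space distance to previously
    found policies (a crude stand-in for the paper's policy-difference metric)."""
    return float(np.mean([np.linalg.norm(theta - ref) for ref in reference_policies]))

def feasible_direction(g_task, g_novelty, novelty_value, threshold):
    """Feasible-direction (bisector-style) rule: if the novelty constraint is
    satisfied, follow the task-reward gradient; otherwise follow the bisector of
    the normalized task and novelty gradients so the update keeps improving the
    task while pushing back toward the feasible (sufficiently novel) region."""
    if novelty_value >= threshold:
        return g_task
    u = g_task / (np.linalg.norm(g_task) + 1e-8)
    v = g_novelty / (np.linalg.norm(g_novelty) + 1e-8)
    bisector = u + v
    return bisector / (np.linalg.norm(bisector) + 1e-8) * np.linalg.norm(g_task)

# Toy usage with random gradients and reference policies.
rng = np.random.default_rng(0)
theta = rng.normal(size=8)
refs = [rng.normal(size=8) for _ in range(3)]
g_task, g_nov = rng.normal(size=8), rng.normal(size=8)
step = feasible_direction(g_task, g_nov, novelty(theta, refs), threshold=5.0)
theta = theta + 1e-2 * step  # gradient-ascent step along the chosen direction
```

An interior-point-flavored variant in the spirit of IPD would instead keep the optimization strictly inside the sufficiently-novel region, for example by filtering out experience that violates the novelty constraint, rather than bending the gradient direction.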
Related papers
- Control in Stochastic Environment with Delays: A Model-based
Reinforcement Learning Approach [3.130722489512822]
We introduce a new reinforcement learning method for control problems in environments with delayed feedback.
Specifically, our method employs stochastic planning, versus previous methods that used deterministic planning.
We show that this formulation can recover the optimal policy for problems with deterministic transitions.
arXiv Detail & Related papers (2024-02-01T03:53:56Z) - Last-Iterate Convergent Policy Gradient Primal-Dual Methods for
Constrained MDPs [107.28031292946774]
We study the problem of computing an optimal policy of an infinite-horizon discounted constrained Markov decision process (constrained MDP).
We develop two single-time-scale policy-based primal-dual algorithms with non-asymptotic convergence of their policy iterates to an optimal constrained policy.
To the best of our knowledge, this work appears to be the first non-asymptotic policy last-iterate convergence result for single-time-scale algorithms in constrained MDPs.
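As a generic illustration of the primal-dual idea (not the specific single-time-scale algorithms of this paper), here is a minimal Lagrangian sketch in which the policy parameters ascend the reward objective minus a λ-weighted cost while the multiplier ascends the constraint violation:

```python
import numpy as np

def primal_dual_step(theta, lam, grad_reward, grad_cost, cost_value, budget,
                     lr_theta=1e-2, lr_lam=1e-2):
    """One step of Lagrangian primal-dual policy optimization (generic sketch).
    Primal: ascend L = J_reward - lam * (J_cost - budget) in theta.
    Dual:   ascend the violation (J_cost - budget) in lam, projected to lam >= 0."""
    theta = theta + lr_theta * (grad_reward - lam * grad_cost)
    lam = max(0.0, lam + lr_lam * (cost_value - budget))
    return theta, lam

# Toy usage with synthetic gradient and cost estimates for a parameter vector.
rng = np.random.default_rng(1)
theta, lam = rng.normal(size=4), 0.0
for _ in range(100):
    grad_r, grad_c = rng.normal(size=4), rng.normal(size=4)  # placeholders for policy-gradient estimates
    cost = float(rng.uniform(0.0, 2.0))                      # placeholder for estimated cumulative cost
    theta, lam = primal_dual_step(theta, lam, grad_r, grad_c, cost, budget=1.0)
```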
arXiv Detail & Related papers (2023-06-20T17:27:31Z) - Acceleration in Policy Optimization [50.323182853069184]
We work towards a unifying paradigm for accelerating policy optimization methods in reinforcement learning (RL) by integrating foresight in the policy improvement step via optimistic and adaptive updates.
We define optimism as predictive modelling of the future behavior of a policy, and adaptivity as taking immediate and anticipatory corrective actions to mitigate errors from overshooting predictions or delayed responses to change.
We design an optimistic policy gradient algorithm, adaptive via meta-gradient learning, and empirically highlight several design choices pertaining to acceleration, in an illustrative task.
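One common way to encode such optimism is gradient extrapolation; the toy sketch below steps along 2·g_t − g_{t−1} as a prediction of the upcoming gradient, and leaves out the paper's meta-gradient adaptivity entirely:

```python
import numpy as np

def optimistic_step(theta, grad_now, grad_prev, lr=1e-2):
    """Optimistic (extrapolated) gradient ascent: step along 2*g_t - g_{t-1},
    i.e. the current gradient plus a prediction of how it is changing."""
    return theta + lr * (2.0 * grad_now - grad_prev)

theta = np.zeros(4)
grad_prev = np.zeros(4)
for t in range(10):
    grad_now = np.ones(4) * 0.1  # placeholder for a policy-gradient estimate at step t
    theta = optimistic_step(theta, grad_now, grad_prev)
    grad_prev = grad_now
```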
arXiv Detail & Related papers (2023-06-18T15:50:57Z) - Sample-Efficient Multi-Objective Learning via Generalized Policy
Improvement Prioritization [8.836422771217084]
Multi-objective reinforcement learning (MORL) algorithms tackle sequential decision problems where agents may have different preferences over the objectives.
We introduce a novel algorithm that uses Generalized Policy Improvement (GPI) to define principled, formally-derived prioritization schemes.
We empirically show that our method outperforms state-of-the-art MORL algorithms in challenging multi-objective tasks.
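As a minimal illustration of the Generalized Policy Improvement step the method builds on (the prioritization scheme itself is not reproduced here), assuming vector-valued action-value estimates and a linear preference over objectives:

```python
import numpy as np

def gpi_action(q_values, w):
    """Generalized Policy Improvement: given per-policy multi-objective action
    values q_values[i][a] (a vector of per-objective returns) and a preference
    weight vector w, act greedily w.r.t. the best policy's scalarized value,
    i.e. argmax_a max_i w . q_values[i][a]."""
    scalarized = np.einsum('iao,o->ia', q_values, w)  # shape: (num_policies, num_actions)
    return int(np.unravel_index(np.argmax(scalarized), scalarized.shape)[1])

# Toy usage: 3 policies, 4 actions, 2 objectives.
rng = np.random.default_rng(2)
q = rng.normal(size=(3, 4, 2))
action = gpi_action(q, w=np.array([0.7, 0.3]))
```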
arXiv Detail & Related papers (2023-01-18T20:54:40Z) - Optimistic Linear Support and Successor Features as a Basis for Optimal
Policy Transfer [7.970144204429356]
We introduce an SF-based extension of the Optimistic Linear Support algorithm to learn a set of policies whose SFs form a convex coverage set.
We prove that policies in this set can be combined via generalized policy improvement to construct optimal behaviors for any new linearly-expressible tasks.
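A small sketch of how successor features support generalized policy improvement on a new linearly-expressible task; the convex-coverage-set construction from the Optimistic Linear Support extension is omitted:

```python
import numpy as np

def sf_gpi_action(psi, w_new):
    """Successor-feature GPI: with Q^pi_i(s, a) = psi_i(s, a) . w for any linear
    reward r = phi . w, evaluate every stored policy on the new task weights
    w_new and act greedily w.r.t. the best of them."""
    q = psi @ w_new                   # shape: (num_policies, num_actions)
    return int(np.argmax(q.max(axis=0)))

# Toy usage: successor features of 2 policies, 3 actions, 4 reward features at one state.
rng = np.random.default_rng(3)
psi = rng.normal(size=(2, 3, 4))
a = sf_gpi_action(psi, w_new=rng.normal(size=4))
```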
arXiv Detail & Related papers (2022-06-22T19:00:08Z) - Memory-Constrained Policy Optimization [59.63021433336966]
We introduce a new constrained optimization method for policy gradient reinforcement learning.
We form a second trust region by constructing a virtual policy that represents a wide range of past policies.
We then constrain the new policy to stay close to this virtual policy, which is beneficial when the old policy performs badly.
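To picture the idea, the sketch below treats the virtual policy as a simple average of past action distributions and enforces both trust regions as KL penalties; both choices are illustrative assumptions rather than the paper's construction:

```python
import numpy as np

def kl(p, q, eps=1e-8):
    """KL divergence between two discrete action distributions."""
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return float(np.sum(p * np.log(p / q)))

def penalized_objective(surrogate, pi_new, pi_old, pi_virtual,
                        beta_old=1.0, beta_virtual=1.0):
    """Policy-gradient surrogate with two trust-region penalties: one toward the
    most recent policy and one toward a 'virtual' policy summarizing past policies."""
    return surrogate - beta_old * kl(pi_old, pi_new) - beta_virtual * kl(pi_virtual, pi_new)

# Toy usage: the virtual policy here is just a running average of past action
# distributions at a given state.
past = [np.array([0.6, 0.3, 0.1]), np.array([0.5, 0.4, 0.1]), np.array([0.2, 0.5, 0.3])]
pi_virtual = np.mean(past, axis=0)
obj = penalized_objective(surrogate=1.2, pi_new=np.array([0.4, 0.4, 0.2]),
                          pi_old=past[-1], pi_virtual=pi_virtual)
```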
arXiv Detail & Related papers (2022-04-20T08:50:23Z) - State Augmented Constrained Reinforcement Learning: Overcoming the
Limitations of Learning with Rewards [88.30521204048551]
A common formulation of constrained reinforcement learning involves multiple rewards that must individually accumulate to given thresholds.
We show a simple example in which the desired optimal policy cannot be induced by any weighted linear combination of rewards.
This work addresses this shortcoming by augmenting the state with Lagrange multipliers and reinterpreting primal-dual methods.
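A rough sketch of the state-augmentation idea: the Lagrange multipliers become part of the observation and are updated by dual ascent during interaction (the paper's reinterpretation of primal-dual methods is more involved than this):

```python
import numpy as np

def augment_observation(obs, lam):
    """Append the current Lagrange multipliers to the raw observation so the
    policy can condition its behavior on how binding each constraint is."""
    return np.concatenate([obs, np.atleast_1d(lam)])

def dual_update(lam, constraint_return, threshold, lr=0.05):
    """Dual ascent on the multiplier: grow lam while the accumulated constraint
    reward falls short of its threshold, shrink it otherwise (kept nonnegative)."""
    return np.maximum(0.0, lam + lr * (threshold - constraint_return))

# Toy interaction loop with a single constraint.
rng = np.random.default_rng(4)
lam = np.array([0.0])
for episode in range(5):
    constraint_return = float(rng.uniform(0.0, 2.0))  # placeholder accumulated constraint reward
    obs = rng.normal(size=3)                          # placeholder environment observation
    aug_obs = augment_observation(obs, lam)           # what the lam-conditioned policy would see
    lam = dual_update(lam, constraint_return, threshold=1.0)
```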
arXiv Detail & Related papers (2021-02-23T21:07:35Z) - SOAC: The Soft Option Actor-Critic Architecture [25.198302636265286]
Methods have been proposed for concurrently learning low-level intra-option policies and a high-level option selection policy.
Existing methods typically suffer from two major challenges: ineffective exploration and unstable updates.
We present a novel and stable off-policy approach that builds on the maximum entropy model to address these challenges.
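A tiny sketch of the maximum-entropy ingredient applied to option selection, i.e. sampling options from a Boltzmann distribution over option values rather than picking them greedily; the full off-policy actor-critic machinery is not shown:

```python
import numpy as np

def soft_option_distribution(option_values, temperature=1.0):
    """Entropy-regularized (soft) option selection: a Boltzmann distribution over
    option-value estimates, which keeps exploration over options alive."""
    z = option_values / temperature
    z = z - z.max()                       # numerical stability
    probs = np.exp(z)
    return probs / probs.sum()

rng = np.random.default_rng(5)
q_options = np.array([1.0, 0.5, 0.2])     # placeholder option-value estimates at some state
probs = soft_option_distribution(q_options, temperature=0.5)
option = int(rng.choice(len(probs), p=probs))
```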
arXiv Detail & Related papers (2020-06-25T13:06:59Z) - Variational Policy Propagation for Multi-agent Reinforcement Learning [68.26579560607597]
We propose a collaborative multi-agent reinforcement learning algorithm named variational policy propagation (VPP) to learn a joint policy through interactions over agents.
We prove that the joint policy is a Markov Random Field under some mild conditions, which in turn reduces the policy space effectively.
We integrate variational inference as special differentiable layers into the policy, such that actions can be efficiently sampled from the Markov Random Field and the overall policy is differentiable.
arXiv Detail & Related papers (2020-04-19T15:42:55Z)