DGPO: Discovering Multiple Strategies with Diversity-Guided Policy
Optimization
- URL: http://arxiv.org/abs/2207.05631v3
- Date: Fri, 5 Jan 2024 21:45:40 GMT
- Title: DGPO: Discovering Multiple Strategies with Diversity-Guided Policy
Optimization
- Authors: Wentse Chen, Shiyu Huang, Yuan Chiang, Tim Pearce, Wei-Wei Tu, Ting
Chen, Jun Zhu
- Abstract summary: We propose an on-policy algorithm that discovers multiple strategies for solving a given task.
Unlike prior work, it achieves this with a shared policy network trained over a single run.
Experimental results show that our method efficiently discovers diverse strategies in a wide variety of reinforcement learning tasks.
- Score: 34.40615558867965
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most reinforcement learning algorithms seek a single optimal strategy that
solves a given task. However, it can often be valuable to learn a diverse set
of solutions, for instance, to make an agent's interaction with users more
engaging, or to improve the robustness of a policy to unexpected perturbations.
We propose Diversity-Guided Policy Optimization (DGPO), an on-policy algorithm
that discovers multiple strategies for solving a given task. Unlike prior work,
it achieves this with a shared policy network trained over a single run.
Specifically, we design an intrinsic reward based on an information-theoretic
diversity objective. Our final objective alternates between constraints on the
diversity of the strategies and on the extrinsic reward. We solve the
constrained optimization problem by casting it as a probabilistic inference
task and use policy iteration to maximize the derived lower bound. Experimental
results show that our method efficiently discovers diverse strategies in a wide
variety of reinforcement learning tasks. Compared to baseline methods, DGPO
achieves comparable rewards, while discovering more diverse strategies, and
often with better sample efficiency.
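The intrinsic reward described above comes from an information-theoretic diversity objective; a common instantiation of this family of objectives (as in DIAYN-style skill discovery, which DGPO builds on) conditions a shared policy on a latent strategy variable z and rewards states from which a learned discriminator can recover z. The following PyTorch sketch illustrates that general recipe under this assumption; the class and function names (Discriminator, diversity_reward, n_strategies, ...) are illustrative and not identifiers from the paper.

```python
# Minimal sketch of a DIAYN-style diversity intrinsic reward: a latent strategy
# z is fixed per episode, a discriminator q_phi(z | s) learns to infer z from
# visited states, and log q_phi(z|s) - log p(z) rewards states that keep the
# strategies distinguishable. Not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class Discriminator(nn.Module):
    """q_phi(z | s): predicts which latent strategy generated a state."""

    def __init__(self, state_dim: int, n_strategies: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_strategies),
        )

    def forward(self, states: torch.Tensor) -> torch.Tensor:
        return self.net(states)  # unnormalized logits over strategies


def diversity_reward(disc: Discriminator, states: torch.Tensor,
                     z: torch.Tensor, n_strategies: int) -> torch.Tensor:
    """Intrinsic reward log q_phi(z|s) - log p(z), with a uniform prior p(z)."""
    with torch.no_grad():
        log_q = F.log_softmax(disc(states), dim=-1)            # (batch, K)
        log_q_z = log_q.gather(-1, z.unsqueeze(-1)).squeeze(-1)  # (batch,)
    # -log p(z) = log K for a uniform prior over K strategies
    return log_q_z + torch.log(torch.tensor(float(n_strategies)))


def discriminator_loss(disc: Discriminator, states: torch.Tensor,
                       z: torch.Tensor) -> torch.Tensor:
    """Train q_phi to recover the strategy label from visited states."""
    return F.cross_entropy(disc(states), z)


if __name__ == "__main__":
    # Toy usage: 4 strategies, 8-dimensional states, a random rollout batch.
    disc = Discriminator(state_dim=8, n_strategies=4)
    states = torch.randn(32, 8)
    z = torch.randint(0, 4, (32,))
    r_div = diversity_reward(disc, states, z, n_strategies=4)
    loss = discriminator_loss(disc, states, z)
    print(r_div.shape, loss.item())
```

In DGPO this kind of diversity signal is not simply added to the task reward; the paper alternates between constraining diversity and constraining extrinsic reward, solving the resulting constrained problem as probabilistic inference. The sketch only covers the diversity-reward ingredient.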
Related papers
- Human-in-the-Loop Policy Optimization for Preference-Based
Multi-Objective Reinforcement Learning [13.627087954965695]
We propose a human-in-the-loop policy optimization framework for preference-based MORL.
Our method proactively learns the decision maker's (DM's) implicit preference information without requiring any a priori knowledge.
We evaluate our approach against three conventional MORL algorithms and four state-of-the-art preference-based MORL algorithms.
arXiv Detail & Related papers (2024-01-04T09:17:53Z) - Policy Diversity for Cooperative Agents [8.689289576285095]
Multi-agent reinforcement learning aims to find the optimal team cooperative policy to complete a task.
There may exist multiple different ways of cooperating, which domain experts often need access to.
Unfortunately, there is a general lack of effective policy diversity approaches specifically designed for the multi-agent domain.
arXiv Detail & Related papers (2023-08-28T05:23:16Z) - Local Optimization Achieves Global Optimality in Multi-Agent
Reinforcement Learning [139.53668999720605]
We present a multi-agent PPO algorithm in which the local policy of each agent is updated similarly to vanilla PPO.
We prove that with standard regularity conditions on the Markov game and problem-dependent quantities, our algorithm converges to the globally optimal policy at a sublinear rate.
arXiv Detail & Related papers (2023-05-08T16:20:03Z) - Sample-Efficient Multi-Objective Learning via Generalized Policy
Improvement Prioritization [8.836422771217084]
Multi-objective reinforcement learning (MORL) algorithms tackle sequential decision problems where agents may have different preferences.
We introduce a novel algorithm that uses Generalized Policy Improvement (GPI) to define principled, formally-derived prioritization schemes.
We empirically show that our method outperforms state-of-the-art MORL algorithms in challenging multi-objective tasks.
arXiv Detail & Related papers (2023-01-18T20:54:40Z) - Discovering Policies with DOMiNO: Diversity Optimization Maintaining
Near Optimality [26.69352834457256]
We formalize the problem as a Constrained Markov Decision Process.
The objective is to find diverse policies, measured by the distance between the state occupancies of the policies in the set.
We demonstrate that the method can discover diverse and meaningful behaviors in various domains.
arXiv Detail & Related papers (2022-05-26T17:40:52Z) - CAMEO: Curiosity Augmented Metropolis for Exploratory Optimal Policies [62.39667564455059]
We consider and study a distribution of optimal policies.
In experimental simulations we show that CAMEO indeed obtains policies that all solve classic control problems.
We further show that the different policies we sample present different risk profiles, corresponding to interesting practical applications in interpretability.
arXiv Detail & Related papers (2022-05-19T09:48:56Z) - Policy Gradient Bayesian Robust Optimization for Imitation Learning [49.881386773269746]
We derive a novel policy gradient-style robust optimization approach, PG-BROIL, to balance expected performance and risk.
Results suggest PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse.
arXiv Detail & Related papers (2021-06-11T16:49:15Z) - Discovering Diverse Nearly Optimal Policies withSuccessor Features [30.144946007098852]
In reinforcement learning, a set of diverse policies can be useful for exploration, transfer, hierarchy, and robustness.
We propose Diverse Successive Policies, a method for discovering policies that are diverse in the space of Successor Features.
arXiv Detail & Related papers (2021-06-01T17:56:13Z) - Imitation Learning from MPC for Quadrupedal Multi-Gait Control [63.617157490920505]
We present a learning algorithm for training a single policy that imitates multiple gaits of a walking robot.
We use and extend MPC-Net, which is an Imitation Learning approach guided by Model Predictive Control.
We validate our approach on hardware and show that a single learned policy can replace its teacher to control multiple gaits.
arXiv Detail & Related papers (2021-03-26T08:48:53Z) - Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds
Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
arXiv Detail & Related papers (2020-12-28T05:02:26Z) - Variational Policy Propagation for Multi-agent Reinforcement Learning [68.26579560607597]
We propose a collaborative multi-agent reinforcement learning algorithm named variational policy propagation (VPP) to learn a joint policy through interactions among agents.
We prove that the joint policy is a Markov Random Field under some mild conditions, which in turn reduces the policy space effectively.
We integrate variational inference as special differentiable layers in the policy, such that actions can be efficiently sampled from the Markov Random Field and the overall policy is differentiable.
arXiv Detail & Related papers (2020-04-19T15:42:55Z)