NeuPL: Neural Population Learning
- URL: http://arxiv.org/abs/2202.07415v1
- Date: Tue, 15 Feb 2022 14:05:18 GMT
- Title: NeuPL: Neural Population Learning
- Authors: Siqi Liu, Luke Marris, Daniel Hennes, Josh Merel, Nicolas Heess, Thore Graepel
- Abstract summary: Learning in strategy games requires the discovery of diverse policies.
This is often achieved by iteratively training new policies against existing ones, growing a policy population that is robust to exploit.
This iterative approach suffers from two issues in real-world games: a) under finite budget, approximate best-response operators at each iteration need truncating, resulting in under-trained good-responses populating the population; b) repeated learning of basic skills at each iteration is wasteful and becomes intractable in the presence of increasingly strong opponents.
- Score: 37.02099221741667
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning in strategy games (e.g. StarCraft, poker) requires the discovery of
diverse policies. This is often achieved by iteratively training new policies
against existing ones, growing a policy population that is robust to exploit.
This iterative approach suffers from two issues in real-world games: a) under
finite budget, approximate best-response operators at each iteration need
truncating, resulting in under-trained good-responses populating the
population; b) repeated learning of basic skills at each iteration is wasteful
and becomes intractable in the presence of increasingly strong opponents. In
this work, we propose Neural Population Learning (NeuPL) as a solution to both
issues. NeuPL offers convergence guarantees to a population of best-responses
under mild assumptions. By representing a population of policies within a
single conditional model, NeuPL enables transfer learning across policies.
Empirically, we show the generality, improved performance and efficiency of
NeuPL across several test domains. Most interestingly, we show that novel
strategies become more accessible, not less, as the neural population expands.
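To make the core idea concrete, here is a minimal sketch of a neural population: a single conditional network whose policy index selects which member of the population is being executed, so basic skills are shared across all members. Module names, layer sizes, and the index-embedding scheme are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class NeuralPopulation(nn.Module):
    """One conditional network representing a whole population of policies.

    Policy i is selected by conditioning the shared trunk on an embedding of
    the index i, so skills learned by one policy can be reused by the others.
    (Sizes and the embedding scheme are illustrative assumptions.)
    """

    def __init__(self, obs_dim: int, n_actions: int, pop_size: int, hidden: int = 128):
        super().__init__()
        self.policy_embed = nn.Embedding(pop_size, hidden)   # which member to act as
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim + hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.logits = nn.Linear(hidden, n_actions)

    def forward(self, obs: torch.Tensor, policy_id: torch.Tensor) -> torch.Tensor:
        z = self.policy_embed(policy_id)                      # (batch, hidden)
        h = self.trunk(torch.cat([obs, z], dim=-1))
        return self.logits(h)                                 # action logits of member `policy_id`

# Illustrative usage: sample actions for a batch, each acting as a different member.
pop = NeuralPopulation(obs_dim=16, n_actions=4, pop_size=8)
obs = torch.randn(32, 16)
ids = torch.randint(0, 8, (32,))
actions = torch.distributions.Categorical(logits=pop(obs, ids)).sample()
```

In a NeuPL-style training loop, member i of such a model would be trained as an approximate best response to a mixture over the previously added members, while the shared trunk lets skills transfer across iterations instead of being relearned from scratch.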
Related papers
- Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning [55.65738319966385]
We propose a novel online algorithm, iterative Nash policy optimization (INPO).
Unlike previous methods, INPO bypasses the need for estimating the expected win rate for individual responses.
With an LLaMA-3-8B-based SFT model, INPO achieves a 42.6% length-controlled win rate on AlpacaEval 2.0 and a 37.8% win rate on Arena-Hard.
arXiv Detail & Related papers (2024-06-30T08:00:34Z)
- Neural Population Learning beyond Symmetric Zero-sum Games [52.20454809055356]
We introduce NeuPL-JPSRO, a neural population learning algorithm that benefits from transfer learning of skills and converges to a Coarse Correlated Equilibrium (CCE) of the game.
Our work shows that equilibrium convergent population learning can be implemented at scale and in generality.
arXiv Detail & Related papers (2024-01-10T12:56:24Z)
- Population-size-Aware Policy Optimization for Mean-Field Games [34.80183622480149]
We study how the optimal policies of agents evolve with the number of agents (population size) in mean-field games.
We propose Population-size-Aware Policy Optimization (PAPO), which unifies two natural options (augmentation and hypernetwork) and achieves significantly better performance.
PAPO consists of three components: i) the population-size encoding, which transforms the original value of the population size to an equivalent encoding to avoid training collapse; ii) a hypernetwork to generate a distinct policy for each game conditioned on the population size; and iii) the population size as an additional input to the generated policy.
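A hedged sketch of how those three components could fit together, assuming a log-style size encoding and a hypernetwork that emits the weights of a small linear policy head; these specifics are illustrative, not the paper's exact design.

```python
import torch
import torch.nn as nn

class PAPOSketch(nn.Module):
    """Illustrative sketch of the three components described above."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.obs_dim, self.n_actions = obs_dim, n_actions
        # ii) hypernetwork: maps the encoded population size to the weights
        #     of a per-sample linear policy head
        self.hyper = nn.Sequential(
            nn.Linear(1, hidden), nn.ReLU(),
            nn.Linear(hidden, (obs_dim + 1) * n_actions + n_actions),
        )

    @staticmethod
    def encode_population_size(n: torch.Tensor) -> torch.Tensor:
        # i) population-size encoding: rescale the raw count to a bounded value
        #    so very large populations do not destabilise training (log-scaling
        #    is an assumption; the paper's exact encoding may differ).
        return torch.log1p(n.float()).unsqueeze(-1)

    def forward(self, obs: torch.Tensor, pop_size: torch.Tensor) -> torch.Tensor:
        enc = self.encode_population_size(pop_size)               # (batch, 1)
        params = self.hyper(enc)                                  # per-sample policy weights
        w = params[:, : (self.obs_dim + 1) * self.n_actions]
        b = params[:, -self.n_actions:]
        w = w.view(-1, self.n_actions, self.obs_dim + 1)
        # iii) the population size is also an input to the generated policy
        x = torch.cat([obs, enc], dim=-1).unsqueeze(-1)           # (batch, obs_dim + 1, 1)
        return torch.bmm(w, x).squeeze(-1) + b                    # action logits

# Illustrative usage: the same observations, evaluated under different population sizes.
papo = PAPOSketch(obs_dim=10, n_actions=5)
logits = papo(torch.randn(4, 10), torch.tensor([2, 10, 100, 1000]))
```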
arXiv Detail & Related papers (2023-02-07T10:16:00Z)
- Diversity Through Exclusion (DTE): Niche Identification for Reinforcement Learning through Value-Decomposition [63.67574523750839]
We propose a generic reinforcement learning (RL) algorithm that performs better than baseline deep Q-learning algorithms in environments with multiple variably-valued niches.
We show that agents trained this way can escape poor-but-attractive local optima to instead converge to harder-to-discover higher value strategies.
arXiv Detail & Related papers (2023-02-02T16:00:19Z)
- Winner Takes It All: Training Performant RL Populations for Combinatorial Optimization [6.6765384699410095]
We argue for the benefits of learning a population of complementary policies, which can be simultaneously rolled out at inference.
We show that Poppy produces a set of complementary policies, and obtains state-of-the-art RL results on four popular NP-hard problems.
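A minimal sketch of the inference-time idea of rolling the whole population out on each instance and keeping the best solution; the `Policy` and `rollout` interfaces below are hypothetical placeholders, not Poppy's actual API.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical placeholder interfaces: `rollout(policy, instance)` is assumed
# to return (solution, objective_value) for one combinatorial problem instance.
Policy = Callable[[object], object]

def population_inference(policies: Sequence[Policy],
                         rollout: Callable[[Policy, object], Tuple[object, float]],
                         instance: object) -> Tuple[object, float]:
    """Roll out every policy in the population on one instance and keep the best.

    The population only needs to be complementary collectively: each policy can
    specialise on a subset of instances, and the max over the population is reported.
    """
    best_solution, best_value = None, float("-inf")
    for policy in policies:
        solution, value = rollout(policy, instance)
        if value > best_value:
            best_solution, best_value = solution, value
    return best_solution, best_value
```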
arXiv Detail & Related papers (2022-10-07T11:58:08Z)
- Self-Play PSRO: Toward Optimal Populations in Two-Player Zero-Sum Games [69.5064797859053]
We introduce Self-Play PSRO (SP-PSRO), a method that adds an approximately optimal policy to the population in each iteration.
SP-PSRO empirically tends to converge much faster than APSRO and in many games converges in just a few iterations.
arXiv Detail & Related papers (2022-07-13T22:55:51Z)
- Simplex Neural Population Learning: Any-Mixture Bayes-Optimality in Symmetric Zero-sum Games [36.19779736396775]
Learning to play optimally against any mixture over a diverse set of strategies is of important practical interest in competitive games.
We propose simplex-NeuPL, which satisfies two desiderata simultaneously.
We show that the resulting conditional policies incorporate prior information about their opponents effectively.
arXiv Detail & Related papers (2022-05-31T15:27:38Z)
- A State-Distribution Matching Approach to Non-Episodic Reinforcement Learning [61.406020873047794]
A major hurdle to real-world application arises from the development of algorithms in an episodic setting.
We propose a new method, MEDAL, that trains the backward policy to match the state distribution in the provided demonstrations.
Our experiments show that MEDAL matches or outperforms prior methods on three sparse-reward continuous control tasks.
arXiv Detail & Related papers (2022-05-11T00:06:29Z)
- Learning Large Neighborhood Search Policy for Integer Programming [14.089039170072084]
We propose a deep reinforcement learning (RL) method to learn a large neighborhood search (LNS) policy for integer programming (IP).
We represent all subsets by factorizing them into binary decisions on each variable.
We then design a neural network to learn policies for each variable in parallel, trained by a customized actor-critic algorithm.
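A small sketch of such a factorised subset policy, with an independent Bernoulli decision per integer variable; the per-variable features, layer sizes, and the downstream solver call are assumptions for illustration.

```python
import torch
import torch.nn as nn

class VariableSubsetPolicy(nn.Module):
    """Factorised subset policy: one Bernoulli decision per integer variable.

    Given per-variable features (e.g. current value, objective coefficient,
    constraint statistics -- illustrative assumptions), the network outputs an
    independent probability of selecting each variable for re-optimisation in
    the next large-neighborhood-search step.
    """

    def __init__(self, feat_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, var_features: torch.Tensor) -> torch.Tensor:
        # var_features: (n_vars, feat_dim) -> selection probability per variable
        return torch.sigmoid(self.net(var_features)).squeeze(-1)

# Illustrative usage: sample a subset (the "destroy" set) and hand it to an IP
# solver that re-optimises only the selected variables.
policy = VariableSubsetPolicy(feat_dim=8)
probs = policy(torch.randn(20, 8))                 # 20 integer variables
subset_mask = torch.bernoulli(probs).bool()        # True = variable is re-optimised
```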
arXiv Detail & Related papers (2021-11-01T09:10:49Z)
- Mean Field Games Flock! The Reinforcement Learning Way [34.67098179276852]
We present a method enabling a large number of agents to learn how to flock.
This is a natural behavior observed in large populations of animals.
We show numerically that our algorithm learns multi-group or high-dimensional flocking with obstacles.
arXiv Detail & Related papers (2021-05-17T15:17:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.