Winner Takes It All: Training Performant RL Populations for
Combinatorial Optimization
- URL: http://arxiv.org/abs/2210.03475v2
- Date: Mon, 13 Nov 2023 23:32:29 GMT
- Title: Winner Takes It All: Training Performant RL Populations for
Combinatorial Optimization
- Authors: Nathan Grinsztajn, Daniel Furelos-Blanco, Shikha Surana, Clément Bonnet, Thomas D. Barrett
- Abstract summary: We argue for the benefits of learning a population of complementary policies, which can be simultaneously rolled out at inference.
We show that Poppy produces a set of complementary policies, and obtains state-of-the-art RL results on four popular NP-hard problems.
- Score: 6.6765384699410095
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Applying reinforcement learning (RL) to combinatorial optimization problems
is attractive as it removes the need for expert knowledge or pre-solved
instances. However, it is unrealistic to expect an agent to solve these (often
NP-)hard problems in a single shot at inference due to their inherent
complexity. Thus, leading approaches often implement additional search
strategies, from stochastic sampling and beam search to explicit fine-tuning.
In this paper, we argue for the benefits of learning a population of
complementary policies, which can be simultaneously rolled out at inference. To
this end, we introduce Poppy, a simple training procedure for populations.
Instead of relying on a predefined or hand-crafted notion of diversity, Poppy
induces an unsupervised specialization targeted solely at maximizing the
performance of the population. We show that Poppy produces a set of
complementary policies, and obtains state-of-the-art RL results on four popular
NP-hard problems: traveling salesman, capacitated vehicle routing, 0-1
knapsack, and job-shop scheduling.
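To make the winner-takes-all training signal concrete, here is a minimal REINFORCE-style sketch assuming a population of K independent policy heads on a toy one-step problem; `rollout`, the reward, and all sizes are illustrative stand-ins rather than the paper's attention-based solvers.

```python
# Sketch: on each instance, only the best-performing policy in the population
# receives a gradient, with the runner-up's reward as its baseline.
import torch

K, N_INSTANCES = 4, 8
policies = [torch.nn.Linear(8, 3) for _ in range(K)]  # K tiny policy heads
opt = torch.optim.Adam([p for pol in policies for p in pol.parameters()], lr=1e-3)

def rollout(policy, instance):
    """One-step toy episode: sample an action, return (reward, log_prob)."""
    logits = policy(instance)
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()
    reward = -torch.abs(action.float() - instance.sum())  # arbitrary toy reward
    return reward, dist.log_prob(action)

for _ in range(10):  # training steps
    instances = torch.randn(N_INSTANCES, 8)
    loss = torch.tensor(0.0)
    for x in instances:
        results = [rollout(pol, x) for pol in policies]
        rewards = torch.stack([r for r, _ in results])
        best = int(torch.argmax(rewards))
        # Baseline: the best reward among the *other* policies, so only strict
        # winners get signal; gradients flow through the winner alone.
        others = torch.cat([rewards[:best], rewards[best + 1:]])
        advantage = (rewards[best] - others.max()).detach()
        loss = loss - advantage * results[best][1]
    opt.zero_grad()
    (loss / N_INSTANCES).backward()
    opt.step()
```

Because a policy is only rewarded for instances it wins, each member is pushed toward a niche of the instance distribution, which is the unsupervised specialization the abstract describes.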
Related papers
- Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning [55.65738319966385]
We propose a novel online algorithm, iterative Nash policy optimization (INPO).
Unlike previous methods, INPO bypasses the need for estimating the expected win rate for individual responses.
With an LLaMA-3-8B-based SFT model, INPO achieves a 42.6% length-controlled win rate on AlpacaEval 2.0 and a 37.8% win rate on Arena-Hard.
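As a rough illustration of optimizing directly from preference pairs, so that no per-response win rate is ever estimated, here is a generic DPO-style logistic loss; this is a stand-in sketch, not INPO's exact no-regret objective.

```python
# Sketch: the loss only ever sees (preferred, dispreferred) response pairs.
import torch
import torch.nn.functional as F

def pairwise_preference_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """logp_*: policy log-likelihoods of the preferred (w) / dispreferred (l)
    responses; ref_logp_*: the same under a frozen reference policy."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()

# Usage with made-up log-likelihoods:
loss = pairwise_preference_loss(
    torch.tensor([-12.3]), torch.tensor([-15.9]),
    torch.tensor([-13.0]), torch.tensor([-14.8]))
```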
arXiv Detail & Related papers (2024-06-30T08:00:34Z)
- Adversarial Imitation Learning On Aggregated Data [0.0]
Inverse Reinforcement Learning (IRL) learns an optimal policy, given some expert demonstrations, thus avoiding the need for the tedious process of specifying a suitable reward function.
We propose an approach which removes these requirements through a dynamic, adaptive method called Adversarial Imitation Learning on Aggregated Data (AILAD).
It jointly learns a nonlinear reward function and the associated optimal policy using an adversarial framework.
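A minimal sketch of the adversarial reward-learning loop described here, in the GAIL style: a discriminator separates expert from policy data, and its output serves as a learned nonlinear reward. The 6-dimensional state-action input, network sizes, and names are illustrative assumptions, not AILAD's specifics.

```python
import torch
import torch.nn.functional as F

disc = torch.nn.Sequential(torch.nn.Linear(6, 64), torch.nn.Tanh(),
                           torch.nn.Linear(64, 1))  # nonlinear learned reward
d_opt = torch.optim.Adam(disc.parameters(), lr=1e-3)

def discriminator_step(expert_sa, policy_sa):
    """Train the discriminator to separate expert from policy (s, a) pairs."""
    logits_e, logits_p = disc(expert_sa), disc(policy_sa)
    loss = (F.binary_cross_entropy_with_logits(logits_e, torch.ones_like(logits_e))
            + F.binary_cross_entropy_with_logits(logits_p, torch.zeros_like(logits_p)))
    d_opt.zero_grad()
    loss.backward()
    d_opt.step()

def learned_reward(sa):
    """Reward the policy for fooling the discriminator: log D(s, a)."""
    return F.logsigmoid(disc(sa)).detach()

discriminator_step(torch.randn(32, 6), torch.randn(32, 6))  # toy batch
rewards = learned_reward(torch.randn(32, 6))  # fed to any RL learner
```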
arXiv Detail & Related papers (2023-11-14T22:13:38Z)
- Combinatorial Optimization with Policy Adaptation using Latent Space Search [44.12073954093942]
We present a novel approach for designing performant algorithms to solve complex, typically NP-hard, problems.
We show that our search strategy outperforms state-of-the-art approaches on 11 standard benchmarking tasks.
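A minimal sketch of inference-time search over a latent space that conditions a trained policy; the cross-entropy-style search loop and the `evaluate` stand-in (think: tour length of the policy conditioned on z) are illustrative assumptions, not necessarily the paper's exact strategy.

```python
import numpy as np

rng = np.random.default_rng(0)

def evaluate(z, instance):
    """Stand-in objective: lower is better (e.g., tour length under policy(z))."""
    return np.sum((z - instance) ** 2)

def latent_search(instance, dim=16, budget=200, elite_frac=0.1, sigma=1.0):
    """Simple cross-entropy-style search over the latent space."""
    mu = np.zeros(dim)
    for _ in range(10):
        Z = mu + sigma * rng.standard_normal((budget // 10, dim))
        costs = np.array([evaluate(z, instance) for z in Z])
        elites = Z[np.argsort(costs)[: max(1, int(len(Z) * elite_frac))]]
        mu, sigma = elites.mean(0), elites.std(0).mean() + 1e-3
    return mu  # latent to condition the policy on for this instance

best_z = latent_search(rng.standard_normal(16))
```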
arXiv Detail & Related papers (2023-11-13T12:24:54Z)
- Active Ranking of Experts Based on their Performances in Many Tasks [72.96112117037465]
We consider the problem of ranking n experts based on their performances on d tasks.
We make a monotonicity assumption stating that for each pair of experts, one outperforms the other on all tasks.
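A toy numeric illustration of the monotonicity assumption: when one expert of every pair dominates the other on all d tasks, noisy pairwise comparisons on randomly drawn tasks suffice to sort the experts. The paper's active sampling and stopping rules are omitted; everything here is a stand-in.

```python
import numpy as np
from functools import cmp_to_key

rng = np.random.default_rng(1)
n, d = 5, 4
skill = np.sort(rng.random(n))               # latent expert strengths
perf = skill[:, None] + 0.1 * np.arange(d)   # i dominates j on all d tasks iff skill[i] > skill[j]

def noisy_compare(i, j, samples=50):
    """Estimate who wins from noisy performances on randomly drawn tasks."""
    t = rng.integers(d, size=samples)
    wins = (perf[i, t] + 0.3 * rng.standard_normal(samples)
            > perf[j, t] + 0.3 * rng.standard_normal(samples))
    return 1 if wins.mean() > 0.5 else -1

ranking = sorted(range(n), key=cmp_to_key(noisy_compare))
print(ranking)  # approximately [0, 1, 2, 3, 4]; close pairs may flip under noise
```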
arXiv Detail & Related papers (2023-06-05T06:55:39Z)
- Reinforcement Learning for Branch-and-Bound Optimisation using Retrospective Trajectories [72.15369769265398]
Machine learning has emerged as a promising paradigm for branching.
We propose retro branching, a simple yet effective approach to RL for branching.
We outperform the current state-of-the-art RL branching algorithm by 3-5x and come within 20% of the best IL method's performance on MILPs with 500 constraints and 1000 variables.
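A minimal sketch of one way to extract retrospective trajectories from a finished branch-and-bound search tree: decompose it into root-to-leaf paths and treat each as a short training trajectory. The toy tree and the exact decomposition rule are illustrative assumptions rather than the paper's construction.

```python
tree = {0: [1, 2], 1: [3, 4], 2: [], 3: [], 4: [5], 5: []}  # node -> children

def retrospective_trajectories(tree, root=0):
    """Depth-first decomposition of a search tree into root-to-leaf paths."""
    paths, stack = [], [(root, [root])]
    while stack:
        node, path = stack.pop()
        if not tree[node]:            # leaf: close one retrospective trajectory
            paths.append(path)
        for child in tree[node]:
            stack.append((child, path + [child]))
    return paths

print(retrospective_trajectories(tree))
# [[0, 2], [0, 1, 4, 5], [0, 1, 3]]
```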
arXiv Detail & Related papers (2022-05-28T06:08:07Z)
- Jump-Start Reinforcement Learning [68.82380421479675]
We present a meta algorithm that can use offline data, demonstrations, or a pre-existing policy to initialize an RL policy.
In particular, we propose Jump-Start Reinforcement Learning (JSRL), an algorithm that employs two policies to solve tasks.
We show via experiments that JSRL is able to significantly outperform existing imitation and reinforcement learning algorithms.
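A minimal sketch of the two-policy rollout: a guide policy acts for the first h steps of each episode, the learning policy takes over afterwards, and h is annealed toward zero as returns improve. The environment, policies, and threshold below are toy stand-ins.

```python
import random

def rollout(env_step, reset, guide, learner, h, horizon=100):
    """Guide policy acts for the first h steps, learner for the rest."""
    state, ret = reset(), 0.0
    for t in range(horizon):
        action = guide(state) if t < h else learner(state)
        state, reward, done = env_step(state, action)
        ret += reward
        if done:
            break
    return ret

# Curriculum: start fully guided, then shrink the jump-start prefix whenever
# returns clear a (hypothetical) threshold, handing more steps to the learner.
h, threshold = 100, 0.0
for _ in range(50):
    ret = rollout(env_step=lambda s, a: (s, random.random() - 0.4, False),
                  reset=lambda: 0, guide=lambda s: 0, learner=lambda s: 1, h=h)
    if ret > threshold:
        h = max(0, h - 10)
```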
arXiv Detail & Related papers (2022-04-05T17:25:22Z)
- NeuPL: Neural Population Learning [37.02099221741667]
Learning in strategy games requires the discovery of diverse policies.
This is often achieved by iteratively training new policies against existing ones, growing a policy population that is robust to exploitation.
This iterative approach suffers from two issues in real-world games: a) under a finite budget, the approximate best-response operator at each iteration needs truncating, leaving under-trained responses in the population; b) repeatedly relearning basic skills at each iteration is wasteful and becomes intractable against increasingly strong opponents.
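For reference, the iterative best-response loop described above in skeleton form; `Policy` and `train_best_response` are stand-ins, and the finite `budget` argument marks exactly where the truncation issue (a) arises.

```python
import random

class Policy:
    def __init__(self, name): self.name = name
    def act(self, state): return random.choice([0, 1])

def train_best_response(opponents, budget):
    # Stand-in: with a finite budget this operator is truncated, which is
    # exactly the under-training issue the summary points out.
    return Policy(f"br_vs_{len(opponents)}")

population = [Policy("init")]
for it in range(4):  # each iteration grows the population by one policy
    population.append(train_best_response(population, budget=1000))
print([p.name for p in population])
```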
arXiv Detail & Related papers (2022-02-15T14:05:18Z)
- Can Q-learning solve Multi Armed Bantids? [0.0]
We show that current reinforcement learning algorithms are not capable of solving Multi-Armed-Bandit problems.
This stems from variance differences between policies, which causes two problems.
We propose the Adaptive Symmetric Reward Noising (ASRN) method, which equalizes the reward variance across different policies.
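A minimal sketch of symmetric reward noising: add zero-mean Gaussian noise per arm so that every arm reaches the variance of the noisiest one. The adaptive scheduling of the actual method is omitted, and the per-arm variances are assumed known for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)
arm_std = np.array([0.1, 1.0, 0.5])   # per-arm reward noise levels (assumed known)
target_var = np.max(arm_std ** 2)     # equalize up to the noisiest arm

def noised_reward(arm, raw_reward):
    """Top up each arm's variance so all arms end at target_var."""
    extra_var = target_var - arm_std[arm] ** 2
    return raw_reward + rng.normal(0.0, np.sqrt(extra_var))

# After noising, variance differences no longer bias value estimates
# toward low-variance arms.
r = noised_reward(0, raw_reward=0.3)
```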
arXiv Detail & Related papers (2021-10-21T07:08:30Z)
- Explore and Control with Adversarial Surprise [78.41972292110967]
Reinforcement learning (RL) provides a framework for learning goal-directed policies given user-specified rewards.
We propose a new unsupervised RL technique based on an adversarial game which pits two policies against each other to compete over the amount of surprise an RL agent experiences.
We show that our method leads to the emergence of complex skills, as evidenced by clear phase transitions.
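A minimal sketch of the zero-sum surprise signal: measure surprise as an observation's negative log-likelihood under a running density model and give the two policies opposite rewards. The one-dimensional Gaussian model is an illustrative stand-in for whatever model the agent maintains.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, var, count = 0.0, 1.0, 1  # running Gaussian model over observations

def surprise(obs):
    """Negative log-likelihood of obs, then update the model online."""
    global mu, var, count
    s = 0.5 * (np.log(2 * np.pi * var) + (obs - mu) ** 2 / var)  # -log p(obs)
    count += 1
    delta = obs - mu
    mu += delta / count
    var += (delta * (obs - mu) - var) / count  # incremental variance update
    return s

obs = rng.standard_normal()
s = surprise(obs)
r_explorer, r_controller = s, -s  # zero-sum: one seeks surprise, one suppresses it
```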
arXiv Detail & Related papers (2021-07-12T17:58:40Z)
- SUNRISE: A Simple Unified Framework for Ensemble Learning in Deep Reinforcement Learning [102.78958681141577]
We present SUNRISE, a simple unified ensemble method, which is compatible with various off-policy deep reinforcement learning algorithms.
SUNRISE integrates two key ingredients: (a) ensemble-based weighted Bellman backups, which re-weight target Q-values based on uncertainty estimates from a Q-ensemble, and (b) an inference method that selects actions using the highest upper-confidence bounds for efficient exploration.
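Both ingredients admit a compact sketch on a toy discrete-action Q-ensemble; the sigmoid weighting form follows the paper's backup, while the sizes and constants here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N_ENS, N_ACT = 5, 3
q_target = rng.standard_normal((N_ENS, N_ACT))  # ensemble target-Q, one next state

def bellman_weight(q_target, action, temperature=10.0):
    """(a) w = sigmoid(-std * T) + 0.5, in [0.5, 1.0]: targets where the
    ensemble disagrees contribute less to the Bellman error."""
    std = q_target[:, action].std()
    return 1.0 / (1.0 + np.exp(std * temperature)) + 0.5

def ucb_action(q_values, lam=1.0):
    """(b) optimistic exploration: argmax of ensemble mean + lam * std."""
    return int(np.argmax(q_values.mean(0) + lam * q_values.std(0)))

a = ucb_action(q_target)
w = bellman_weight(q_target, a)
```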
arXiv Detail & Related papers (2020-07-09T17:08:44Z)