AlphaZeroES: Direct score maximization outperforms planning loss minimization
- URL: http://arxiv.org/abs/2406.08687v1
- Date: Wed, 12 Jun 2024 23:00:59 GMT
- Title: AlphaZeroES: Direct score maximization outperforms planning loss minimization
- Authors: Carlos Martin, Tuomas Sandholm
- Abstract summary: Planning at execution time has been shown to dramatically improve performance for agents in both single-agent and multi-agent settings.
A family of approaches to planning at execution time is AlphaZero and its variants, which use Monte Carlo Tree Search together with a neural network that guides the search by predicting state values and action probabilities.
We show that, across multiple environments, directly maximizing the episode score outperforms minimizing the planning loss.
- Score: 61.17702187957206
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Planning at execution time has been shown to dramatically improve performance for agents in both single-agent and multi-agent settings. A well-known family of approaches to planning at execution time are AlphaZero and its variants, which use Monte Carlo Tree Search together with a neural network that guides the search by predicting state values and action probabilities. AlphaZero trains these networks by minimizing a planning loss that makes the value prediction match the episode return, and the policy prediction at the root of the search tree match the output of the full tree expansion. AlphaZero has been applied to both single-agent environments (such as Sokoban) and multi-agent environments (such as chess and Go) with great success. In this paper, we explore an intriguing question: In single-agent environments, can we outperform AlphaZero by directly maximizing the episode score instead of minimizing this planning loss, while leaving the MCTS algorithm and neural architecture unchanged? To directly maximize the episode score, we use evolution strategies, a family of algorithms for zeroth-order blackbox optimization. Our experiments indicate that, across multiple environments, directly maximizing the episode score outperforms minimizing the planning loss.
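To make the training contrast concrete, the sketch below illustrates the kind of evolution-strategies update the abstract describes: the MCTS-guided agent is treated as a black box that maps network parameters to an episode score, and that score is maximized directly with antithetic Gaussian perturbations instead of descending the planning loss. This is a minimal illustration under stated assumptions; the names (`episode_score`, `es_maximize_score`) and the hyperparameters are illustrative, not the authors' implementation, and the toy objective stands in for a real environment rollout with MCTS.

```python
import numpy as np

def episode_score(params, env_seed):
    """Stand-in objective. In the paper's setting this would run one full episode
    with the MCTS-guided agent whose network weights are `params` and return the
    (generally non-differentiable) episode score; a smooth toy function keeps the
    sketch self-contained and runnable."""
    rng = np.random.default_rng(env_seed)
    return -float(np.sum(params ** 2)) + 0.01 * rng.standard_normal()

def es_maximize_score(theta, iterations=200, pop_size=32, sigma=0.1, lr=0.05, seed=0):
    """Zeroth-order blackbox maximization of the episode score with antithetic
    Gaussian perturbations (an OpenAI-style evolution-strategies estimator).
    For contrast, AlphaZero-style training would instead minimize a planning loss
    of the form (value - return)^2 plus a cross-entropy term matching the network's
    policy output to the root MCTS visit distribution."""
    rng = np.random.default_rng(seed)
    for t in range(iterations):
        eps = rng.standard_normal((pop_size, theta.size))
        # Common random numbers: the same env_seed is reused within one iteration.
        s_pos = np.array([episode_score(theta + sigma * e, env_seed=t) for e in eps])
        s_neg = np.array([episode_score(theta - sigma * e, env_seed=t) for e in eps])
        adv = s_pos - s_neg
        adv = (adv - adv.mean()) / (adv.std() + 1e-8)  # normalize for robustness
        grad = (eps * adv[:, None]).sum(axis=0) / (2 * pop_size * sigma)
        theta = theta + lr * grad  # ascend the estimated score gradient
    return theta

if __name__ == "__main__":
    theta0 = np.random.default_rng(1).standard_normal(16)
    theta_final = es_maximize_score(theta0)
    print(episode_score(theta_final, env_seed=0))  # far closer to the maximum (~0) than at theta0
```

Because the episode score produced by an agent that searches at every decision point is not straightforwardly differentiable with respect to the network weights, a zeroth-order method of this kind is a natural fit for the direct-maximization objective.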
Related papers
- Bandits with Preference Feedback: A Stackelberg Game Perspective [41.928798759636216]
Bandits with preference feedback present a powerful tool for optimizing unknown target functions.
We propose MAXMINLCB, which emulates a zero-sum Stackelberg game, and chooses action pairs that are informative and yield favorable rewards.
arXiv Detail & Related papers (2024-06-24T15:53:11Z) - Self-Adjusting Evolutionary Algorithms Are Slow on Multimodal Landscapes [0.0]
We show that the positive results do not extend to other types of local optima.
On the distorted OneMax benchmark, the self-adjusting $(1,\lambda)$-EA is slowed down just as elitist algorithms are, because self-adaptation prevents the algorithm from escaping from local optima.
arXiv Detail & Related papers (2024-04-18T10:01:08Z) - Pruning Convolutional Filters via Reinforcement Learning with Entropy Minimization [0.0]
We introduce a novel information-theoretic reward function which minimizes the spatial entropy of convolutional activations.
Our method shows that accuracy can be preserved without directly optimizing for it in the agent's reward function.
arXiv Detail & Related papers (2023-12-08T09:34:57Z) - Tree-Planner: Efficient Close-loop Task Planning with Large Language Models [63.06270302774049]
Tree-Planner reframes task planning with Large Language Models into three distinct phases.
Tree-Planner achieves state-of-the-art performance while maintaining high efficiency.
arXiv Detail & Related papers (2023-10-12T17:59:50Z) - AI planning in the imagination: High-level planning on learned abstract search spaces [68.75684174531962]
We propose a new method, called PiZero, that gives an agent the ability to plan in an abstract search space that the agent learns during training.
We evaluate our method on multiple domains, including the traveling salesman problem, Sokoban, 2048, the facility location problem, and Pacman.
arXiv Detail & Related papers (2023-08-16T22:47:16Z) - Policy-Based Self-Competition for Planning Problems [0.0]
We use self-competition to transform a single-player task into one with a binary outcome.
We show the effectiveness of our approach in two well-known optimization problems.
arXiv Detail & Related papers (2023-06-07T13:02:24Z) - A Differentiable Loss Function for Learning Heuristics in A* [0.0]
This paper argues that this does not necessarily lead to a faster A* search, since its execution relies on relative values rather than absolute ones.
As a mitigation, we propose an L* loss, which upper-bounds the number of excessively expanded states inside the A* search.
The L* loss, when used to optimize state-of-the-art deep neural networks for automated planning in maze domains such as Sokoban and mazes with teleports, significantly improves the fraction of solved problems and the quality of the found plans, and reduces the number of expanded states to approximately 50%.
arXiv Detail & Related papers (2022-09-12T12:43:05Z) - COPS: Controlled Pruning Before Training Starts [68.8204255655161]
State-of-the-art deep neural network (DNN) pruning techniques, applied one-shot before training starts, evaluate sparse architectures with the help of a single criterion -- called pruning score.
In this work we do not concentrate on a single pruning criterion, but provide a framework for combining arbitrary GSSs to create more powerful pruning strategies.
arXiv Detail & Related papers (2021-07-27T08:48:01Z) - Community detection using fast low-cardinality semidefinite programming [94.4878715085334]
We propose a new low-cardinality algorithm that generalizes the local update to maximize a semidefinite relaxation derived from Leiden-k-cut.
The proposed algorithm is scalable, outperforms state-of-the-art algorithms, and does so with little additional cost in real-world running time.
arXiv Detail & Related papers (2020-12-04T15:46:30Z) - MLE-guided parameter search for task loss minimization in neural sequence modeling [83.83249536279239]
Neural autoregressive sequence models are used to generate sequences in a variety of natural language processing (NLP) tasks.
We propose maximum likelihood guided parameter search (MGS), which samples from a distribution over update directions that is a mixture of random search around the current parameters and around the maximum likelihood gradient.
Our experiments show that MGS is capable of optimizing sequence-level losses, with substantial reductions in repetition and non-termination in sequence completion, and similar improvements to those of minimum risk training in machine translation.
arXiv Detail & Related papers (2020-06-04T22:21:22Z)
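The MGS entry directly above describes a concrete mechanism, so here is a loose, illustrative sketch of that idea under stated assumptions: candidate update directions are drawn from a mixture of random search around the current parameters and random search around the maximum-likelihood gradient step, and the candidate with the lowest sequence-level task loss is kept. The function names and the keep-the-best selection rule are simplifying assumptions, not the authors' exact algorithm.

```python
import numpy as np

def mgs_style_step(theta, task_loss, mle_grad, n_candidates=8,
                   sigma=0.01, lr=0.1, p_mle=0.5, seed=0):
    """One sketched MGS-style update. `task_loss(theta)` returns the
    sequence-level task loss and `mle_grad` is the maximum-likelihood gradient
    at `theta`; both are assumed to be supplied by the caller."""
    rng = np.random.default_rng(seed)
    best_theta, best_loss = theta, task_loss(theta)
    for _ in range(n_candidates):
        noise = sigma * rng.standard_normal(theta.size)
        # Mixture: plain random search vs. random search around the MLE step.
        delta = noise if rng.random() > p_mle else -lr * mle_grad + noise
        candidate = theta + delta
        loss = task_loss(candidate)
        if loss < best_loss:
            best_theta, best_loss = candidate, loss
    return best_theta, best_loss

if __name__ == "__main__":
    # Toy demonstration with a quadratic "task loss" whose gradient doubles as
    # the stand-in for the maximum-likelihood gradient.
    loss = lambda p: float(np.sum(p ** 2))
    theta = np.ones(4)
    theta, val = mgs_style_step(theta, loss, mle_grad=2 * theta)
    print(val)  # no larger than the starting loss of 4.0
```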
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information (including all generated content) and is not responsible for any consequences of its use.