Policy-Based Self-Competition for Planning Problems
- URL: http://arxiv.org/abs/2306.04403v1
- Date: Wed, 7 Jun 2023 13:02:24 GMT
- Title: Policy-Based Self-Competition for Planning Problems
- Authors: Jonathan Pirnay, Quirin Göttl, Jakob Burger, Dominik Gerhard Grimm
- Abstract summary: We leverage self-competition to incorporate a historical policy directly into the planning process, rather than only a scalar performance baseline.
We show the effectiveness of our approach in two well-known combinatorial optimization problems.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: AlphaZero-type algorithms may stop improving on single-player tasks in case
the value network guiding the tree search is unable to approximate the outcome
of an episode sufficiently well. One technique to address this problem is
transforming the single-player task through self-competition. The main idea is
to compute a scalar baseline from the agent's historical performances and to
reshape an episode's reward into a binary output, indicating whether the
baseline has been exceeded or not. However, this baseline only carries limited
information for the agent about how to improve its strategy. We leverage the idea
of self-competition and directly incorporate a historical policy into the
planning process instead of its scalar performance. Based on the recently
introduced Gumbel AlphaZero (GAZ), we propose our algorithm GAZ 'Play-to-Plan'
(GAZ PTP), in which the agent learns to find strong trajectories by planning
against possible strategies of its past self. We show the effectiveness of our
approach in two well-known combinatorial optimization problems, the Traveling
Salesman Problem and the Job-Shop Scheduling Problem. With only half of the
simulation budget for search, GAZ PTP consistently outperforms all selected
single-player variants of GAZ.
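To make the self-competition mechanism concrete, the sketch below compares an episode's return against a baseline produced by a frozen copy of the agent's past self and reshapes the result into a binary win/lose reward, as described in the abstract. The environment interface and function names are illustrative assumptions, not the authors' implementation; GAZ PTP itself goes a step further and lets the tree search plan directly against the historical policy rather than against its scalar return.

```python
# Minimal sketch of self-competition with a scalar baseline (hypothetical API).
from typing import Callable

def episode_return(policy: Callable, env) -> float:
    """Roll out one episode with `policy` and return its scalar outcome."""
    state, done, total = env.reset(), False, 0.0
    while not done:
        action = policy(state)                  # e.g. greedy action of a network
        state, reward, done = env.step(action)
        total += reward
    return total

def self_competition_reward(current_policy, historical_policy, env) -> float:
    """Binary reshaped reward: +1 if the past self's baseline is exceeded, -1 otherwise."""
    baseline = episode_return(historical_policy, env)
    outcome = episode_return(current_policy, env)
    return 1.0 if outcome > baseline else -1.0
```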
Related papers
- Parallel Strategies for Best-First Generalized Planning [51.713634067802104]
Generalized planning (GP) is a research area of AI that studies the automated synthesis of algorithm-like solutions capable of solving multiple classical planning instances.
One of the current advancements has been the introduction of Best-First Generalized Planning (BFGP), a GP algorithm based on a novel solution space that can be explored with search.
This paper evaluates the application of parallel search techniques to BFGP, another critical component in closing the performance gap.
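BFGP explores a space of candidate algorithm-like solutions with best-first search; as a rough, generic illustration of that search skeleton (not the paper's code, and without the parallel strategies it evaluates), one might write:

```python
# Generic best-first search over an abstract solution space (illustrative only).
import heapq
from typing import Callable, Iterable, Optional, TypeVar

Node = TypeVar("Node")

def best_first_search(root: Node,
                      expand: Callable[[Node], Iterable[Node]],
                      evaluate: Callable[[Node], float],
                      is_goal: Callable[[Node], bool]) -> Optional[Node]:
    """Pop the most promising node (lowest evaluation), stop at the first goal."""
    counter = 0                                  # tie-breaker so nodes are never compared
    frontier = [(evaluate(root), counter, root)]
    while frontier:
        _, _, node = heapq.heappop(frontier)
        if is_goal(node):
            return node
        for child in expand(node):
            counter += 1
            heapq.heappush(frontier, (evaluate(child), counter, child))
    return None
```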
arXiv Detail & Related papers (2024-07-31T09:50:22Z)
- AlphaZeroES: Direct score maximization outperforms planning loss minimization [61.17702187957206]
Planning at execution time has been shown to dramatically improve performance for agents in both single-agent and multi-agent settings.
A family of approaches to planning at execution time are AlphaZero and its variants, which use Monte Carlo Tree Search together with a neural network that guides the search by predicting state values and action probabilities.
We show that, across multiple environments, directly maximizing the episode score outperforms minimizing the planning loss.
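One way to read "directly maximizing the episode score" is as a black-box update that pushes the network parameters toward perturbations achieving higher scores; the evolution-strategies-style step below is a hedged sketch under that reading, with `episode_score` and all hyperparameters as assumptions rather than the paper's actual procedure.

```python
# One evolution-strategies-style parameter update toward higher episode scores (sketch).
import numpy as np
from typing import Callable, Optional

def es_step(params: np.ndarray,
            episode_score: Callable[[np.ndarray], float],
            pop_size: int = 16,
            sigma: float = 0.1,
            lr: float = 0.01,
            rng: Optional[np.random.Generator] = None) -> np.ndarray:
    """Estimate a score gradient from random perturbations and take a small step."""
    rng = rng or np.random.default_rng(0)
    noise = rng.standard_normal((pop_size, params.size))
    scores = np.array([episode_score(params + sigma * eps) for eps in noise])
    scores = (scores - scores.mean()) / (scores.std() + 1e-8)  # normalize for stability
    grad_estimate = noise.T @ scores / (pop_size * sigma)
    return params + lr * grad_estimate
```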
arXiv Detail & Related papers (2024-06-12T23:00:59Z)
- Scale-Adaptive Balancing of Exploration and Exploitation in Classical Planning [1.6574413179773757]
We show that a more detailed theoretical understanding of the MAB literature helps improve existing planning algorithms.
We propose GreedyUCT-Normal, a MCTS/THTS algorithm with UCB1-Normal bandit for agile classical planning.
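The UCB1-Normal bandit referenced above adds a variance-aware exploration bonus to the sample mean; the sketch below states the index as it is commonly given in the MAB literature, with bookkeeping names that are assumptions of this illustration.

```python
# UCB1-Normal index (variance-aware exploration bonus); illustrative bookkeeping only.
import math

def ucb1_normal_index(mean: float, sq_sum: float, pulls: int, total_pulls: int) -> float:
    """Sample mean plus an exploration bonus built from the empirical variance."""
    variance_term = max((sq_sum - pulls * mean ** 2) / (pulls - 1), 0.0)
    return mean + math.sqrt(16.0 * variance_term * math.log(total_pulls - 1) / pulls)

def select_arm(means, sq_sums, pulls, total_pulls):
    """Force-play under-sampled arms, otherwise pick the arm with the largest index."""
    for j, n_j in enumerate(pulls):
        if n_j < max(2, math.ceil(8 * math.log(total_pulls))):
            return j
    return max(range(len(means)),
               key=lambda j: ucb1_normal_index(means[j], sq_sums[j], pulls[j], total_pulls))
```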
arXiv Detail & Related papers (2023-05-16T22:46:37Z)
- Distributed Task Management in Fog Computing: A Socially Concave Bandit Game [7.708904950194129]
Fog computing leverages the task offloading capabilities at the network's edge to improve efficiency and enable swift responses to application demands.
We formulate the distributed task allocation problem as a social-concave game with bandit feedback.
We develop two no-regret online decision-making strategies.
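The entry mentions no-regret online decision-making under bandit feedback; as a generic illustration of that feedback model (and explicitly not the strategies developed in the paper), here is a minimal EXP3 learner. Each agent only observes the reward of the action it actually played, and the importance-weighted update keeps the regret sublinear.

```python
# Minimal EXP3 learner for bandit feedback (generic illustration).
import numpy as np

class Exp3:
    def __init__(self, n_actions: int, eta: float = 0.05, seed: int = 0):
        self.log_weights = np.zeros(n_actions)   # log-domain weights for stability
        self.eta = eta
        self.rng = np.random.default_rng(seed)

    def probabilities(self) -> np.ndarray:
        w = np.exp(self.log_weights - self.log_weights.max())
        return w / w.sum()

    def act(self) -> int:
        return int(self.rng.choice(len(self.log_weights), p=self.probabilities()))

    def update(self, action: int, reward: float) -> None:
        """Importance-weighted update: only the played action's reward is observed."""
        p = self.probabilities()[action]
        self.log_weights[action] += self.eta * reward / p
```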
arXiv Detail & Related papers (2022-03-28T08:26:14Z)
- Planning and Learning with Adaptive Lookahead [74.39132848733847]
The Policy Iteration (PI) algorithm alternates between greedy one-step policy improvement and policy evaluation.
Recent literature shows that multi-step lookahead policy improvement leads to a better convergence rate at the expense of increased complexity per iteration.
We propose for the first time to dynamically adapt the multi-step lookahead horizon as a function of the state and of the value estimate.
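A hedged sketch of multi-step lookahead policy improvement with a state-dependent depth is shown below; the `choose_horizon` rule is a placeholder assumption, since how to adapt the horizon is precisely the paper's contribution, and a deterministic transition table is assumed for brevity.

```python
# Multi-step lookahead greedy improvement with a state-dependent horizon (sketch).
# `transitions[state][action] == (reward, next_state)` in a deterministic MDP.

def lookahead_value(state, depth, value, transitions, gamma=0.99):
    """Best return achievable over `depth` steps, bootstrapping with `value` at the leaves."""
    if depth == 0:
        return value[state]
    return max(r + gamma * lookahead_value(s2, depth - 1, value, transitions, gamma)
               for r, s2 in transitions[state].values())

def improved_action(state, value, transitions, choose_horizon, gamma=0.99):
    """Greedy action under an h-step lookahead, where h may depend on the state."""
    h = choose_horizon(state, value)  # placeholder: e.g. deeper where the estimate is shaky
    return max(transitions[state],
               key=lambda a: transitions[state][a][0]
               + gamma * lookahead_value(transitions[state][a][1], h - 1,
                                         value, transitions, gamma))
```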
arXiv Detail & Related papers (2022-01-28T20:26:55Z)
- COPS: Controlled Pruning Before Training Starts [68.8204255655161]
State-of-the-art deep neural network (DNN) pruning techniques, applied one-shot before training starts, evaluate sparse architectures with the help of a single criterion -- called pruning score.
In this work we do not concentrate on a single pruning criterion, but provide a framework for combining arbitrary GSSs to create more powerful pruning strategies.
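As a simplified stand-in for combining several pruning scores before training (not the framework actually proposed in the paper), the sketch below mixes a magnitude score with a SNIP-style gradient saliency and keeps the top-scoring weights:

```python
# One-shot pruning mask from a weighted combination of two saliency scores (sketch).
import numpy as np

def combined_pruning_mask(weight: np.ndarray, grad: np.ndarray,
                          sparsity: float, alpha: float = 0.5) -> np.ndarray:
    """Keep the top (1 - sparsity) fraction of weights under a combined saliency score."""
    def normalize(score: np.ndarray) -> np.ndarray:
        return score / (score.sum() + 1e-12)     # make the two scores comparable

    magnitude = normalize(np.abs(weight))        # classic magnitude score
    saliency = normalize(np.abs(weight * grad))  # SNIP-style connection sensitivity
    combined = alpha * magnitude + (1.0 - alpha) * saliency
    k = int(combined.size * (1.0 - sparsity))    # number of weights to keep
    if k == 0:
        return np.zeros_like(weight)
    threshold = np.sort(combined, axis=None)[-k]
    return (combined >= threshold).astype(weight.dtype)
```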
arXiv Detail & Related papers (2021-07-27T08:48:01Z)
- Online Apprenticeship Learning [58.45089581278177]
In Apprenticeship Learning (AL), we are given a Markov Decision Process (MDP) without access to the cost function.
The goal is to find a policy that matches the expert's performance on some predefined set of cost functions.
We show that the OAL problem can be effectively solved by combining two mirror descent based no-regret algorithms.
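The "mirror descent based no-regret" building block can be illustrated with an entropy-regularized (multiplicative-weights) step on the probability simplex; the pairing of the policy player and the cost player that OAL relies on is omitted here, and the function names are assumptions. Run symmetrically by the two players, such updates provide the no-regret guarantees that the combination exploits.

```python
# One mirror-descent step on the probability simplex (illustrative building block).
import numpy as np

def mirror_descent_simplex_step(weights: np.ndarray, gradient: np.ndarray,
                                eta: float = 0.1) -> np.ndarray:
    """Entropy-regularized mirror descent: multiplicative update, then renormalize."""
    updated = weights * np.exp(-eta * gradient)
    return updated / updated.sum()
```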
arXiv Detail & Related papers (2021-02-13T12:57:51Z)
- Online Markov Decision Processes with Aggregate Bandit Feedback [74.85532145498742]
We study a novel variant of online finite-horizon Markov Decision Processes with adversarially changing loss functions and initially unknown dynamics.
In each episode, the learner suffers the loss accumulated along the trajectory realized by the policy chosen for the episode, and observes aggregate bandit feedback.
Our main result is a computationally efficient algorithm with $O(\sqrt{K})$ regret for this setting, where $K$ is the number of episodes.
arXiv Detail & Related papers (2021-01-31T16:49:07Z)
- An Efficient Algorithm for Cooperative Semi-Bandits [0.0]
We introduce Coop-FTPL, a cooperative version of the well-known Follow The Perturbed Leader algorithm.
We show that the expected regret of our algorithm after $T$ time steps is of order $QT\log(k)(k\alpha_1/Q + m)$, where $Q$ is the total activation probability mass.
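Coop-FTPL builds on Follow The Perturbed Leader; the single-agent sketch below shows only the FTPL core (perturb the cumulative loss estimates, then follow the leader), leaving out the cooperative loss sharing and the semi-bandit estimation step, with the perturbation scale as an illustrative assumption.

```python
# Minimal single-agent Follow-The-Perturbed-Leader sketch (illustrative only).
import numpy as np

class FTPL:
    def __init__(self, n_actions: int, scale: float = 1.0, seed: int = 0):
        self.cum_loss = np.zeros(n_actions)   # cumulative (estimated) losses
        self.scale = scale
        self.rng = np.random.default_rng(seed)

    def act(self) -> int:
        """Follow the leader of the randomly perturbed cumulative losses."""
        perturbation = self.rng.exponential(self.scale, size=self.cum_loss.shape)
        return int(np.argmin(self.cum_loss - perturbation))

    def update(self, loss_estimates: np.ndarray) -> None:
        self.cum_loss += loss_estimates
```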
arXiv Detail & Related papers (2020-10-05T07:08:26Z)