Multi-armed Bandit Learning on a Graph
- URL: http://arxiv.org/abs/2209.09419v4
- Date: Mon, 20 Mar 2023 17:06:12 GMT
- Title: Multi-armed Bandit Learning on a Graph
- Authors: Tianpeng Zhang (1), Kasper Johansson (2), Na Li (1) ((1) Harvard University, (2) Stanford University)
- Abstract summary: We study an extension of MAB called the graph bandit, where an agent travels over a graph to maximize the reward collected from different nodes.
We design a learning algorithm, G-UCB, that balances long-term exploration-exploitation using the principle of optimism.
Our proposed algorithm achieves $O(\sqrt{|S|T\log(T)}+D|S|\log T)$ learning regret, where $|S|$ is the number of nodes and $D$ is the diameter of the graph.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The multi-armed bandit (MAB) problem is a simple yet powerful framework that
has been extensively studied in the context of decision-making under
uncertainty. In many real-world applications, such as robotics, selecting an
arm corresponds to a physical action that constrains the choices of the next
available arms (actions). Motivated by this, we study an extension
of MAB called the graph bandit, where an agent travels over a graph to maximize
the reward collected from different nodes. The graph defines the agent's
freedom in selecting the next available nodes at each step. We assume the graph
structure is fully available, but the reward distributions are unknown. Building
on an offline graph-based planning algorithm and the principle of optimism, we
design a learning algorithm, G-UCB, that balances long-term exploration and
exploitation. We show that our
proposed algorithm achieves $O(\sqrt{|S|T\log(T)}+D|S|\log T)$ learning regret,
where $|S|$ is the number of nodes and $D$ is the diameter of the graph, which
matches the theoretical lower bound $\Omega(\sqrt{|S|T})$ up to logarithmic
factors. To our knowledge, this result is among the first tight regret bounds
in non-episodic, un-discounted learning problems with known deterministic
transitions. Numerical experiments confirm that our algorithm outperforms
several benchmarks.
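The optimism-based procedure the abstract describes can be sketched roughly as follows. This is a minimal illustration under simplifying assumptions, not the paper's G-UCB: the BFS planner, the UCB index with constant `c`, and the doubling stay rule are stand-ins for the paper's offline planning step, confidence radii, and episode schedule.

```python
import math
from collections import deque

def shortest_path(adj, src, dst):
    # BFS shortest path on an unweighted, connected graph.
    prev = {src: None}
    q = deque([src])
    while q:
        u = q.popleft()
        if u == dst:
            path = []
            while u is not None:
                path.append(u)
                u = prev[u]
            return path[::-1]
        for v in adj[u]:
            if v not in prev:
                prev[v] = u
                q.append(v)
    raise ValueError("graph is not connected")

def graph_ucb(adj, pull, start, horizon, c=2.0):
    """Sketch of optimism-based learning on a graph bandit.

    adj: dict node -> list of neighbours; pull: node -> stochastic reward.
    The agent repeatedly travels to the node with the highest UCB index,
    collecting rewards at every node it visits along the way, then samples
    that node until its visit count doubles.
    """
    counts = {s: 0 for s in adj}
    sums = {s: 0.0 for s in adj}
    pos, t, total = start, 0, 0.0
    while t < horizon:
        # UCB index: empirical mean + exploration bonus (inf if unvisited).
        def ucb(s):
            if counts[s] == 0:
                return float("inf")
            return sums[s] / counts[s] + math.sqrt(c * math.log(t + 1) / counts[s])
        target = max(adj, key=ucb)
        # Travel along a shortest path, collecting rewards on the way.
        for s in shortest_path(adj, pos, target)[1:]:
            if t >= horizon:
                break
            r = pull(s)
            counts[s] += 1; sums[s] += r; total += r; t += 1
        pos = target
        # Stay at the target until its visit count roughly doubles.
        for _ in range(max(1, counts[target])):
            if t >= horizon:
                break
            r = pull(target)
            counts[target] += 1; sums[target] += r; total += r; t += 1
    return total, counts
```

On a small line graph with deterministic rewards, the agent concentrates its visits on the best node while paying the travel cost through intermediate nodes, which is the source of the extra $D|S|\log T$ term in the regret bound.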
Related papers
- Learning-Based Algorithms for Graph Searching Problems [6.923787372512553]
We consider the problem of graph searching with prediction recently introduced by Banerjee et al.
In this problem, an agent, starting at some node $r$, has to traverse a (potentially unknown) graph $G$ to find a hidden goal node $g$.
We design algorithms for this search task on unknown graphs.
arXiv Detail & Related papers (2024-02-27T18:12:58Z)
- Stochastic contextual bandits with graph feedback: from independence number to MAS number [28.10186884181921]
We consider contextual bandits with graph feedback.
Unlike the multi-armed bandits setting, much remains unexplored in the contextual bandits counterpart.
We provide algorithms that achieve near-optimal regret for important classes of context sequences and/or feedback graphs.
arXiv Detail & Related papers (2024-02-12T06:56:13Z)
- Improved High-Probability Regret for Adversarial Bandits with Time-Varying Feedback Graphs [62.52390282012508]
We study high-probability regret bounds for adversarial $K$-armed bandits with time-varying feedback graphs over $T$ rounds.
We develop an algorithm that achieves the optimal regret $\widetilde{\mathcal{O}}\big((\sum_{t=1}^{T}\alpha_t)^{1/2}+\max_{t\in[T]}\alpha_t\big)$ with high probability.
We also develop the first algorithm that achieves the optimal high-probability regret bound for weakly observable graphs.
arXiv Detail & Related papers (2022-10-04T04:36:15Z)
- Graph Neural Network Bandits [89.31889875864599]
We consider the bandit optimization problem with the reward function defined over graph-structured data.
Key challenges in this setting are scaling to large domains, and to graphs with many nodes.
We show that graph neural networks (GNNs) can be used to estimate the reward function.
arXiv Detail & Related papers (2022-07-13T18:12:36Z)
- Simultaneously Learning Stochastic and Adversarial Bandits with General Graph Feedback [15.429356827868514]
We introduce a new trade-off mechanism for exploration and exploitation of general feedback graphs.
We prove that the proposed algorithm simultaneously achieves $\mathrm{poly}(\log T)$ regret in the stochastic setting and minimax-optimal regret in the adversarial setting.
This is the first best-of-both-worlds result for general feedback graphs.
arXiv Detail & Related papers (2022-06-16T04:21:27Z)
- A Near-Optimal Best-of-Both-Worlds Algorithm for Online Learning with Feedback Graphs [21.563733343861713]
We consider online learning with feedback graphs, a sequential decision-making framework where the learner's feedback is determined by a directed graph over the action set.
We present a computationally efficient algorithm for learning in this framework that simultaneously achieves near-optimal regret bounds in both stochastic and adversarial environments.
arXiv Detail & Related papers (2022-06-01T15:14:32Z)
- Node Feature Extraction by Self-Supervised Multi-scale Neighborhood Prediction [123.20238648121445]
We propose a new self-supervised learning framework, Graph Information Aided Node feature exTraction (GIANT)
GIANT makes use of the eXtreme Multi-label Classification (XMC) formalism, which is crucial for fine-tuning the language model based on graph information.
We demonstrate the superior performance of GIANT over the standard GNN pipeline on Open Graph Benchmark datasets.
arXiv Detail & Related papers (2021-10-29T19:55:12Z)
- Causal Bandits on General Graphs [1.4502611532302039]
We study the problem of determining the best intervention in a Causal Bayesian Network (CBN) specified only by its causal graph.
We propose a simple regret minimization algorithm that takes as input a semi-Markovian causal graph with atomic interventions and possibly unobservable variables.
Our results indicate that the simple regret guarantee of our proposed algorithm can only be improved by considering more nuanced structural restrictions on the causal graph.
arXiv Detail & Related papers (2021-07-06T17:29:45Z)
- Learnable Graph Matching: Incorporating Graph Partitioning with Deep Feature Learning for Multiple Object Tracking [58.30147362745852]
Data association across frames is at the core of Multiple Object Tracking (MOT) task.
Existing methods mostly ignore the context information among tracklets and intra-frame detections.
We propose a novel learnable graph matching method to address these issues.
arXiv Detail & Related papers (2021-03-30T08:58:45Z)
- Adversarial Linear Contextual Bandits with Graph-Structured Side Observations [80.95090605985042]
A learning agent repeatedly chooses from a set of $K$ actions after being presented with a $d$-dimensional context vector.
The agent incurs and observes the loss of the chosen action, but also observes the losses of its neighboring actions in the observation structures.
Two efficient algorithms are developed based on EXP3.
arXiv Detail & Related papers (2020-12-10T15:40:07Z)
- Model Selection in Contextual Stochastic Bandit Problems [51.94632035240787]
We develop a meta-algorithm that selects between base algorithms.
We show through a lower bound that even when one of the base algorithms has $O(\sqrt{T})$ regret, in general it is impossible to get better than $\Omega(\sqrt{T})$ regret.
arXiv Detail & Related papers (2020-03-03T18:46:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.