Stochastic contextual bandits with graph feedback: from independence
number to MAS number
- URL: http://arxiv.org/abs/2402.18591v1
- Date: Mon, 12 Feb 2024 06:56:13 GMT
- Title: Stochastic contextual bandits with graph feedback: from independence
number to MAS number
- Authors: Yuxiao Wen, Yanjun Han, Zhengyuan Zhou
- Abstract summary: We consider contextual bandits with graph feedback, a class of interactive learning problems with richer structures than vanilla contextual bandits.
We provide algorithms that achieve near-optimal regrets for important classes of context sequences and/or feedback graphs.
- Score: 31.58368435306069
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider contextual bandits with graph feedback, a class of interactive
learning problems with richer structures than vanilla contextual bandits, where
taking an action reveals the rewards for all neighboring actions in the
feedback graph under all contexts. Unlike the multi-armed bandits setting where
a growing literature has painted a near-complete understanding of graph
feedback, much remains unexplored in the contextual bandits counterpart. In
this paper, we make inroads into this inquiry by establishing a regret lower
bound $\Omega(\sqrt{\beta_M(G) T})$, where $M$ is the number of contexts, $G$
is the feedback graph, and $\beta_M(G)$ is our proposed graph-theoretical
quantity that characterizes the fundamental learning limit for this class of
problems. Interestingly, $\beta_M(G)$ interpolates between $\alpha(G)$ (the
independence number of the graph) and $\mathsf{m}(G)$ (the maximum acyclic
subgraph (MAS) number of the graph) as the number of contexts $M$ varies. We
also provide algorithms that achieve near-optimal regrets for important classes
of context sequences and/or feedback graphs, such as transitively closed graphs
that find applications in auctions and inventory control. In particular, with
many contexts, our results show that the MAS number completely characterizes
the statistical complexity for contextual bandits, as opposed to the
independence number in multi-armed bandits.
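To make the two graph quantities concrete, the following is an illustrative brute-force sketch (not from the paper) that computes the independence number $\alpha(G)$ and the MAS number $\mathsf{m}(G)$ of a small directed feedback graph. The function names and the example 3-cycle graph are hypothetical; exact enumeration over all vertex subsets is only feasible for tiny graphs.

```python
from itertools import combinations

# Directed feedback graph as an adjacency dict: u -> set of out-neighbors.
# Hypothetical example: a directed 3-cycle 0 -> 1 -> 2 -> 0.
G = {0: {1}, 1: {2}, 2: {0}}

def is_independent(adj, subset):
    # Independent set: no edge in either direction between any two members.
    return all(v not in adj[u] and u not in adj[v]
               for u, v in combinations(subset, 2))

def is_acyclic(adj, subset):
    # The induced directed subgraph is a DAG iff DFS finds no back edge.
    sub = set(subset)
    WHITE, GRAY, BLACK = 0, 1, 2
    color = dict.fromkeys(sub, WHITE)
    def dfs(u):
        color[u] = GRAY
        for w in adj[u]:
            if w in sub:
                if color[w] == GRAY:
                    return False          # back edge => cycle
                if color[w] == WHITE and not dfs(w):
                    return False
        color[u] = BLACK
        return True
    return all(color[v] != WHITE or dfs(v) for v in sub)

def largest(adj, pred):
    # Size of the largest vertex subset satisfying pred (brute force).
    vs = list(adj)
    return max(len(s) for r in range(len(vs) + 1)
               for s in combinations(vs, r) if pred(adj, s))

alpha = largest(G, is_independent)  # 1: every pair is joined by some edge
mas = largest(G, is_acyclic)        # 2: any two vertices induce a DAG
```

On the directed 3-cycle the two quantities already differ ($\alpha = 1$ while $\mathsf{m} = 2$), which is the kind of gap the paper's $\beta_M(G)$ interpolates across as $M$ grows.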
Related papers
- G-Retriever: Retrieval-Augmented Generation for Textual Graph Understanding and Question Answering [61.93058781222079]
We develop a flexible question-answering framework targeting real-world textual graphs.
We introduce the first retrieval-augmented generation (RAG) approach for general textual graphs.
G-Retriever performs RAG over a graph by formulating this task as a Prize-Collecting Steiner Tree optimization problem.
arXiv Detail & Related papers (2024-02-12T13:13:04Z) - Online Learning with Feedback Graphs: The True Shape of Regret [82.00098840619847]
We prove that the minimax regret is proportional to $R^*$ for any graph and time horizon $T$.
Introducing an intricate exploration strategy, we define an algorithm that achieves the minimax optimal regret bound.
arXiv Detail & Related papers (2023-06-05T15:35:00Z) - Stochastic Contextual Bandits with Graph-based Contexts [0.0]
We generalize the on-line graph prediction problem to a version of contextual bandit problems.
Our algorithm relies on the optimal bandit algorithm by Zimmert and Seldin [AISTATS'19, JMLR'21].
arXiv Detail & Related papers (2023-05-02T14:51:35Z) - Improved High-Probability Regret for Adversarial Bandits with
Time-Varying Feedback Graphs [62.52390282012508]
We study high-probability regret bounds for adversarial $K$-armed bandits with time-varying feedback graphs over $T$ rounds.
We develop an algorithm that achieves the optimal regret $\widetilde{\mathcal{O}}\big(\big(\sum_{t=1}^{T} \alpha_t\big)^{1/2} + \max_{t \in [T]} \alpha_t\big)$ with high probability.
We also develop the first algorithm that achieves the optimal high-probability regret bound for weakly observable graphs.
arXiv Detail & Related papers (2022-10-04T04:36:15Z) - Multi-armed Bandit Learning on a Graph [0.0]
We study an extension of MAB called the graph bandit, where an agent travels over a graph to maximize the reward collected from different nodes.
We design a learning algorithm, G-UCB, that balances long-term exploration-exploitation using the principle of optimism.
Our proposed algorithm achieves $O(\sqrt{|S|T\log T} + D|S|\log T)$ learning regret, where $|S|$ is the number of nodes and $D$ is the diameter of the graph.
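The "principle of optimism" mentioned above is the classical upper-confidence-bound idea. Below is a minimal vanilla UCB1 sketch on Bernoulli arms, purely for illustration; it is not the paper's G-UCB, which additionally plans travel over the graph, and the arm means are hypothetical.

```python
import math
import random

def ucb1(means, T, seed=0):
    """Vanilla UCB1 on Bernoulli arms (illustrative sketch, not G-UCB)."""
    rng = random.Random(seed)
    K = len(means)
    counts = [0] * K      # number of pulls per arm
    totals = [0.0] * K    # cumulative reward per arm
    for t in range(1, T + 1):
        if t <= K:
            arm = t - 1   # play each arm once to initialize
        else:
            # Optimism: empirical mean plus an exploration bonus that
            # shrinks as an arm accumulates pulls.
            arm = max(range(K),
                      key=lambda a: totals[a] / counts[a]
                                    + math.sqrt(2 * math.log(t) / counts[a]))
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward
    return counts

pulls = ucb1([0.9, 0.5, 0.1], T=3000)
```

After a few thousand rounds the best arm should receive the bulk of the pulls, with suboptimal pulls growing only logarithmically in $T$.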
arXiv Detail & Related papers (2022-09-20T02:31:42Z) - A Near-Optimal Best-of-Both-Worlds Algorithm for Online Learning with
Feedback Graphs [21.563733343861713]
We consider online learning with feedback graphs, a sequential decision-making framework where the learner's feedback is determined by a directed graph over the action set.
We present a computationally efficient algorithm for learning in this framework that simultaneously achieves near-optimal regret bounds in both stochastic and adversarial environments.
arXiv Detail & Related papers (2022-06-01T15:14:32Z) - Improved Algorithms for Bandit with Graph Feedback via Regret
Decomposition [2.3034251503343466]
The problem of bandit with graph feedback generalizes both the multi-armed bandit (MAB) problem and the learning with expert advice problem.
We propose a new algorithmic framework for the problem based on a partition of the feedback graph.
arXiv Detail & Related papers (2022-05-30T13:07:42Z) - Adversarial Linear Contextual Bandits with Graph-Structured Side
Observations [80.95090605985042]
A learning agent repeatedly chooses from a set of $K$ actions after being presented with a $d$-dimensional context vector.
The agent incurs and observes the loss of the chosen action, but also observes the losses of its neighboring actions in the observation structures.
Two efficient algorithms are developed based on $\texttt{EXP3}$.
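For reference, the core of EXP3 is short: exponential weights over arms with importance-weighted loss estimates. The sketch below is the generic loss-based version, not the paper's variants (which additionally exploit the graph-structured side observations); the always-zero-loss arm in the example adversary is hypothetical.

```python
import math
import random

def exp3(K, T, loss, eta, seed=0):
    """Loss-based EXP3: exponential weights with importance weighting."""
    rng = random.Random(seed)
    w = [1.0] * K
    probs = [1.0 / K] * K
    for t in range(T):
        total = sum(w)
        probs = [x / total for x in w]
        arm = rng.choices(range(K), weights=probs)[0]
        # Unbiased loss estimate: nonzero only for the played arm,
        # scaled up by 1/probability so its expectation is the true loss.
        est = loss(t, arm) / probs[arm]
        w[arm] *= math.exp(-eta * est)
    return probs

# Hypothetical adversary: arm 0 always incurs loss 0, the others loss 1.
final = exp3(K=3, T=4000,
             loss=lambda t, a: 0.0 if a == 0 else 1.0,
             eta=math.sqrt(2 * math.log(3) / (3 * 4000)))
```

With the standard learning rate $\eta = \sqrt{2\ln K / (KT)}$, the sampling distribution concentrates on the lowest-loss arm as $T$ grows.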
arXiv Detail & Related papers (2020-12-10T15:40:07Z) - Impact of Representation Learning in Linear Bandits [83.17684841392754]
We study how representation learning can improve the efficiency of bandit problems.
We present a new algorithm which achieves $\widetilde{O}(T\sqrt{kN} + \sqrt{dkNT})$ regret, where $N$ is the number of rounds we play for each bandit.
arXiv Detail & Related papers (2020-10-13T16:35:30Z) - Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the generated content (including all information) and is not responsible for any consequences of its use.