On Bits and Bandits: Quantifying the Regret-Information Trade-off
- URL: http://arxiv.org/abs/2405.16581v1
- Date: Sun, 26 May 2024 14:18:38 GMT
- Title: On Bits and Bandits: Quantifying the Regret-Information Trade-off
- Authors: Itai Shufaro, Nadav Merlis, Nir Weinberger, Shie Mannor
- Abstract summary: In interactive decision-making tasks, information can be acquired by direct interactions, through receiving indirect feedback, and from external knowledgeable sources.
We show that information from external sources, measured in bits, can be traded off for regret, measured in reward.
We introduce the first Bayesian regret lower bounds that depend on the information an agent accumulates.
- Score: 62.64904903955711
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In interactive decision-making tasks, information can be acquired by direct interactions, through receiving indirect feedback, and from external knowledgeable sources. We examine the trade-off between the information an agent accumulates and the regret it suffers. We show that information from external sources, measured in bits, can be traded off for regret, measured in reward. We invoke information-theoretic methods for obtaining regret lower bounds, that also allow us to easily re-derive several known lower bounds. We then generalize a variety of interactive decision-making tasks with external information to a new setting. Using this setting, we introduce the first Bayesian regret lower bounds that depend on the information an agent accumulates. These lower bounds also prove the near-optimality of Thompson sampling for Bayesian problems. Finally, we demonstrate the utility of these bounds in improving the performance of a question-answering task with large language models, allowing us to obtain valuable insights.
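The abstract highlights the near-optimality of Thompson sampling for Bayesian problems. As a point of reference only, here is a minimal sketch of Bernoulli Thompson sampling with Beta(1, 1) priors; the function name, arm probabilities, and horizon are illustrative and not taken from the paper:

```python
import random

def thompson_sampling(arms, horizon, rng=random.Random(0)):
    """Bernoulli Thompson sampling with Beta(1, 1) priors.

    `arms` holds the true success probabilities (unknown to the agent);
    `horizon` is the number of interaction rounds. Returns total reward.
    """
    alphas = [1] * len(arms)  # posterior Beta alpha (successes + 1)
    betas = [1] * len(arms)   # posterior Beta beta (failures + 1)
    total_reward = 0
    for _ in range(horizon):
        # Sample a mean estimate from each arm's posterior, play the argmax.
        samples = [rng.betavariate(alphas[a], betas[a])
                   for a in range(len(arms))]
        a = max(range(len(arms)), key=lambda i: samples[i])
        reward = 1 if rng.random() < arms[a] else 0
        alphas[a] += reward
        betas[a] += 1 - reward
        total_reward += reward
    return total_reward
```

With a long enough horizon, play concentrates on the best arm, so the total reward approaches the best arm's mean times the horizon.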
Related papers
- Multi-Agent Imitation Learning: Value is Easy, Regret is Hard [52.31989962031179]
We study a multi-agent imitation learning (MAIL) problem where we take the perspective of a learner attempting to coordinate a group of agents.
Most prior work in MAIL essentially reduces the problem to matching the behavior of the expert within the support of the demonstrations.
While doing so is sufficient to drive the value gap between the learner and the expert to zero under the assumption that agents are non-strategic, it does not guarantee robustness to deviations by strategic agents.
arXiv Detail & Related papers (2024-06-06T16:18:20Z) - Multi-Armed Bandits with Abstention [62.749500564313834]
We introduce a novel extension of the canonical multi-armed bandit problem that incorporates an additional strategic element: abstention.
In this enhanced framework, the agent is not only tasked with selecting an arm at each time step, but also has the option to abstain from accepting the instantaneous reward before observing it.
arXiv Detail & Related papers (2024-02-23T06:27:12Z) - ReST meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agent [50.508669199496474]
We develop a ReAct-style LLM agent with the ability to reason and act upon external knowledge.
We refine the agent through a ReST-like method that iteratively trains on previous trajectories.
Starting from a prompted large model and after just two iterations of the algorithm, we can produce a fine-tuned small model.
arXiv Detail & Related papers (2023-12-15T18:20:15Z) - Exploiting Correlated Auxiliary Feedback in Parameterized Bandits [56.84649080789685]
We study a novel variant of the parameterized bandits problem in which the learner can observe additional auxiliary feedback that is correlated with the observed reward.
The auxiliary feedback is readily available in many real-life applications; e.g., an online platform that wants to recommend the best-rated services to its users can observe the user's rating of a service (reward) and collect additional information like service delivery time (auxiliary feedback).
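One generic way to exploit such correlated auxiliary feedback, not necessarily the algorithm of the paper above, is the control-variate trick: when the auxiliary signal's expectation is known, it can be subtracted out to reduce the variance of the reward estimate. A minimal sketch under that assumption:

```python
def control_variate_mean(rewards, aux, aux_mean):
    """Variance-reduced reward estimate using correlated auxiliary
    feedback as a control variate; `aux_mean` is its known expectation."""
    n = len(rewards)
    r_bar = sum(rewards) / n
    a_bar = sum(aux) / n
    # Empirical optimal coefficient: Cov(reward, aux) / Var(aux).
    cov = sum((r - r_bar) * (a - a_bar) for r, a in zip(rewards, aux)) / n
    var_a = sum((a - a_bar) ** 2 for a in aux) / n
    beta = cov / var_a if var_a > 0 else 0.0
    return r_bar - beta * (a_bar - aux_mean)
```

The stronger the correlation between reward and auxiliary feedback, the larger the variance reduction; with perfectly correlated signals the estimator recovers the known mean exactly.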
arXiv Detail & Related papers (2023-11-05T17:27:06Z) - Lifting the Information Ratio: An Information-Theoretic Analysis of Thompson Sampling for Contextual Bandits [17.470829701201435]
We adapt the information-theoretic perspective of Russo and Van Roy [2016] to the contextual setting by introducing a new concept of information ratio.
This allows us to bound the regret in terms of the entropy of the prior distribution through a remarkably simple proof.
An interesting special case is that of logistic bandits with d-dimensional parameters, K actions, and Lipschitz logits, for which we provide a $\widetilde{O}(\sqrt{dKT})$ regret upper bound that does not depend on the smallest slope of the sigmoid link function.
arXiv Detail & Related papers (2022-05-27T12:04:07Z) - Remote Contextual Bandits [18.40166098572039]
We consider a remote contextual multi-armed bandit (CMAB) problem.
The decision-maker observes the context and the reward, but must communicate the actions to be taken by the agents over a rate-limited communication channel.
We study the fundamental information-theoretic limits of this problem by letting the number of agents go to infinity, and study the regret achieved when a Thompson sampling strategy is adopted.
arXiv Detail & Related papers (2022-02-10T17:31:20Z) - Role of collective information in networks of quantum operating agents [0.0]
A network of agents is considered whose decision processes are described by quantum decision theory.
As a result of the interplay between the contributions to each agent's decision process, the process of choosing between several alternatives is multimodal.
The information field common to all agents tends to smooth out sharp variations in the temporal behaviour of the probabilities.
arXiv Detail & Related papers (2022-01-26T15:35:25Z) - A Bayesian Framework for Information-Theoretic Probing [51.98576673620385]
We argue that probing should be seen as approximating a mutual information.
Under that operationalization, one reaches the rather unintuitive conclusion that representations encode exactly the same information about a target task as the original sentences.
This paper proposes a new framework to measure what we term Bayesian mutual information.
arXiv Detail & Related papers (2021-09-08T18:08:36Z) - A Bit Better? Quantifying Information for Bandit Learning [24.943571034827297]
The information ratio offers an approach to assessing the efficacy with which an agent balances between exploration and exploitation.
Recent work has inspired consideration of alternative information measures, particularly for use in analysis of bandit learning algorithms to arrive at tighter regret bounds.
We investigate whether quantification of information via such alternatives can improve the realized performance of information-directed sampling.
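Information-directed sampling balances estimated regret against information gain. As an illustration only, here is a deterministic simplification of the selection rule: pick the action minimizing squared regret over information gain. True IDS optimizes this ratio over randomized policies, and the input estimates here are hypothetical:

```python
def ids_action(regret_estimates, info_gains):
    """Deterministic sketch of information-directed sampling: choose the
    action minimizing the information ratio Delta_a**2 / g_a, where
    Delta_a is the estimated regret of action a and g_a its estimated
    information gain (assumed positive)."""
    ratios = [(d * d) / g for d, g in zip(regret_estimates, info_gains)]
    return min(range(len(ratios)), key=lambda a: ratios[a])
```

An action with moderate regret but high information gain can be preferred over a low-regret action that reveals almost nothing, which is exactly the exploration-exploitation balance the information ratio quantifies.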
arXiv Detail & Related papers (2021-02-18T17:16:04Z) - Multi-Armed Bandits with Local Differential Privacy [32.538737981996405]
In bandit systems, the rewards may refer to the users' activities, which may involve private information.
We study the regret upper and lower bounds for MAB algorithms with a given LDP guarantee.
We prove a lower bound and propose algorithms whose regret upper bounds match the lower bound up to constant factors.
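A standard way to obtain an epsilon-LDP guarantee on bandit rewards, though not necessarily the mechanism used in the paper above, is to add Laplace noise locally before the reward leaves the user. A sketch assuming rewards bounded in [0, 1] (sensitivity 1):

```python
import math
import random

def privatize_reward(reward, epsilon, rng=random.Random(0)):
    """Add Laplace(1/epsilon) noise to a reward in [0, 1], giving
    epsilon-local differential privacy (sensitivity 1).

    Laplace sampling via inverse CDF: X = -b * sgn(U) * ln(1 - 2|U|)
    for U uniform on (-0.5, 0.5) and scale b = 1/epsilon.
    """
    u = rng.random() - 0.5
    scale = 1.0 / epsilon
    noise = -scale * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)
    return reward + noise
```

The noise is zero-mean, so averaging many privatized rewards still recovers the true mean; the price is extra variance (2/epsilon**2 per sample), which is what drives the regret gap between LDP and non-private bandits.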
arXiv Detail & Related papers (2020-07-06T23:36:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.