Greedy Bandits with Sampled Context
- URL: http://arxiv.org/abs/2007.16001v1
- Date: Mon, 27 Jul 2020 17:17:45 GMT
- Title: Greedy Bandits with Sampled Context
- Authors: Dom Huh
- Abstract summary: Greedy Bandits with Sampled Context (GB-SC) is a method for contextual multi-armed bandits that develops the prior from context information.
Our results show competitive performance on the Mushroom environment in terms of expected regret and expected cumulative regret.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Bayesian strategies for contextual bandits have proved promising in
single-state reinforcement learning tasks by modeling uncertainty using context
information from the environment. In this paper, we propose Greedy Bandits with
Sampled Context (GB-SC), a method for contextual multi-armed bandits that
develops the prior from context information using Thompson Sampling and
selects arms with an epsilon-greedy policy. The GB-SC framework allows the
context-reward dependency to be evaluated and, by leveraging the developed
prior, remains robust to partially observable context vectors. Our
experimental results show competitive performance on the Mushroom environment
in terms of expected regret and expected cumulative regret, along with
insights into how each context subset affects decision-making.
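A minimal sketch of the two-stage idea the abstract describes, assuming binary context features and Bernoulli rewards as in the Mushroom task; the class, the per-feature Beta posteriors, and the update rule below are illustrative guesses, not the authors' implementation:

```python
import numpy as np

class GBSC:
    """Hypothetical sketch: Thompson-sampled prior + epsilon-greedy arm choice."""

    def __init__(self, n_arms, n_context_features, epsilon=0.05, seed=None):
        self.epsilon = epsilon
        self.rng = np.random.default_rng(seed)
        # One Beta(alpha, beta) posterior per (context feature, arm) pair,
        # so each context subset's effect on the reward can be inspected.
        self.alpha = np.ones((n_context_features, n_arms))
        self.beta = np.ones((n_context_features, n_arms))

    def _active(self, context):
        # Missing context entries may be marked NaN; the surviving
        # posteriors provide robustness to partial observability.
        observed = ~np.isnan(context)
        return observed & (np.nan_to_num(context) > 0)

    def select_arm(self, context):
        active = self._active(context)
        # Thompson step: sample a success probability per (feature, arm)
        # and average the active features' samples into per-arm scores.
        samples = self.rng.beta(self.alpha, self.beta)
        scores = samples[active].mean(axis=0) if active.any() else samples.mean(axis=0)
        # Epsilon-greedy step over the sampled scores.
        if self.rng.random() < self.epsilon:
            return int(self.rng.integers(scores.shape[0]))
        return int(np.argmax(scores))

    def update(self, context, arm, reward):
        # Bernoulli-style update of every active feature's posterior.
        active = self._active(context)
        self.alpha[active, arm] += reward
        self.beta[active, arm] += 1 - reward
```

Keeping one posterior per (feature, arm) pair is what would let each context subset's influence be read off directly, and lets the remaining posteriors carry the decision when parts of the context vector are unobserved.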
Related papers
- Neural Dueling Bandits [58.90189511247936]
We use a neural network to estimate the reward function using preference feedback for the previously selected arms.
We then extend our theoretical results to contextual bandit problems with binary feedback, which is in itself a non-trivial contribution.
arXiv Detail & Related papers (2024-07-24T09:23:22Z)
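A hedged sketch of the preference-feedback estimation step summarized above, with a linear Bradley-Terry scorer standing in for the paper's neural network; the simulator and all names are illustrative assumptions:

```python
import numpy as np

def preference_grad(theta, x_win, x_lose):
    """Gradient of the Bradley-Terry loss -log sigmoid(theta @ (x_win - x_lose))."""
    diff = x_win - x_lose
    p_win = 1.0 / (1.0 + np.exp(-(theta @ diff)))
    return -(1.0 - p_win) * diff

rng = np.random.default_rng(0)
dim, lr = 5, 0.5
theta_true = rng.normal(size=dim)  # unknown reward parameters (simulator only)
theta = np.zeros(dim)              # learner's estimate

for _ in range(2000):
    x1, x2 = rng.normal(size=(2, dim))  # features of the two duelling arms
    # Simulated binary preference drawn from the true reward function.
    pref_1 = rng.random() < 1.0 / (1.0 + np.exp(-(theta_true @ (x1 - x2))))
    x_win, x_lose = (x1, x2) if pref_1 else (x2, x1)
    theta -= lr * preference_grad(theta, x_win, x_lose)  # SGD on the duel loss
```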
- Detecting Multimodal Situations with Insufficient Context and Abstaining from Baseless Predictions [75.45274978665684]
Vision-Language Understanding (VLU) benchmarks contain samples where answers rely on assumptions unsupported by the provided context.
We collect contextual data for each sample whenever available and train a context selection module to facilitate evidence-based model predictions.
We develop a general-purpose Context-AwaRe Abstention detector to identify samples lacking sufficient context and enhance model accuracy.
arXiv Detail & Related papers (2024-05-18T02:21:32Z)
- Information Capacity Regret Bounds for Bandits with Mediator Feedback [55.269551124587224]
We introduce the policy set capacity as an information-theoretic measure for the complexity of the policy set.
Adopting the classical EXP4 algorithm, we provide new regret bounds depending on the policy set capacity.
For a selection of policy set families, we prove nearly-matching lower bounds, scaling similarly with the capacity.
arXiv Detail & Related papers (2024-02-15T19:18:47Z)
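For reference, a minimal sketch of the classical EXP4 update that the capacity-dependent bounds above are stated for; the random expert advice and toy Bernoulli rewards are assumptions made for the demo:

```python
import numpy as np

# EXP4: exponential weights over a finite policy (expert) set,
# with importance-weighted reward estimates.
rng = np.random.default_rng(1)
n_arms, n_experts, eta, T = 3, 4, 0.1, 1000
log_w = np.zeros(n_experts)  # log-weights over experts

for t in range(T):
    # Each expert recommends a distribution over arms (random here for demo).
    advice = rng.dirichlet(np.ones(n_arms), size=n_experts)  # (experts, arms)
    w = np.exp(log_w - log_w.max())
    probs = (w / w.sum()) @ advice        # mixture distribution over arms
    arm = rng.choice(n_arms, p=probs)
    reward = float(rng.random() < 0.3 + 0.2 * arm)  # toy Bernoulli rewards
    # Unbiased importance-weighted estimate of the full reward vector.
    r_hat = np.zeros(n_arms)
    r_hat[arm] = reward / probs[arm]
    # Each expert is credited with its expected estimated reward.
    log_w += eta * (advice @ r_hat)
```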
- Thompson Sampling for Stochastic Bandits with Noisy Contexts: An Information-Theoretic Regret Analysis [4.297070083645049]
We investigate a contextual linear bandit problem where the agent observes a noisy, corrupted version of the true context.
Our objective is to design an action policy that can approximate that of an oracle.
arXiv Detail & Related papers (2024-01-21T18:57:38Z)
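A heuristic sketch of the setting above, assuming Gaussian context noise: sample a plausible true context from its posterior alongside the per-arm parameters, then act greedily on the sampled pair. This only illustrates the problem; it is not the paper's algorithm:

```python
import numpy as np

rng = np.random.default_rng(2)
dim, n_arms, sigma_ctx = 4, 3, 0.5
theta_true = rng.normal(size=(n_arms, dim))
A = np.stack([np.eye(dim)] * n_arms)   # per-arm precision matrices
b = np.zeros((n_arms, dim))

for t in range(500):
    ctx_true = rng.normal(size=dim)
    ctx_noisy = ctx_true + sigma_ctx * rng.normal(size=dim)
    # Posterior over the true context under a standard normal prior:
    # c | c_noisy ~ N(c_noisy / (1 + sigma^2), sigma^2 / (1 + sigma^2) * I).
    ctx_sample = ctx_noisy / (1 + sigma_ctx**2) + \
        rng.normal(size=dim) * np.sqrt(sigma_ctx**2 / (1 + sigma_ctx**2))
    # Thompson step: one parameter sample per arm, greedy on sampled values.
    values = []
    for a in range(n_arms):
        mean = np.linalg.solve(A[a], b[a])
        theta_s = rng.multivariate_normal(mean, np.linalg.inv(A[a]))
        values.append(theta_s @ ctx_sample)
    arm = int(np.argmax(values))
    reward = theta_true[arm] @ ctx_true + 0.1 * rng.normal()
    # Update with the noisy context, since the agent never sees ctx_true.
    A[arm] += np.outer(ctx_noisy, ctx_noisy)
    b[arm] += reward * ctx_noisy
```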
- Online learning in bandits with predicted context [8.257280652461159]
We consider the contextual bandit problem where at each time, the agent only has access to a noisy version of the context.
This setting is motivated by a wide range of applications where the true context for decision-making is unobserved.
We propose the first online algorithm in this setting with sublinear regret guarantees under mild conditions.
arXiv Detail & Related papers (2023-07-26T02:33:54Z)
- Thompson Sampling Regret Bounds for Contextual Bandits with sub-Gaussian rewards [44.025369660607645]
We study the performance of the Thompson Sampling algorithm for Contextual Bandit problems.
We introduce new bounds on the lifted information ratio that hold for sub-Gaussian rewards.
arXiv Detail & Related papers (2023-04-26T14:40:01Z)
- Contextual bandits with concave rewards, and an application to fair ranking [108.48223948875685]
We present the first algorithm with provably vanishing regret for Contextual Bandits with Concave Rewards (CBCR).
We derive a novel reduction from the CBCR regret to the regret of a scalar-reward problem.
Motivated by fairness in recommendation, we describe a special case of CBCR with rankings and fairness-aware objectives.
arXiv Detail & Related papers (2022-10-18T16:11:55Z)
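A speculative sketch of the reduction idea mentioned above: scalarize the vector reward with the gradient of the concave objective at the running average (Frank-Wolfe style) and feed the scalar reward to an ordinary bandit learner. The objective, the UCB stand-in learner, and all constants are placeholders:

```python
import numpy as np

def grad_f(z):
    # Example concave objective f(z) = sum(log(1 + z)); gradient 1 / (1 + z).
    return 1.0 / (1.0 + z)

rng = np.random.default_rng(3)
n_arms, d_rewards = 4, 2
mu = rng.random((n_arms, d_rewards))   # unknown mean reward vectors
counts = np.zeros(n_arms)
sums = np.zeros((n_arms, d_rewards))
z_avg = np.zeros(d_rewards)            # running average of vector rewards

for t in range(1, 2001):
    w = grad_f(z_avg)                  # scalarization weights this round
    # Optimistic scalar bandit on the scalarized estimates (stand-in learner).
    means = sums / np.maximum(counts, 1)[:, None]
    bonus = np.sqrt(2 * np.log(t) / np.maximum(counts, 1))
    arm = int(np.argmax(means @ w + bonus))
    r_vec = (rng.random(d_rewards) < mu[arm]).astype(float)
    counts[arm] += 1
    sums[arm] += r_vec
    z_avg += (r_vec - z_avg) / t
```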
- Worst-case Performance of Greedy Policies in Bandits with Imperfect Context Observations [1.370633147306388]
This work considers Greedy reinforcement learning policies that take actions as if the current estimates of the parameter and of the unobserved contexts coincide with the corresponding true values.
We establish that the non-asymptotic worst-case regret grows logarithmically with the time horizon and the failure probability, while it scales linearly with the number of arms.
arXiv Detail & Related papers (2022-04-10T21:27:56Z)
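A toy certainty-equivalence sketch of such a Greedy policy, assuming Gaussian context noise for illustration: point estimates of the context and the parameters are treated as exact, and no exploration bonus is added:

```python
import numpy as np

rng = np.random.default_rng(4)
dim, n_arms, sigma = 3, 3, 0.3
theta_true = rng.normal(size=(n_arms, dim))
A = np.stack([np.eye(dim)] * n_arms)   # per-arm precision matrices
b = np.zeros((n_arms, dim))

for t in range(1000):
    ctx = rng.normal(size=dim)
    obs = ctx + sigma * rng.normal(size=dim)   # imperfect context observation
    ctx_hat = obs / (1 + sigma**2)             # point estimate (posterior mean)
    theta_hat = np.stack([np.linalg.solve(A[a], b[a]) for a in range(n_arms)])
    # Greedy: estimates treated as the true values, no exploration.
    arm = int(np.argmax(theta_hat @ ctx_hat))
    reward = theta_true[arm] @ ctx + 0.1 * rng.normal()
    A[arm] += np.outer(ctx_hat, ctx_hat)
    b[arm] += reward * ctx_hat
```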
- Risk-Constrained Thompson Sampling for CVaR Bandits [82.47796318548306]
We consider a popular risk measure in quantitative finance known as the Conditional Value at Risk (CVaR).
We explore the performance of a Thompson Sampling-based algorithm CVaR-TS under this risk measure.
arXiv Detail & Related papers (2020-11-16T15:53:22Z)
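For concreteness, the empirical CVaR at level alpha is the mean of the worst alpha-fraction of observed rewards; risk-averse algorithms such as the CVaR-TS variant above rank arms by this tail average rather than by the plain mean. A small self-contained example:

```python
import numpy as np

def empirical_cvar(samples, alpha):
    """Mean of the worst alpha-fraction of samples (lower tail of rewards)."""
    x = np.sort(np.asarray(samples))          # ascending: worst outcomes first
    k = max(1, int(np.ceil(alpha * len(x))))  # size of the lower tail
    return x[:k].mean()

rewards = np.random.default_rng(5).normal(1.0, 2.0, size=10_000)
print(empirical_cvar(rewards, alpha=0.05))   # well below the mean of 1.0
```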
- Stochastic Linear Contextual Bandits with Diverse Contexts [17.35270010828849]
We show that when contexts are sufficiently diverse, the learner is able to utilize the information obtained during exploitation to shorten the exploration process.
We design the LinUCB-d algorithm, and propose a novel approach to analyze its regret performance.
arXiv Detail & Related papers (2020-03-05T14:51:17Z)
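The LinUCB-d analysis above builds on the standard LinUCB rule; for reference, a textbook sketch of that base algorithm (not of LinUCB-d itself):

```python
import numpy as np

class LinUCB:
    """Standard LinUCB: ridge-regression estimates plus a confidence bonus."""

    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha
        self.A = np.stack([np.eye(dim)] * n_arms)   # per-arm design matrices
        self.b = np.zeros((n_arms, dim))

    def select_arm(self, ctx):
        scores = []
        for A_a, b_a in zip(self.A, self.b):
            A_inv = np.linalg.inv(A_a)
            theta = A_inv @ b_a
            # Mean estimate plus an upper-confidence exploration bonus.
            scores.append(theta @ ctx + self.alpha * np.sqrt(ctx @ A_inv @ ctx))
        return int(np.argmax(scores))

    def update(self, arm, ctx, reward):
        self.A[arm] += np.outer(ctx, ctx)
        self.b[arm] += reward * ctx
```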
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.