Reward-Biased Maximum Likelihood Estimation for Linear Stochastic
Bandits
- URL: http://arxiv.org/abs/2010.04091v1
- Date: Thu, 8 Oct 2020 16:17:53 GMT
- Title: Reward-Biased Maximum Likelihood Estimation for Linear Stochastic
Bandits
- Authors: Yu-Heng Hung, Ping-Chun Hsieh, Xi Liu and P. R. Kumar
- Abstract summary: We develop novel index policies that we prove achieve order-optimality, and show that they achieve empirical performance competitive with the state-of-the-art benchmark methods.
The new policies achieve this with low computation time per pull for linear bandits, yielding both favorable regret and computational efficiency.
- Score: 16.042075861624056
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modifying the reward-biased maximum likelihood method originally proposed in
the adaptive control literature, we propose novel learning algorithms to handle
the explore-exploit trade-off in linear bandit problems as well as generalized
linear bandit problems. We develop novel index policies that we prove achieve
order-optimality, and show that they achieve empirical performance competitive
with the state-of-the-art benchmark methods in extensive experiments. The new
policies achieve this with low computation time per pull for linear bandits,
thereby achieving both favorable regret and computational efficiency.
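As a concrete illustration, here is a minimal sketch of a reward-bias-style index policy for linear bandits, assuming a regularized least-squares estimate and a bias weight alpha(t) that grows with t; the exact index and bias schedule used in the paper may differ.

```python
import numpy as np

def reward_biased_index(arms, V, b, t, alpha=lambda t: np.sqrt(np.log(t + 1.0))):
    """Illustrative reward-bias-style index for linear bandits (not the paper's exact formula).

    arms : (K, d) array of arm feature vectors
    V    : (d, d) regularized design matrix, V = lam * I + sum_s x_s x_s^T
    b    : (d,)   response vector, b = sum_s r_s x_s
    t    : current round; the bias weight alpha(t) must grow with t
    """
    V_inv = np.linalg.inv(V)
    theta_hat = V_inv @ b                        # regularized least-squares estimate
    means = arms @ theta_hat                     # estimated mean reward per arm
    bonus = np.einsum("ij,jk,ik->i", arms, V_inv, arms)  # x^T V^{-1} x per arm
    return means + 0.5 * alpha(t) * bonus        # reward-biased index; pull the argmax

# Usage sketch with a hypothetical true parameter theta_star.
rng = np.random.default_rng(0)
K, d = 20, 5
arms = rng.normal(size=(K, d))
theta_star = rng.normal(size=d)
V, b = np.eye(d), np.zeros(d)
for t in range(1, 1001):
    a = int(np.argmax(reward_biased_index(arms, V, b, t)))
    x = arms[a]
    r = x @ theta_star + rng.normal()            # noisy linear reward
    V += np.outer(x, x)                          # rank-one design update
    b += r * x
```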
Related papers
- Achieving $\widetilde{\mathcal{O}}(\sqrt{T})$ Regret in Average-Reward POMDPs with Known Observation Models [56.92178753201331]
We tackle average-reward infinite-horizon POMDPs with an unknown transition model.
We present a novel and simple estimator that overcomes this barrier.
arXiv Detail & Related papers (2025-01-30T22:29:41Z)
Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
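As a rough illustration of an objective that combines a preference loss with a supervised (SFT) term, here is a minimal sketch assuming a DPO-style preference loss on chosen/rejected log-probabilities; the weighting and exact form are assumptions rather than the paper's prescription.

```python
import torch.nn.functional as F

def preference_plus_sft_loss(logp_chosen, logp_rejected,
                             ref_logp_chosen, ref_logp_rejected,
                             beta=0.1, sft_weight=1.0):
    """Sketch: DPO-style preference loss regularized by an SFT log-likelihood term.

    All inputs are summed log-probabilities of whole responses under the
    policy (logp_*) and a frozen reference model (ref_logp_*).
    """
    # Implicit reward margin between chosen and rejected responses.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    preference_loss = -F.logsigmoid(margin).mean()
    sft_loss = -logp_chosen.mean()               # maximize likelihood of chosen responses
    return preference_loss + sft_weight * sft_loss
```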
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
Indexed Minimum Empirical Divergence-Based Algorithms for Linear Bandits [55.938644481736446]
Indexed Minimum Empirical Divergence (IMED) is a highly effective approach to the multi-armed bandit problem.
It has been observed to empirically outperform UCB-based algorithms and Thompson Sampling.
We present novel linear versions of the IMED algorithm, which we call the family of LinIMED algorithms.
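For context, the classical multi-armed IMED rule can be sketched as follows, here instantiated with a Gaussian divergence; the LinIMED variants adapt this idea to linear bandits in ways the summary does not specify.

```python
import numpy as np

def imed_choose(counts, means, sigma2=1.0):
    """Classical IMED arm selection (Gaussian-divergence instantiation).

    counts : pulls per arm, means : empirical mean reward per arm.
    Index = N_a * D(mu_a, mu_max) + log N_a; play the arm minimizing it.
    """
    counts = np.asarray(counts, dtype=float)
    means = np.asarray(means, dtype=float)
    if np.any(counts == 0):                      # pull each arm once first
        return int(np.argmin(counts))
    mu_max = means.max()
    divergence = (mu_max - means) ** 2 / (2.0 * sigma2)  # KL between Gaussians with variance sigma2
    return int(np.argmin(counts * divergence + np.log(counts)))
```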
arXiv Detail & Related papers (2024-05-24T04:11:58Z)
Exploration via linearly perturbed loss minimisation [4.856378369489158]
We introduce exploration via linear loss perturbations (EVILL) for structured bandit problems.
We show that, for the case of generalised linear bandits, EVILL reduces to perturbed history exploration (PHE), a method where exploration is done by training on randomly perturbed rewards.
We propose data-dependent perturbations not present in previous PHE-type methods that allow EVILL to match the performance of Thompson-sampling-style parameter-perturbation methods.
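A minimal sketch of the perturbed-history idea referenced above, assuming a linear model and Gaussian pseudo-noise; EVILL's data-dependent perturbations are not reproduced here.

```python
import numpy as np

def phe_style_action(arms, X, r, lam=1.0, noise_scale=1.0, rng=None):
    """Perturbed-history-style action selection for a linear model (sketch).

    arms : (K, d) candidate features, X : (n, d) past features, r : (n,) past rewards.
    Exploration comes solely from refitting on randomly perturbed rewards.
    """
    rng = np.random.default_rng() if rng is None else rng
    if len(r) == 0:
        return int(rng.integers(len(arms)))
    z = rng.normal(scale=noise_scale, size=len(r))     # fresh pseudo-noise each round
    V = lam * np.eye(X.shape[1]) + X.T @ X
    theta_tilde = np.linalg.solve(V, X.T @ (r + z))     # least-squares fit on perturbed rewards
    return int(np.argmax(arms @ theta_tilde))           # act greedily on the perturbed fit
```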
arXiv Detail & Related papers (2023-11-13T18:54:43Z)
Convex Methods for Constrained Linear Bandits [2.5782420501870296]
This work presents a comprehensive study on the computational aspects of safe bandit algorithms, specifically safe linear bandits.
We first characterize the properties of the optimal policy for the safe linear bandit problem and then propose an end-to-end pipeline of safe linear bandit algorithms.
arXiv Detail & Related papers (2023-11-07T20:45:46Z)
Directional Optimism for Safe Linear Bandits [4.84955052297084]
The safe linear bandit problem is a version of the classical linear bandit problem where the learner's actions must satisfy an uncertain constraint at all rounds.
We find that it is possible to achieve improved regret guarantees for both well-separated problem instances and action sets that are finite star convex sets.
Lastly, we introduce a generalization of the safe linear bandit setting where the constraints are convex and adapt our algorithms and analyses to this setting by leveraging a novel convex-analysis based approach.
arXiv Detail & Related papers (2023-08-29T03:54:53Z)
PopArt: Efficient Sparse Regression and Experimental Design for Optimal Sparse Linear Bandits [29.097522376094624]
We propose a simple and computationally efficient sparse linear estimation method called PopArt.
We derive sparse linear bandit algorithms that enjoy improved regret upper bounds upon the state of the art.
arXiv Detail & Related papers (2022-10-25T19:13:20Z)
A Provably Efficient Model-Free Posterior Sampling Method for Episodic Reinforcement Learning [50.910152564914405]
Existing posterior sampling methods for reinforcement learning are limited by being model-based or lacking worst-case theoretical guarantees beyond linear MDPs.
This paper proposes a new model-free formulation of posterior sampling that applies to more general episodic reinforcement learning problems with theoretical guarantees.
arXiv Detail & Related papers (2022-08-23T12:21:01Z)
Anti-Concentrated Confidence Bonuses for Scalable Exploration [57.91943847134011]
Intrinsic rewards play a central role in handling the exploration-exploitation trade-off.
We introduce anti-concentrated confidence bounds for efficiently approximating the elliptical bonus.
We develop a practical variant for deep reinforcement learning that is competitive with contemporary intrinsic rewards on Atari benchmarks.
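For reference, the elliptical bonus that such methods approximate is the standard quantity sketched below; this shows the exact bonus, not the paper's approximation.

```python
import numpy as np

def elliptical_bonus(x, V_inv, beta=1.0):
    """Exact elliptical exploration bonus beta * sqrt(x^T V^{-1} x) for a feature vector x."""
    return beta * np.sqrt(x @ V_inv @ x)
```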
arXiv Detail & Related papers (2021-10-21T15:25:15Z)
Bias-Robust Bayesian Optimization via Dueling Bandit [57.82422045437126]
We consider Bayesian optimization in settings where observations can be adversarially biased.
We propose a novel approach for dueling bandits based on information-directed sampling (IDS).
Thereby, we obtain the first efficient kernelized algorithm for dueling bandits that comes with cumulative regret guarantees.
arXiv Detail & Related papers (2021-05-25T10:08:41Z)
Experimental Design for Regret Minimization in Linear Bandits [19.8309784360219]
We propose a novel experimental design-based algorithm to minimize regret in online linear and combinatorial bandits.
We provide state-of-the-art finite time regret guarantees and show that our algorithm can be applied in both the bandit and semi-bandit feedback regimes.
arXiv Detail & Related papers (2020-11-01T17:59:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.