Neural Contextual Bandits via Reward-Biased Maximum Likelihood
Estimation
- URL: http://arxiv.org/abs/2203.04192v1
- Date: Tue, 8 Mar 2022 16:33:36 GMT
- Title: Neural Contextual Bandits via Reward-Biased Maximum Likelihood
Estimation
- Authors: Yu-Heng Hung, Ping-Chun Hsieh
- Abstract summary: Reward-biased maximum likelihood estimation (RBMLE) is a classic principle in the adaptive control literature for tackling explore-exploit trade-offs.
This paper studies the contextual bandit problem with general bounded reward functions and proposes NeuralRBMLE, which adapts the RBMLE principle by adding a bias term to the log-likelihood to enforce exploration.
We show that both algorithms achieve comparable or better empirical regrets than the state-of-the-art methods on real-world datasets with non-linear reward functions.
- Score: 9.69596041242667
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reward-biased maximum likelihood estimation (RBMLE) is a classic principle in
the adaptive control literature for tackling explore-exploit trade-offs. This
paper studies the stochastic contextual bandit problem with general bounded
reward functions and proposes NeuralRBMLE, which adapts the RBMLE principle by
adding a bias term to the log-likelihood to enforce exploration. NeuralRBMLE
leverages the representation power of neural networks and directly encodes
exploratory behavior in the parameter space, without constructing confidence
intervals of the estimated rewards. We propose two variants of NeuralRBMLE
algorithms: The first variant directly obtains the RBMLE estimator by gradient
ascent, and the second variant simplifies RBMLE to a simple index policy
through an approximation. We show that both algorithms achieve
$\widetilde{\mathcal{O}}(\sqrt{T})$ regret. Through extensive experiments, we
demonstrate that the NeuralRBMLE algorithms achieve comparable or better
empirical regrets than the state-of-the-art methods on real-world datasets with
non-linear reward functions.
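A minimal sketch of the core idea behind the gradient-ascent variant, assuming Gaussian rewards (so a squared-error term stands in for the log-likelihood) and a small MLP reward model; the network sizes, the bias weight `alpha_t`, and the optimizer settings are illustrative placeholders, not the paper's exact configuration:

```python
# Hedged sketch of the RBMLE idea: fit the reward model by maximizing
# log-likelihood plus a reward-bias term, then play the greedy arm under
# the biased estimate. All hyperparameters here are illustrative.
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

def neural_rbmle_step(model, optimizer, contexts, rewards, arm_feats, alpha_t, steps=50):
    """contexts: (n, d) past context-action features; rewards: (n,) observations;
    arm_feats: (K, d) features of the arms available in the current round."""
    for _ in range(steps):
        optimizer.zero_grad()
        # Gaussian log-likelihood up to constants: negative mean squared error.
        log_lik = -((model(contexts) - rewards) ** 2).mean()
        # Reward-bias term: favors parameters that promise a high best-arm reward.
        bias = alpha_t * model(arm_feats).max()
        loss = -(log_lik + bias)  # gradient ascent on the biased objective
        loss.backward()
        optimizer.step()
    with torch.no_grad():
        return int(model(arm_feats).argmax())  # greedy arm under the biased estimate
```

In RBMLE-style schemes the bias weight typically grows slowly with the round index, so exploration is enforced early and fades as the likelihood term accumulates evidence; the paper's second variant avoids re-solving this biased problem at every round by approximating it with a closed-form index policy.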
Related papers
- Regret Minimization and Statistical Inference in Online Decision Making with High-dimensional Covariates [7.21848268647674]
We integrate the $\varepsilon$-greedy bandit algorithm for decision-making with a hard thresholding algorithm for estimating sparse bandit parameters.
Under a margin condition, our method achieves either $O(T^{1/2})$ regret or classical $O(T^{1/2})$-consistent inference.
arXiv Detail & Related papers (2024-11-10T01:47:11Z) - Provable Offline Preference-Based Reinforcement Learning [95.00042541409901]
We investigate the problem of offline Preference-based Reinforcement Learning (PbRL) with human feedback.
We consider the general reward setting where the reward can be defined over the whole trajectory.
We introduce a new single-policy concentrability coefficient, which can be upper bounded by the per-trajectory concentrability.
arXiv Detail & Related papers (2023-05-24T07:11:26Z) - Improved Regret for Efficient Online Reinforcement Learning with Linear
Function Approximation [69.0695698566235]
We study reinforcement learning with linear function approximation and adversarially changing cost functions.
We present a computationally efficient policy optimization algorithm for the challenging general setting of unknown dynamics and bandit feedback.
arXiv Detail & Related papers (2023-01-30T17:26:39Z) - Robust Imitation via Mirror Descent Inverse Reinforcement Learning [18.941048578572577]
This paper proposes to predict a sequence of reward functions, which are iterative solutions for a constrained convex problem.
We prove that the proposed mirror descent update rule ensures robust minimization of a Bregman divergence.
Our IRL method was applied on top of an adversarial framework, and it outperformed existing adversarial methods in an extensive suite of benchmarks.
arXiv Detail & Related papers (2022-10-20T12:25:21Z) - Provably Efficient Offline Reinforcement Learning with Trajectory-Wise
Reward [66.81579829897392]
We propose a novel offline reinforcement learning algorithm called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED).
PARTED decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy reward.
To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient in general MDP with trajectory-wise reward.
arXiv Detail & Related papers (2022-06-13T19:11:22Z) - Tractable and Near-Optimal Adversarial Algorithms for Robust Estimation
in Contaminated Gaussian Models [1.609950046042424]
Consider the problem of simultaneous estimation of location and variance matrix under Huber's contaminated Gaussian model.
First, we study minimum $f$-divergence estimation at the population level, corresponding to a generative adversarial method with a nonparametric discriminator.
We develop tractable adversarial algorithms with simple spline discriminators, which can be implemented via nested optimization.
The proposed methods are shown to achieve minimax optimal rates or near-optimal rates depending on the $f$-divergence and the penalty used.
arXiv Detail & Related papers (2021-12-24T02:46:51Z) - Anti-Concentrated Confidence Bonuses for Scalable Exploration [57.91943847134011]
Intrinsic rewards play a central role in handling the exploration-exploitation trade-off.
We introduce anti-concentrated confidence bounds for efficiently approximating the elliptical bonus.
We develop a practical variant for deep reinforcement learning that is competitive with contemporary intrinsic rewards on Atari benchmarks.
arXiv Detail & Related papers (2021-10-21T15:25:15Z) - On Reward-Free RL with Kernel and Neural Function Approximations:
Single-Agent MDP and Markov Game [140.19656665344917]
We study the reward-free RL problem, where an agent aims to thoroughly explore the environment without any pre-specified reward function.
We tackle this problem under the context of function approximation, leveraging powerful function approximators.
We establish the first provably efficient reward-free RL algorithm with kernel and neural function approximators.
arXiv Detail & Related papers (2021-10-19T07:26:33Z) - A unified view of likelihood ratio and reparameterization gradients [91.4645013545015]
We use a first principles approach to explain that LR and RP are alternative methods of keeping track of the movement of probability mass.
We show that the space of all possible estimators combining LR and RP can be completely parameterized by a flow field.
We prove that there cannot exist a single-sample estimator of this type outside our space, thus clarifying where we should be searching for better Monte Carlo gradient estimators.
arXiv Detail & Related papers (2021-05-31T11:53:08Z) - Reward Biased Maximum Likelihood Estimation for Reinforcement Learning [13.820705458648233]
The Reward-Biased Maximum Likelihood Estimate (RBMLE) was proposed for the adaptive control of Markov chains.
We show that it has a regret of $\mathcal{O}(\log T)$ over a time horizon of $T$ steps, similar to state-of-the-art algorithms.
arXiv Detail & Related papers (2020-11-16T06:09:56Z)