Reward Biased Maximum Likelihood Estimation for Reinforcement Learning
- URL: http://arxiv.org/abs/2011.07738v3
- Date: Sat, 15 May 2021 20:47:58 GMT
- Title: Reward Biased Maximum Likelihood Estimation for Reinforcement Learning
- Authors: Akshay Mete, Rahul Singh, Xi Liu and P. R. Kumar
- Abstract summary: The Reward-Biased Maximum Likelihood Estimate (RBMLE) was proposed for the adaptive control of Markov chains.
We show that it has a regret of $\mathcal{O}(\log T)$ over a time horizon of $T$ steps, similar to state-of-the-art algorithms.
- Score: 13.820705458648233
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Reward-Biased Maximum Likelihood Estimate (RBMLE) for adaptive control of
Markov chains was proposed to overcome the central obstacle of what is
variously called the fundamental "closed-loop identifiability problem" of adaptive
control, the "dual control problem", or, contemporaneously, the "exploration
vs. exploitation problem". It exploited the key observation that since the
maximum likelihood parameter estimator can asymptotically identify the
closed-loop transition probabilities under a certainty equivalent approach, the
limiting parameter estimates must necessarily have an optimal reward that is
less than the optimal reward attainable for the true but unknown system. Hence
it proposed a counteracting reverse bias in favor of parameters with larger
optimal rewards, providing a solution to the fundamental problem alluded to
above. It thereby proposed an optimistic approach of favoring parameters with
larger optimal rewards, now known as "optimism in the face of uncertainty". The
RBMLE approach has been proved to be long-term average reward optimal in a
variety of contexts. However, modern attention is focused on the much finer
notion of "regret", or finite-time performance. Recent analysis of RBMLE for
multi-armed stochastic bandits and linear contextual bandits has shown that it
not only has state-of-the-art regret, but it also exhibits empirical
performance comparable to or better than the best current contenders, and leads
to strikingly simple index policies. Motivated by this, we examine the
finite-time performance of RBMLE for reinforcement learning tasks that involve
the general problem of optimal control of unknown Markov Decision Processes. We
show that it has a regret of $\mathcal{O}( \log T)$ over a time horizon of $T$
steps, similar to state-of-the-art algorithms. Simulation studies show that
RBMLE outperforms other algorithms such as UCRL2 and Thompson Sampling.
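To make the bias idea concrete, here is a minimal, illustrative sketch of an RBMLE-style rule for a small tabular MDP, not the authors' exact algorithm: at each step the learner picks, from a finite candidate set of transition models, the one maximizing the transition log-likelihood plus a bias $\alpha(t)$ times that model's optimal long-run average reward, then follows that model's optimal policy. The finite candidate set, the helper names, and the $\alpha(t) = \log(t+1)$ schedule are assumptions made for the sketch.

```python
# Illustrative RBMLE-style sketch for a tabular MDP (assumptions: a finite
# candidate set of transition models, known rewards, bias alpha(t) = log(t+1);
# these choices and helper names are not taken from the paper).
import numpy as np

def optimal_average_reward(P, R, iters=2000):
    """Approximate the optimal long-run average reward J*(theta) and an
    optimal policy of a tabular MDP via relative value iteration.
    P has shape (S, A, S): P[s, a, s'] = Pr(s' | s, a); R has shape (S, A)."""
    S, A, _ = P.shape
    h = np.zeros(S)
    Q = np.zeros((S, A))
    gain = 0.0
    for _ in range(iters):
        Q = R + P @ h                 # Q[s, a] = r(s, a) + E[h(s') | s, a]
        h_new = Q.max(axis=1)
        gain = h_new[0]               # state 0 as the reference state
        h = h_new - gain
    return gain, Q.argmax(axis=1)

def rbmle_select(candidates, R, counts, t):
    """Return (model, policy) maximizing log-likelihood + alpha(t) * J*(theta).

    counts[s, a, s'] holds the number of observed (s, a, s') transitions."""
    alpha = np.log(t + 1.0)           # assumed bias schedule for this sketch
    best, best_val = None, -np.inf
    for P in candidates:
        loglik = np.sum(counts * np.log(P + 1e-12))
        gain, policy = optimal_average_reward(P, R)
        val = loglik + alpha * gain   # reward bias favors optimistic models
        if val > best_val:
            best, best_val = (P, policy), val
    return best
```

The agent would execute the returned policy, record the observed transition in `counts`, and re-select at the next step. Because the bias grows only logarithmically, the likelihood term eventually dominates, while the reward bias keeps the selection tilted toward models promising higher reward, which is the "optimism" described in the abstract.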
Related papers
- Model-Based Epistemic Variance of Values for Risk-Aware Policy Optimization [59.758009422067]
We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning.
We propose a new uncertainty Bellman equation (UBE) whose solution converges to the true posterior variance over values.
We introduce a general-purpose policy optimization algorithm, Q-Uncertainty Soft Actor-Critic (QU-SAC) that can be applied for either risk-seeking or risk-averse policy optimization.
arXiv Detail & Related papers (2023-12-07T15:55:58Z) - Autoregressive Bandits [58.46584210388307]
We propose a novel online learning setting, Autoregressive Bandits, in which the observed reward is governed by an autoregressive process of order $k$.
We show that, under mild assumptions on the reward process, the optimal policy can be conveniently computed.
We then devise a new optimistic regret minimization algorithm, namely, AutoRegressive Upper Confidence Bound (AR-UCB), that suffers sublinear regret of order $\widetilde{\mathcal{O}}\left(\frac{(k+1)^{3/2}\sqrt{nT}}{(1-\Gamma)^2}\right)$.
arXiv Detail & Related papers (2022-12-12T21:37:36Z) - Learning to Optimize with Stochastic Dominance Constraints [103.26714928625582]
In this paper, we develop a simple yet efficient approach for the problem of comparing uncertain quantities.
We recast the inner optimization in the Lagrangian as a learning problem for surrogate approximation, which bypasses the apparent intractability.
The proposed light-SD demonstrates superior performance on several representative problems ranging from finance to supply chain management.
arXiv Detail & Related papers (2022-11-14T21:54:31Z) - Neural Contextual Bandits via Reward-Biased Maximum Likelihood
Estimation [9.69596041242667]
Reward-biased maximum likelihood estimation (RBMLE) is a classic principle in the adaptive control literature for tackling explore-exploit trade-offs.
This paper studies the contextual bandit problem with general bounded reward functions and proposes NeuralRBMLE, which adapts the RBMLE principle by adding a bias term to the log-likelihood to enforce exploration (a simplified bandit-style sketch of this bias-term idea appears after this list).
We show that both algorithms achieve comparable or better empirical regrets than the state-of-the-art methods on real-world datasets with non-linear reward functions.
arXiv Detail & Related papers (2022-03-08T16:33:36Z) - Policy Gradient Bayesian Robust Optimization for Imitation Learning [49.881386773269746]
We derive a novel policy gradient-style robust optimization approach, PG-BROIL, to balance expected performance and risk.
Results suggest PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse.
arXiv Detail & Related papers (2021-06-11T16:49:15Z) - Navigating to the Best Policy in Markov Decision Processes [68.8204255655161]
We investigate the active pure exploration problem in Markov Decision Processes.
The agent sequentially selects actions and, from the resulting system trajectory, aims at identifying the best policy as fast as possible.
arXiv Detail & Related papers (2021-06-05T09:16:28Z) - Optimal Algorithms for Stochastic Multi-Armed Bandits with Heavy Tailed
Rewards [24.983866845065926]
We consider multi-armed bandits with heavy-tailed rewards, whose $p$-th moment is bounded by a constant $\nu_p$ for $1 < p \leq 2$.
We propose a novel robust estimator which does not require $\nu_p$ as prior information.
We show that the error probability of the proposed estimator decays exponentially fast.
arXiv Detail & Related papers (2020-10-24T10:44:02Z) - Upper Confidence Primal-Dual Reinforcement Learning for CMDP with
Adversarial Loss [145.54544979467872]
We consider online learning for episodic constrained Markov decision processes (CMDPs).
We propose a new upper confidence primal-dual algorithm, which only requires the trajectories sampled from the transition model.
Our analysis incorporates a new high-probability drift analysis of Lagrange multiplier processes into the celebrated regret analysis of upper confidence reinforcement learning.
arXiv Detail & Related papers (2020-03-02T05:02:23Z) - A General Theory of the Stochastic Linear Bandit and Its Applications [8.071506311915398]
We introduce a general analysis framework and a family of algorithms for the linear bandit problem.
Our new notion of optimism in expectation gives rise to a new algorithm, called sieved greedy (SG) that reduces the overexploration problem in OFUL.
In addition to proving that SG is theoretically rate optimal, our empirical simulations show that SG outperforms existing benchmarks such as greedy, OFUL, and TS.
arXiv Detail & Related papers (2020-02-12T18:54:41Z)
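As a complement to the MDP sketch above, the "strikingly simple index policies" mentioned in the abstract, and the bias term described in the NeuralRBMLE entry, can be illustrated on Bernoulli bandits: each arm's index is its log-likelihood, maximized over the arm's mean parameter after adding a reward bias $\alpha(t)\,p$. This is a simplified, illustrative sketch; the published index formulas and bias schedules may differ.

```python
# Simplified reward-biased MLE index for Bernoulli bandits, in the spirit of
# the RBMLE bandit papers cited above (the published index and bias schedule
# may differ). Arm i's index is
#   max_p [ n_i * (mu_i*log p + (1 - mu_i)*log(1 - p)) + alpha(t) * p ],
# i.e. the biased log-likelihood maximized over the arm's mean parameter p.
import numpy as np

def rbmle_index(n, mu, alpha):
    """Biased-MLE index of one arm, computed here by a simple grid search."""
    p = np.linspace(1e-4, 1.0 - 1e-4, 999)
    loglik = n * (mu * np.log(p) + (1.0 - mu) * np.log(1.0 - p))
    return float(np.max(loglik + alpha * p))

def rbmle_bernoulli_bandit(pull, n_arms, horizon):
    """Play the arm with the largest reward-biased likelihood index.

    `pull(i)` is a caller-supplied function returning a {0, 1} reward."""
    counts = np.zeros(n_arms)
    means = np.zeros(n_arms)
    for t in range(1, horizon + 1):
        alpha = np.log(t + 1.0)       # assumed bias schedule for this sketch
        idx = [rbmle_index(counts[i], means[i], alpha) for i in range(n_arms)]
        arm = int(np.argmax(idx))
        r = pull(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]
    return means, counts
```

With no data ($n_i = 0$) the likelihood term vanishes and the bias alone makes every arm look maximally attractive, so untried arms are pulled first; as $n_i$ grows, the index concentrates around the empirical mean. This mirrors the exploration vs. exploitation balance that RBMLE encodes through a single biased estimation step.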
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content above (including all information) and is not responsible for any consequences of its use.