Thompson Sampling for Infinite-Horizon Discounted Decision Processes
        - URL: http://arxiv.org/abs/2405.08253v2
 - Date: Fri, 17 May 2024 05:58:07 GMT
 - Title: Thompson Sampling for Infinite-Horizon Discounted Decision Processes
 - Authors: Daniel Adelman, Cagla Keceli, Alba V. Olivares-Nadal, 
 - Abstract summary: We study the behavior of a sampling-based algorithm, called Thompson sampling.
By decomposing the standard (expected) regret, we develop a new metric, called the expected residual regret.
 - Score: 0.0
 - License: http://creativecommons.org/licenses/by/4.0/
 - Abstract:   We model a Markov decision process, parametrized by an unknown parameter, and study the asymptotic behavior of a sampling-based algorithm, called Thompson sampling. The standard definition of regret is not always suitable to evaluate a policy, especially when the underlying chain structure is general. We show that the standard (expected) regret can grow (super-)linearly and fails to capture the notion of learning in realistic settings with non-trivial state evolution. By decomposing the standard (expected) regret, we develop a new metric, called the expected residual regret, which forgets the immutable consequences of past actions. Instead, it measures regret against the optimal reward moving forward from the current period. We show that the expected residual regret of the Thompson sampling algorithm is upper bounded by a term which converges exponentially fast to 0. We present conditions under which the posterior sampling error of Thompson sampling converges to 0 almost surely. We then introduce the probabilistic version of the expected residual regret and present conditions under which it converges to 0 almost surely. Thus, we provide a viable concept of learning for sampling algorithms which will prove useful in broader settings than those considered previously. 
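To make the algorithmic object concrete, below is a minimal sketch of the posterior-sampling loop the abstract refers to: at each period the agent draws a parameter from its current posterior, solves the resulting discounted MDP, acts greedily under the sampled model, and updates the posterior from the observed transition. The two-state example MDP, the finite parameter grid, the uniform prior, and the reward table are illustrative assumptions, not the paper's model.

```python
# Minimal sketch of Thompson sampling for a discounted MDP whose transition
# kernel depends on an unknown parameter theta from a finite candidate set.
# The MDP, prior, and rewards below are illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(0)
GAMMA = 0.9                       # discount factor
THETAS = [0.3, 0.5, 0.7]          # candidate values of the unknown parameter

def transition_matrix(theta):
    """P[a, s, s']: action 0 is a 'safe' action, action 1 reaches state 1 w.p. theta."""
    return np.array([
        [[0.9, 0.1], [0.9, 0.1]],                      # action 0
        [[1.0 - theta, theta], [1.0 - theta, theta]],  # action 1
    ])

REWARD = np.array([[0.1, 0.1], [0.0, 1.0]])            # REWARD[a, s]

def greedy_policy(P, r, iters=500):
    """Value iteration for the discounted MDP (P, r); returns the greedy policy."""
    V = np.zeros(P.shape[1])
    for _ in range(iters):
        Q = r + GAMMA * (P @ V)                        # Q[a, s]
        V = Q.max(axis=0)
    return Q.argmax(axis=0)

posterior = np.ones(len(THETAS)) / len(THETAS)         # uniform prior over THETAS
state, true_theta = 0, 0.7

for t in range(200):
    theta_hat = rng.choice(THETAS, p=posterior)        # Thompson step: sample a model
    action = greedy_policy(transition_matrix(theta_hat), REWARD)[state]
    next_state = rng.choice(2, p=transition_matrix(true_theta)[action, state])
    # Bayes update: reweight each candidate theta by the observed transition's likelihood.
    likelihood = np.array([transition_matrix(th)[action, state, next_state] for th in THETAS])
    posterior = posterior * likelihood
    posterior = posterior / posterior.sum()
    state = next_state

print("posterior over theta:", np.round(posterior, 3))
```

The residual-regret viewpoint then scores each period by comparing the reward collected from the current state onward against the optimal continuation from that same state, rather than against the trajectory an omniscient policy would have generated from the start.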
 
       
      
        Related papers
        - Decision from Suboptimal Classifiers: Excess Risk Pre- and Post-Calibration [52.70324949884702]
We quantify the excess risk incurred using approximate posterior probabilities in batch binary decision-making.
We identify regimes where recalibration alone addresses most of the regret, and regimes where the regret is dominated by the grouping loss.
In NLP experiments, we show that these quantities identify when the expected gain of more advanced post-training is worth the operational cost.
arXiv  Detail & Related papers  (2025-03-23T10:52:36Z) - An Adversarial Analysis of Thompson Sampling for Full-information Online Learning: from Finite to Infinite Action Spaces [54.37047702755926]
We develop an analysis of Thompson sampling for online learning under full feedback.
We show that regret decomposes into the regret the learner expected a priori, plus a prior-robustness-type term we call excess regret.
arXiv  Detail & Related papers  (2025-02-20T18:10:12Z) - Contextual Thompson Sampling via Generation of Missing Data [11.713451719120707]
We introduce a framework for Thompson sampling-based contextual bandit algorithms.
Our algorithm treats uncertainty as stemming from missing, but potentially observable, future outcomes.
Inspired by this conceptualization, at each decision time our algorithm uses a generative model to impute the missing future outcomes; a minimal imputation-based sketch appears after this list.
arXiv  Detail & Related papers  (2025-02-10T21:57:00Z) - Approximate Thompson Sampling for Learning Linear Quadratic Regulators with $O(\sqrt{T})$ Regret [10.541541376305243]
We propose an approximate Thompson sampling algorithm that learns linear quadratic regulators (LQR) with an improved Bayesian regret bound of $O(\sqrt{T})$.
We show that the excitation signal induces the minimum eigenvalue of the preconditioner to grow over time, thereby accelerating the approximate posterior sampling process.
arXiv  Detail & Related papers  (2024-05-29T03:24:56Z) - The Sliding Regret in Stochastic Bandits: Discriminating Index and Randomized Policies [0.8158530638728501]
We study the one-shot behavior of no-regret algorithms for bandits.
We introduce a new notion, the sliding regret, which measures the worst pseudo-regret over a fixed-length time window sliding to infinity.
arXiv  Detail & Related papers  (2023-11-30T10:37:03Z) - Model-Based Uncertainty in Value Functions [89.31922008981735]
We focus on characterizing the variance over values induced by a distribution over MDPs.
Previous work upper bounds the posterior variance over values by solving a so-called uncertainty Bellman equation.
We propose a new uncertainty Bellman equation whose solution converges to the true posterior variance over values.
arXiv  Detail & Related papers  (2023-02-24T09:18:27Z) - Thompson Sampling for High-Dimensional Sparse Linear Contextual Bandits [17.11922027966447]
This work provides theoretical guarantees of Thompson sampling in high dimensional and sparse contextual bandits.
For faster computation, we use a spike-and-slab prior to model the unknown parameter and variational inference instead of MCMC.
arXiv  Detail & Related papers  (2022-11-11T02:23:39Z) - Finite-Time Regret of Thompson Sampling Algorithms for Exponential Family Multi-Armed Bandits [88.21288104408556]
We study the regret of Thompson sampling (TS) algorithms for exponential family bandits, where the reward distribution is from a one-dimensional exponential family.
We propose a Thompson sampling algorithm, termed Expulli, which uses a novel sampling distribution to avoid under-estimating the optimal arm.
arXiv  Detail & Related papers  (2022-06-07T18:08:21Z) - A Unifying Theory of Thompson Sampling for Continuous Risk-Averse Bandits [91.3755431537592]
This paper unifies the analysis of risk-averse Thompson sampling algorithms for the multi-armed bandit problem.
Using the contraction principle in the theory of large deviations, we prove novel concentration bounds for continuous risk functionals.
We show that a wide class of risk functionals as well as "nice" functions of them satisfy the continuity condition.
arXiv  Detail & Related papers  (2021-08-25T17:09:01Z) - Asymptotic Convergence of Thompson Sampling [0.0]
Thompson sampling has been shown to be an effective policy across a variety of online learning tasks.
We prove a convergence result for Thompson sampling under the assumption of sub-linear Bayesian regret.
Our results rely on the martingale structure inherent in Thompson sampling.
arXiv  Detail & Related papers  (2020-11-08T07:36:49Z) - Analysis and Design of Thompson Sampling for Stochastic Partial Monitoring [91.22679787578438]
We present a novel Thompson-sampling-based algorithm for partial monitoring.
We prove that the new algorithm achieves the logarithmic problem-dependent expected pseudo-regret $\mathrm{O}(\log T)$ for a linearized variant of the problem with local observability.
arXiv  Detail & Related papers  (2020-06-17T05:48:33Z) - MOTS: Minimax Optimal Thompson Sampling [89.2370817955411]
It has remained an open problem whether Thompson sampling can match the minimax lower bound $\Omega(\sqrt{KT})$ for $K$-armed bandit problems.
We propose a variant of Thompson sampling called MOTS that adaptively clips the sampling instance of the chosen arm at each time step.
We prove that this simple variant of Thompson sampling achieves the minimax optimal regret bound $O(\sqrt{KT})$ for finite time horizon $T$, as well as the optimal regret bound for Gaussian rewards when $T$ approaches infinity; a clipped-sampling sketch in this spirit appears after this list.
arXiv  Detail & Related papers  (2020-03-03T21:24:39Z) - On Thompson Sampling with Langevin Algorithms [106.78254564840844]
Thompson sampling for multi-armed bandit problems enjoys favorable performance in both theory and practice.
Computationally, however, it suffers from a significant limitation: the need to draw samples from posterior distributions at every iteration.
We propose two Markov Chain Monte Carlo (MCMC) methods tailored to Thompson sampling to address this issue; a minimal Langevin-style sketch appears after this list.
arXiv  Detail & Related papers  (2020-02-23T22:35:29Z) 
        This list is automatically generated from the titles and abstracts of the papers in this site.
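For the related entry on contextual Thompson sampling via generation of missing data, the following is a minimal sketch of the imputation idea in the simplest non-contextual setting: at each decision time a generative model (here just the Beta-Bernoulli posterior predictive, an assumption made for brevity) imputes not-yet-observed future outcomes, and the algorithm acts greedily on the observed-plus-imputed data. The cited paper works with contextual bandits and a learned generative model; everything below is only an illustration of the mechanism.

```python
# Illustrative sketch: Thompson-sampling-style decisions via imputation of
# missing future outcomes, for a plain Bernoulli bandit (assumed setting).
import numpy as np

rng = np.random.default_rng(1)
K, HORIZON, M_FUTURE = 3, 500, 50        # arms, rounds, imputed future pulls per arm
true_means = np.array([0.3, 0.5, 0.6])
successes = np.ones(K)                   # Beta(1, 1) prior counts
failures = np.ones(K)

for t in range(HORIZON):
    scores = np.empty(K)
    for k in range(K):
        # Generative step: impute M_FUTURE unobserved outcomes for arm k from
        # its posterior predictive (sample a success rate, then Bernoulli draws).
        p_sample = rng.beta(successes[k], failures[k])
        imputed = rng.binomial(1, p_sample, size=M_FUTURE)
        # Act greedily with respect to observed + imputed data.
        n_obs = successes[k] + failures[k] - 2
        scores[k] = (successes[k] - 1 + imputed.sum()) / (n_obs + M_FUTURE)
    arm = int(scores.argmax())
    reward = rng.binomial(1, true_means[arm])
    successes[arm] += reward
    failures[arm] += 1 - reward

print("pull counts:", (successes + failures - 2).astype(int))
```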
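For the MOTS entry, here is a sketch of Thompson sampling with clipped samples for a Gaussian $K$-armed bandit: each arm's posterior-style sample is clipped from above at a UCB-style threshold before the argmax is taken. The threshold form and the constant 4 below are stand-ins chosen for illustration; MOTS's exact adaptive clipping rule is specified in the paper.

```python
# Illustrative sketch of clipped Gaussian Thompson sampling (MOTS-style idea).
import numpy as np

rng = np.random.default_rng(2)
K, T = 5, 2000
true_means = rng.normal(0.0, 1.0, size=K)
pulls = np.zeros(K)
sums = np.zeros(K)

# Warm start: pull each arm once so empirical means are defined.
for k in range(K):
    pulls[k] += 1
    sums[k] += rng.normal(true_means[k], 1.0)

for t in range(K, T):
    means = sums / pulls
    # Gaussian posterior-style samples whose variance shrinks with the pull count.
    samples = rng.normal(means, 1.0 / np.sqrt(pulls))
    # Clip each sample at a UCB-style threshold (hypothetical choice of constants).
    thresholds = means + np.sqrt((4.0 / pulls) * np.log(np.maximum(T / (K * pulls), 1.0)))
    clipped = np.minimum(samples, thresholds)
    arm = int(clipped.argmax())
    pulls[arm] += 1
    sums[arm] += rng.normal(true_means[arm], 1.0)

print("estimated means:", np.round(sums / pulls, 2))
```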
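For the Langevin entry, this is a minimal sketch of MCMC-within-Thompson-sampling: instead of sampling the posterior exactly, each round runs a few unadjusted Langevin steps on the log posterior and uses the result as the posterior draw. A Gaussian $K$-armed bandit with $N(0, 1)$ priors is assumed here purely so the gradient is simple; the cited paper handles more general posteriors and gives tuned step sizes.

```python
# Illustrative sketch of Langevin-based approximate posterior sampling inside
# Thompson sampling, for a Gaussian K-armed bandit (assumed setting).
import numpy as np

rng = np.random.default_rng(3)
K, T = 4, 1000
true_means = np.array([0.2, 0.0, 0.4, 0.1])
counts = np.zeros(K)
reward_sums = np.zeros(K)

def langevin_sample(counts, reward_sums, steps=50):
    """Unadjusted Langevin steps targeting the Gaussian posterior over arm means."""
    theta = np.zeros(K)                    # start from the prior mean
    eta = 1.0 / (1.0 + counts)             # per-arm step size, shrinking with data
    for _ in range(steps):
        # grad of log posterior: prior pull -theta plus likelihood term sum_i (r_i - theta_k)
        grad = -theta + (reward_sums - counts * theta)
        theta = theta + 0.5 * eta * grad + np.sqrt(eta) * rng.normal(size=K)
    return theta

for t in range(T):
    theta = langevin_sample(counts, reward_sums)   # approximate posterior draw
    arm = int(theta.argmax())
    reward = rng.normal(true_means[arm], 1.0)
    counts[arm] += 1
    reward_sums[arm] += reward

print("pulls per arm:", counts.astype(int))
```

The per-arm step size shrinks with the data count so the Langevin iteration stays stable as the posterior concentrates.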
       
     