Bad Values but Good Behavior: Learning Highly Misspecified Bandits and MDPs
- URL: http://arxiv.org/abs/2310.09358v2
- Date: Thu, 22 Feb 2024 13:43:06 GMT
- Title: Bad Values but Good Behavior: Learning Highly Misspecified Bandits and MDPs
- Authors: Debangshu Banerjee and Aditya Gopalan
- Abstract summary: Parametric, feature-based reward models are employed by a variety of algorithms in decision-making settings such as bandits and Markov decision processes (MDPs).
We show that basic algorithms such as $\epsilon$-greedy, LinUCB and fitted Q-learning provably learn optimal policies under even highly misspecified models.
This is in contrast to existing worst-case results for, say, misspecified bandits, which show regret bounds that scale linearly with time, and shows that there can be a nontrivially large set of bandit instances that are robust to misspecification.
- Score: 16.777565006843012
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Parametric, feature-based reward models are employed by a variety of
algorithms in decision-making settings such as bandits and Markov decision
processes (MDPs). The typical assumption under which the algorithms are
analysed is realizability, i.e., that the true values of actions are perfectly
explained by some parametric model in the class. We are, however, interested in
the situation where the true values are (significantly) misspecified with
respect to the model class. For parameterized bandits, contextual bandits and
MDPs, we identify structural conditions, depending on the problem instance and
model class, under which basic algorithms such as $\epsilon$-greedy, LinUCB and
fitted Q-learning provably learn optimal policies under even highly
misspecified models. This is in contrast to existing worst-case results for,
say, misspecified bandits, which show regret bounds that scale linearly with
time, and shows that there can be a nontrivially large set of bandit instances
that are robust to misspecification.
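As a toy illustration of the "bad values but good behavior" phenomenon described above (not the paper's construction, and ignoring its structural conditions), the following sketch runs $\epsilon$-greedy with a one-parameter linear reward model on a three-armed bandit whose true means are far from any linear fit; the arm features, mean rewards, and noise level are invented for the example. The fitted values of the suboptimal arms stay off by more than their gaps, yet the greedy arm under the fitted model coincides with the truly optimal arm.

```python
# Toy example (hypothetical numbers): epsilon-greedy with a misspecified linear model.
# The model class {theta * phi : theta in R} cannot match the true means, so the
# estimated values remain wrong, but the argmax of the fit is still the optimal arm.
import numpy as np

rng = np.random.default_rng(0)

phi = np.array([0.2, 0.5, 1.0])         # arm features (1-d for simplicity)
true_mu = np.array([0.10, 0.15, 0.90])  # true means: not theta * phi for any theta

T, eps = 5000, 0.1
counts, sums = np.zeros(3), np.zeros(3)
theta_hat = 0.0

for t in range(T):
    fitted = theta_hat * phi                        # current value estimates
    arm = rng.integers(3) if rng.random() < eps else int(np.argmax(fitted))
    counts[arm] += 1
    sums[arm] += true_mu[arm] + 0.1 * rng.normal()  # noisy reward

    # Least-squares fit of all observed rewards on the chosen arms' features.
    if np.dot(counts * phi, phi) > 0:
        theta_hat = np.dot(phi, sums) / np.dot(counts * phi, phi)

print("fitted values:", np.round(theta_hat * phi, 3))
print("true values  :", true_mu)
print("greedy arm   :", int(np.argmax(theta_hat * phi)),
      "| optimal arm:", int(np.argmax(true_mu)))
```

Here the best linear fit is off by roughly 0.2 in sup norm, larger than the 0.05 gap between the two suboptimal arms, so the value estimates cannot be trusted; the greedy decision, however, only needs the argmax of the fit to be preserved.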
Related papers
- Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z) - Likelihood Ratio Confidence Sets for Sequential Decision Making [51.66638486226482]
We revisit the likelihood-based inference principle and propose to use likelihood ratios to construct valid confidence sequences.
Our method is especially suitable for problems with well-specified likelihoods.
We show how to provably choose the best sequence of estimators and shed light on connections to online convex optimization.
arXiv Detail & Related papers (2023-11-08T00:10:21Z) - Online Clustering of Bandits with Misspecified User Models [42.56440072468658]
The contextual linear bandit is an online learning problem in which, given arm features, a learning agent selects an arm at each round to maximize the cumulative reward in the long run.
A line of work, called clustering of bandits (CB), utilizes the collaborative effect over user preferences and has shown significant improvements over classic linear bandit algorithms.
In this paper, we are the first to present the important problem of clustering of bandits with misspecified user models (CBMUM).
We devise two robust CB algorithms, RCLUMB and RSCLUMB, that can accommodate the inaccurate user preference estimations and erroneous clustering caused by model misspecifications.
arXiv Detail & Related papers (2023-10-04T10:40:50Z) - Oracle Inequalities for Model Selection in Offline Reinforcement Learning [105.74139523696284]
We study the problem of model selection in offline RL with value function approximation.
We propose the first model selection algorithm for offline RL that achieves minimax rate-optimal oracle inequalities up to logarithmic factors.
We conclude with several numerical simulations showing it is capable of reliably selecting a good model class.
arXiv Detail & Related papers (2022-11-03T17:32:34Z) - Sample Complexity of Robust Reinforcement Learning with a Generative Model [0.0]
We propose a model-based reinforcement learning (RL) algorithm for learning an $\epsilon$-optimal robust policy.
We consider three different forms of uncertainty sets, characterized by the total variation distance, chi-square divergence, and KL divergence (see the total-variation sketch after this list).
In addition to the sample complexity results, we also present a formal analytical argument on the benefit of using robust policies.
arXiv Detail & Related papers (2021-12-02T18:55:51Z) - Adversarial Robustness Verification and Attack Synthesis in Stochastic Systems [8.833548357664606]
We develop a formal framework for adversarial robustness in systems defined as discrete-time Markov chains (DTMCs).
We outline a class of threat models under which adversaries can perturb system transitions, constrained by an $\varepsilon$-ball around the original transition probabilities (see the perturbation sketch after this list).
arXiv Detail & Related papers (2021-10-05T15:52:47Z) - Model Selection for Generic Contextual Bandits [20.207989166682832]
We propose a refinement-based algorithm called Adaptive Contextual Bandit (ACB).
We prove that this algorithm is adaptive, i.e., the regret rate order-wise matches that of any provable contextual bandit algorithm.
We also show that a much simpler explore-then-commit (ETC) style algorithm obtains a similar regret bound, despite not knowing the true model class.
arXiv Detail & Related papers (2021-07-07T19:35:31Z) - Instance-optimality in optimal value estimation: Adaptivity via
variance-reduced Q-learning [99.34907092347733]
We analyze the problem of estimating optimal $Q$-value functions for a discounted Markov decision process with discrete states and actions.
Using a local minimax framework, we show that this functional arises in lower bounds on the accuracy of any estimation procedure.
In the other direction, we establish the sharpness of our lower bounds, up to factors logarithmic in the state and action spaces, by analyzing a variance-reduced version of $Q$-learning.
arXiv Detail & Related papers (2021-06-28T00:38:54Z) - Towards Costless Model Selection in Contextual Bandits: A Bias-Variance
Perspective [7.318831153179727]
We study the feasibility of similar guarantees for cumulative regret minimization in the contextual bandit setting.
Our algorithm is based on a novel misspecification test, and our analysis demonstrates the benefits of using model selection for reward estimation.
arXiv Detail & Related papers (2021-06-11T16:08:03Z) - Characterizing Fairness Over the Set of Good Models Under Selective
Labels [69.64662540443162]
We develop a framework for characterizing predictive fairness properties over the set of models that deliver similar overall performance.
We provide tractable algorithms to compute the range of attainable group-level predictive disparities.
We extend our framework to address the empirically relevant challenge of selectively labelled data.
arXiv Detail & Related papers (2021-01-02T02:11:37Z) - Offline Contextual Bandits with Overparameterized Models [52.788628474552276]
We ask whether the benign generalization of overparameterized models observed in supervised learning also occurs for offline contextual bandits.
We show that this discrepancy is due to the action-stability of their objectives.
In experiments with large neural networks, this gap between action-stable value-based objectives and unstable policy-based objectives leads to significant performance differences.
arXiv Detail & Related papers (2020-06-27T13:52:07Z)
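The "Sample Complexity of Robust Reinforcement Learning with a Generative Model" entry above mentions uncertainty sets defined by the total variation distance, chi-square divergence, and KL divergence. As a minimal, generic sketch of the total-variation case only (not that paper's algorithm; the distribution, value vector, and radius are made up), the worst-case expectation over a TV ball can be computed in closed form by shifting probability mass from the highest-value next states to the lowest-value one:

```python
# Hypothetical illustration: pessimistic expectation of a value vector v over all
# next-state distributions q within total-variation distance eps of a nominal p.
import numpy as np

def robust_backup_tv(p, v, eps):
    """min_q q.v  subject to  0.5 * ||q - p||_1 <= eps  and q a probability vector."""
    p = np.asarray(p, dtype=float).copy()
    v = np.asarray(v, dtype=float)
    lo = int(np.argmin(v))            # all shifted mass lands on the lowest-value state
    moved = min(eps, 1.0 - p[lo])     # total mass the adversary is allowed to shift
    budget = moved
    for i in np.argsort(-v):          # drain the highest-value states first
        if budget <= 0:
            break
        if i == lo:
            continue
        take = min(p[i], budget)
        p[i] -= take
        budget -= take
    p[lo] += moved
    return float(p @ v)

p = np.array([0.5, 0.3, 0.2])   # nominal next-state distribution
v = np.array([1.0, 0.0, 2.0])   # value of each next state
print("nominal:", p @ v, "| robust:", robust_backup_tv(p, v, eps=0.1))
```

Inside robust value iteration this backup would replace the plain expectation in the Bellman update; the chi-square and KL sets admit analogous dual forms.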
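Likewise, the "Adversarial Robustness Verification and Attack Synthesis in Stochastic Systems" entry describes adversaries that perturb DTMC transition probabilities within an $\varepsilon$-ball. The sketch below is only a brute-force illustration of that threat model under an assumed per-row total-variation ball, not that paper's formal framework; the 3-state chain and the choice of "bad state" are invented.

```python
# Hypothetical illustration: randomly perturb each row of a DTMC transition matrix
# within a total-variation eps-ball and watch how a quantity of interest can move.
import numpy as np

def perturb_row(p, eps, rng):
    """Return q with 0.5 * ||q - p||_1 <= eps, q >= 0, and the same row sum as p."""
    d = rng.normal(size=p.shape)
    d -= d.mean()                            # zero-sum direction keeps the row stochastic
    step_tv = eps / (0.5 * np.abs(d).sum())  # largest step that stays in the eps-ball
    neg = d < 0
    step_pos = np.min(p[neg] / -d[neg]) if neg.any() else np.inf  # keep entries nonnegative
    return p + rng.uniform(0.0, min(step_tv, step_pos)) * d

def stationary(P):
    """Stationary distribution of an ergodic chain (left eigenvector for eigenvalue 1)."""
    w, V = np.linalg.eig(P.T)
    pi = np.real(V[:, np.argmax(np.real(w))])
    return pi / pi.sum()

P = np.array([[0.90, 0.10, 0.00],   # made-up 3-state chain; state 2 is the "bad" state
              [0.10, 0.80, 0.10],
              [0.00, 0.20, 0.80]])
rng = np.random.default_rng(1)

worst = stationary(P)[2]
for _ in range(2000):               # crude random search over the threat model
    Pp = np.vstack([perturb_row(row, eps=0.05, rng=rng) for row in P])
    worst = max(worst, stationary(Pp)[2])

print("nominal long-run mass in bad state:", round(float(stationary(P)[2]), 3),
      "| worst sampled:", round(float(worst), 3))
```

A formal analysis would optimize over the ball exactly rather than sample it; the random search merely shows the metric drifting under small admissible perturbations.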