Bad Values but Good Behavior: Learning Highly Misspecified Bandits and MDPs
- URL: http://arxiv.org/abs/2310.09358v2
- Date: Thu, 22 Feb 2024 13:43:06 GMT
- Title: Bad Values but Good Behavior: Learning Highly Misspecified Bandits and MDPs
- Authors: Debangshu Banerjee and Aditya Gopalan
- Abstract summary: Parametric, feature-based reward models are employed by a variety of algorithms in decision-making settings such as bandits and Markov decision processes (MDPs).
We show that basic algorithms such as $\epsilon$-greedy, LinUCB and fitted Q-learning provably learn optimal policies under even highly misspecified models.
This is in contrast to existing worst-case results for, say, misspecified bandits, which show regret bounds that scale linearly with time, and shows that there can be a nontrivially large set of bandit instances that are robust to misspecification.
- Score: 16.777565006843012
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Parametric, feature-based reward models are employed by a variety of
algorithms in decision-making settings such as bandits and Markov decision
processes (MDPs). The typical assumption under which the algorithms are
analysed is realizability, i.e., that the true values of actions are perfectly
explained by some parametric model in the class. We are, however, interested in
the situation where the true values are (significantly) misspecified with
respect to the model class. For parameterized bandits, contextual bandits and
MDPs, we identify structural conditions, depending on the problem instance and
model class, under which basic algorithms such as $\epsilon$-greedy, LinUCB and
fitted Q-learning provably learn optimal policies under even highly
misspecified models. This is in contrast to existing worst-case results for,
say, misspecified bandits, which show regret bounds that scale linearly with
time, and shows that there can be a nontrivially large set of bandit instances
that are robust to misspecification.
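As a toy illustration of the "bad values but good behavior" phenomenon described above (not the paper's construction, and ignoring its structural conditions), the following sketch runs $\epsilon$-greedy with a one-parameter linear reward model on a three-armed bandit whose true means are far from any linear fit; the arm features, mean rewards, and noise level are invented for the example. The fitted values of the suboptimal arms stay off by more than their gaps, yet the greedy arm under the fitted model coincides with the truly optimal arm.

```python
# Toy example (hypothetical numbers): epsilon-greedy with a misspecified linear model.
# The model class {theta * phi : theta in R} cannot match the true means, so the
# estimated values remain wrong, but the argmax of the fit is still the optimal arm.
import numpy as np

rng = np.random.default_rng(0)

phi = np.array([0.2, 0.5, 1.0])         # arm features (1-d for simplicity)
true_mu = np.array([0.10, 0.15, 0.90])  # true means: not theta * phi for any theta

T, eps = 5000, 0.1
counts, sums = np.zeros(3), np.zeros(3)
theta_hat = 0.0

for t in range(T):
    fitted = theta_hat * phi                        # current value estimates
    arm = rng.integers(3) if rng.random() < eps else int(np.argmax(fitted))
    counts[arm] += 1
    sums[arm] += true_mu[arm] + 0.1 * rng.normal()  # noisy reward

    # Least-squares fit of all observed rewards on the chosen arms' features.
    if np.dot(counts * phi, phi) > 0:
        theta_hat = np.dot(phi, sums) / np.dot(counts * phi, phi)

print("fitted values:", np.round(theta_hat * phi, 3))
print("true values  :", true_mu)
print("greedy arm   :", int(np.argmax(theta_hat * phi)),
      "| optimal arm:", int(np.argmax(true_mu)))
```

Here the best linear fit is off by roughly 0.2 in sup norm, larger than the 0.05 gap between the two suboptimal arms, so the value estimates cannot be trusted; the greedy decision, however, only needs the argmax of the fit to be preserved.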
Related papers
- Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z) - Likelihood Ratio Confidence Sets for Sequential Decision Making [51.66638486226482]
We revisit the likelihood-based inference principle and propose to use likelihood ratios to construct valid confidence sequences.
Our method is especially suitable for problems with well-specified likelihoods.
We show how to provably choose the best sequence of estimators and shed light on connections to online convex optimization.
arXiv Detail & Related papers (2023-11-08T00:10:21Z) - Online Clustering of Bandits with Misspecified User Models [42.56440072468658]
The contextual linear bandit is an online learning problem in which, given arm features, a learning agent selects an arm at each round to maximize the cumulative reward in the long run.
A line of work, called clustering of bandits (CB), utilizes the collaborative effect over user preferences and has shown significant improvements over classic linear bandit algorithms.
In this paper, we are the first to present the important problem of clustering of bandits with misspecified user models (CBMUM).
We devise two robust CB algorithms, RCLUMB and RSCLUMB, that can accommodate the inaccurate user preference estimations and erroneous clustering caused by model misspecifications.
arXiv Detail & Related papers (2023-10-04T10:40:50Z) - Oracle Inequalities for Model Selection in Offline Reinforcement Learning [105.74139523696284]
We study the problem of model selection in offline RL with value function approximation.
We propose the first model selection algorithm for offline RL that achieves minimax rate-optimal oracle inequalities up to logarithmic factors.
We conclude with several numerical simulations showing it is capable of reliably selecting a good model class.
arXiv Detail & Related papers (2022-11-03T17:32:34Z) - Sample Complexity of Robust Reinforcement Learning with a Generative Model [0.0]
We propose a model-based reinforcement learning (RL) algorithm for learning an $\epsilon$-optimal robust policy.
We consider three different forms of uncertainty sets, characterized by the total variation distance, chi-square divergence, and KL divergence (see the total-variation sketch after this list).
In addition to the sample complexity results, we also present a formal analytical argument on the benefit of using robust policies.
arXiv Detail & Related papers (2021-12-02T18:55:51Z) - Adversarial Robustness Verification and Attack Synthesis in Stochastic Systems [8.833548357664606]
We develop a formal framework for adversarial robustness in systems defined as discrete-time Markov chains (DTMCs).
We outline a class of threat models under which adversaries can perturb system transitions, constrained by an $\varepsilon$-ball around the original transition probabilities (see the perturbation sketch after this list).
arXiv Detail & Related papers (2021-10-05T15:52:47Z) - Model Selection for Generic Contextual Bandits [20.207989166682832]
We propose a refinement-based algorithm called Adaptive Contextual Bandit (ACB).
We prove that this algorithm is adaptive, i.e., the regret rate order-wise matches that of any provable contextual bandit algorithm.
We also show that a much simpler explore-then-commit (ETC) style algorithm obtains a similar regret bound, despite not knowing the true model class.
arXiv Detail & Related papers (2021-07-07T19:35:31Z) - Instance-optimality in optimal value estimation: Adaptivity via
variance-reduced Q-learning [99.34907092347733]
We analyze the problem of estimating optimal $Q$-value functions for a discounted Markov decision process with discrete states and actions.
Using a local minimax framework, we show that this functional arises in lower bounds on the accuracy of any estimation procedure.
In the other direction, we establish the sharpness of our lower bounds, up to factors logarithmic in the state and action spaces, by analyzing a variance-reduced version of $Q$-learning.
arXiv Detail & Related papers (2021-06-28T00:38:54Z) - Towards Costless Model Selection in Contextual Bandits: A Bias-Variance
Perspective [7.318831153179727]
We study the feasibility of similar guarantees for cumulative regret minimization in the contextual bandit setting.
Our algorithm is based on a novel misspecification test, and our analysis demonstrates the benefits of using model selection for reward estimation.
arXiv Detail & Related papers (2021-06-11T16:08:03Z) - Characterizing Fairness Over the Set of Good Models Under Selective
Labels [69.64662540443162]
We develop a framework for characterizing predictive fairness properties over the set of models that deliver similar overall performance.
We provide tractable algorithms to compute the range of attainable group-level predictive disparities.
We extend our framework to address the empirically relevant challenge of selectively labelled data.
arXiv Detail & Related papers (2021-01-02T02:11:37Z) - Offline Contextual Bandits with Overparameterized Models [52.788628474552276]
We ask whether the benign generalization of overparameterized models observed in supervised learning also occurs for offline contextual bandits.
We show that this discrepancy is due to the action-stability of their objectives.
In experiments with large neural networks, this gap between action-stable value-based objectives and unstable policy-based objectives leads to significant performance differences.
arXiv Detail & Related papers (2020-06-27T13:52:07Z)
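The "Sample Complexity of Robust Reinforcement Learning with a Generative Model" entry above mentions uncertainty sets defined by the total variation distance, chi-square divergence, and KL divergence. As a minimal, generic sketch of the total-variation case only (not that paper's algorithm; the distribution, value vector, and radius are made up), the worst-case expectation over a TV ball can be computed in closed form by shifting probability mass from the highest-value next states to the lowest-value one:

```python
# Hypothetical illustration: pessimistic expectation of a value vector v over all
# next-state distributions q within total-variation distance eps of a nominal p.
import numpy as np

def robust_backup_tv(p, v, eps):
    """min_q q.v  subject to  0.5 * ||q - p||_1 <= eps  and q a probability vector."""
    p = np.asarray(p, dtype=float).copy()
    v = np.asarray(v, dtype=float)
    lo = int(np.argmin(v))            # all shifted mass lands on the lowest-value state
    moved = min(eps, 1.0 - p[lo])     # total mass the adversary is allowed to shift
    budget = moved
    for i in np.argsort(-v):          # drain the highest-value states first
        if budget <= 0:
            break
        if i == lo:
            continue
        take = min(p[i], budget)
        p[i] -= take
        budget -= take
    p[lo] += moved
    return float(p @ v)

p = np.array([0.5, 0.3, 0.2])   # nominal next-state distribution
v = np.array([1.0, 0.0, 2.0])   # value of each next state
print("nominal:", p @ v, "| robust:", robust_backup_tv(p, v, eps=0.1))
```

Inside robust value iteration this backup would replace the plain expectation in the Bellman update; the chi-square and KL sets admit analogous dual forms.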
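Likewise, the "Adversarial Robustness Verification and Attack Synthesis in Stochastic Systems" entry describes adversaries that perturb DTMC transition probabilities within an $\varepsilon$-ball. The sketch below is only a brute-force illustration of that threat model under an assumed per-row total-variation ball, not that paper's formal framework; the 3-state chain and the choice of "bad state" are invented.

```python
# Hypothetical illustration: randomly perturb each row of a DTMC transition matrix
# within a total-variation eps-ball and watch how a quantity of interest can move.
import numpy as np

def perturb_row(p, eps, rng):
    """Return q with 0.5 * ||q - p||_1 <= eps, q >= 0, and the same row sum as p."""
    d = rng.normal(size=p.shape)
    d -= d.mean()                            # zero-sum direction keeps the row stochastic
    step_tv = eps / (0.5 * np.abs(d).sum())  # largest step that stays in the eps-ball
    neg = d < 0
    step_pos = np.min(p[neg] / -d[neg]) if neg.any() else np.inf  # keep entries nonnegative
    return p + rng.uniform(0.0, min(step_tv, step_pos)) * d

def stationary(P):
    """Stationary distribution of an ergodic chain (left eigenvector for eigenvalue 1)."""
    w, V = np.linalg.eig(P.T)
    pi = np.real(V[:, np.argmax(np.real(w))])
    return pi / pi.sum()

P = np.array([[0.90, 0.10, 0.00],   # made-up 3-state chain; state 2 is the "bad" state
              [0.10, 0.80, 0.10],
              [0.00, 0.20, 0.80]])
rng = np.random.default_rng(1)

worst = stationary(P)[2]
for _ in range(2000):               # crude random search over the threat model
    Pp = np.vstack([perturb_row(row, eps=0.05, rng=rng) for row in P])
    worst = max(worst, stationary(Pp)[2])

print("nominal long-run mass in bad state:", round(float(stationary(P)[2]), 3),
      "| worst sampled:", round(float(worst), 3))
```

A formal analysis would optimize over the ball exactly rather than sample it; the random search merely shows the metric drifting under small admissible perturbations.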