Can Q-Learning be Improved with Advice?
- URL: http://arxiv.org/abs/2110.13052v1
- Date: Mon, 25 Oct 2021 15:44:20 GMT
- Title: Can Q-Learning be Improved with Advice?
- Authors: Noah Golowich, Ankur Moitra
- Abstract summary: This paper addresses the question of whether worst-case lower bounds for regret can be circumvented in online learning of Markov decision processes (MDPs)
We show that when predictions about the optimal $Q$-value function satisfy a reasonably weak condition we call distillation, then we can improve regret bounds by replacing the set of state-action pairs with the set of state-action pairs on which the predictions are grossly inaccurate.
Our work extends a recent line of work on algorithms with predictions, which has typically focused on simple online problems such as caching and scheduling, to the more complex and general problem of reinforcement learning.
- Score: 27.24260290748049
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite rapid progress in theoretical reinforcement learning (RL) over the
last few years, most of the known guarantees are worst-case in nature, failing
to take advantage of structure that may be known a priori about a given RL
problem at hand. In this paper we address the question of whether worst-case
lower bounds for regret in online learning of Markov decision processes (MDPs)
can be circumvented when information about the MDP, in the form of predictions
about its optimal $Q$-value function, is given to the algorithm. We show that
when the predictions about the optimal $Q$-value function satisfy a reasonably
weak condition we call distillation, then we can improve regret bounds by
replacing the set of state-action pairs with the set of state-action pairs on
which the predictions are grossly inaccurate. This improvement holds for both
uniform regret bounds and gap-based ones. Further, we are able to achieve this
property with an algorithm that achieves sublinear regret when given arbitrary
predictions (i.e., even those which are not a distillation). Our work extends a
recent line of work on algorithms with predictions, which has typically focused
on simple online problems such as caching and scheduling, to the more complex
and general problem of reinforcement learning.
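To make the advice mechanism concrete, here is a minimal sketch of optimistic tabular Q-learning seeded with predicted $Q$-values. This is not the paper's algorithm: the environment interface, the bonus constant, and the learning-rate schedule are hypothetical choices made only for illustration, and the sketch omits the machinery that makes regret sublinear under arbitrary predictions.

```python
# Minimal sketch of Q-learning seeded with advice; NOT the paper's algorithm.
# The environment interface, the bonus constant, and the learning-rate
# schedule are hypothetical choices made only for illustration.
import numpy as np

def q_learning_with_advice(env, q_pred, episodes=500, horizon=10, bonus=1.0):
    """env.reset() -> state; env.step(action) -> (state, reward, done).
    q_pred is an (n_states, n_actions) array of predicted optimal Q-values."""
    n_states, n_actions = q_pred.shape
    # Initialize from the advice (plus a small optimism bonus) instead of the
    # usual uniform upper bound: accurate predictions leave little initial
    # over-optimism to unwind, which is where regret savings would come from.
    q = q_pred + bonus
    counts = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        for _ in range(horizon):
            a = int(np.argmax(q[s]))                       # greedy w.r.t. optimistic Q
            s_next, r, done = env.step(a)
            counts[s, a] += 1
            lr = (horizon + 1) / (horizon + counts[s, a])  # count-based step size
            q[s, a] += lr * (r + np.max(q[s_next]) - q[s, a])
            s = s_next
            if done:
                break
    return q
```

The paper's actual algorithm additionally guarantees sublinear regret even when the predictions are arbitrary (not a distillation), which this naive initialization does not.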
Related papers
- Strategically Conservative Q-Learning [89.17906766703763]
Offline reinforcement learning (RL) is a compelling paradigm to extend RL's practical utility.
The major difficulty in offline RL is mitigating the impact of approximation errors when encountering out-of-distribution (OOD) actions.
We propose a novel framework called Strategically Conservative Q-Learning (SCQ) that distinguishes between OOD data that is easy and hard to estimate.
arXiv Detail & Related papers (2024-06-06T22:09:46Z)
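The conservatism idea behind SCQ can be illustrated loosely as follows; this is a generic pessimistic-bootstrap sketch, not SCQ's actual mechanism, and the behavior-density estimate is hypothetical.

```python
# Generic illustration of conservatism in offline RL (not SCQ itself):
# shrink the bootstrap target in proportion to how poorly an action is
# supported by the behavior data, so OOD actions are valued pessimistically.
import numpy as np

def conservative_target(reward, q_next, density, gamma=0.99, alpha=1.0):
    """density: estimated behavior-policy density of each candidate action;
    values near 0 mark likely out-of-distribution (OOD) actions."""
    penalty = alpha * (1.0 - density)          # large penalty when unsupported
    return reward + gamma * np.max(q_next - penalty)
```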
- Truly No-Regret Learning in Constrained MDPs [61.78619476991494]
We propose a model-based primal-dual algorithm to learn in an unknown CMDP.
We prove that our algorithm achieves sublinear regret without error cancellations.
arXiv Detail & Related papers (2024-02-24T09:47:46Z)
- Constrained Online Two-stage Stochastic Optimization: Algorithm with (and without) Predictions [19.537289123577022]
We consider an online two-stage optimization with long-term constraints over a finite horizon of $T$ periods.
We develop online algorithms for the online two-stage problem from adversarial learning algorithms.
arXiv Detail & Related papers (2024-01-02T07:46:33Z)
- Beyond Value-Function Gaps: Improved Instance-Dependent Regret Bounds for Episodic Reinforcement Learning [50.44564503645015]
We provide improved gap-dependent regret bounds for reinforcement learning in finite episodic Markov decision processes.
We prove tighter upper regret bounds for optimistic algorithms and accompany them with new information-theoretic lower bounds for a large class of MDPs.
arXiv Detail & Related papers (2021-07-02T20:36:05Z)
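For orientation, a typical gap-dependent bound in episodic tabular RL has the following shape (schematic only, not the paper's exact statement):

```latex
% Schematic shape of a gap-dependent regret bound (not the paper's statement):
\mathrm{Regret}(T) \;\lesssim\; \sum_{(s,a)\,:\,\Delta(s,a)>0} \frac{\mathrm{poly}(H)\,\log T}{\Delta(s,a)},
\qquad \Delta(s,a) := V^*(s) - Q^*(s,a)
```

where $\Delta(s,a)$ is the suboptimality gap and $H$ the episode horizon; instances with large gaps therefore admit logarithmic rather than $\sqrt{T}$-type regret.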
- Double Coverage with Machine-Learned Advice [100.23487145400833]
We study the fundamental online $k$-server problem in a learning-augmented setting.
We show that our algorithm achieves an almost optimal consistency-robustness tradeoff for any $k$.
arXiv Detail & Related papers (2021-03-02T11:04:33Z)
- Exact Asymptotics for Linear Quadratic Adaptive Control [6.287145010885044]
We study the simplest non-bandit reinforcement learning problem: linear quadratic adaptive control (LQAC).
We derive expressions for the regret, estimation error, and prediction error of a stepwise-updating LQAC algorithm.
In simulations on both stable and unstable systems, we find that our theory also describes the algorithm's finite-sample behavior remarkably well.
arXiv Detail & Related papers (2020-11-02T22:43:30Z)
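A minimal certainty-equivalence adaptive LQR loop is sketched below; this is a standard scheme shown for illustration, not necessarily the stepwise-updating algorithm the paper analyzes, and the `step` interface and exploration scale are hypothetical.

```python
# Certainty-equivalence adaptive LQR sketch (a standard scheme shown for
# illustration, not necessarily the stepwise-updating algorithm the paper
# analyzes). The `step` interface and exploration scale are hypothetical.
import numpy as np
from scipy.linalg import solve_discrete_are

def lqr_gain(A, B, Q, R):
    """Solve the discrete-time Riccati equation and return the LQR gain."""
    P = solve_discrete_are(A, B, Q, R)
    return np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

def adaptive_lqr(step, x0, n, m, Q, R, T=1000, explore=0.1):
    """step(x, u) -> next state. Refit least-squares estimates of (A, B)
    every round and re-solve the controller from the current estimate."""
    feats, targets = [], []
    x, K = x0, np.zeros((m, n))
    for _ in range(T):
        u = -K @ x + explore * np.random.randn(m)     # exploratory noise
        x_next = step(x, u)
        feats.append(np.concatenate([x, u]))
        targets.append(x_next)
        theta, *_ = np.linalg.lstsq(np.array(feats), np.array(targets),
                                    rcond=None)
        A_hat, B_hat = theta[:n].T, theta[n:].T       # x' ~ A x + B u
        try:
            K = lqr_gain(A_hat, B_hat, Q, R)          # certainty equivalence
        except (np.linalg.LinAlgError, ValueError):
            pass                                      # keep the old gain early on
        x = x_next
    return K
```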
- Optimal Robustness-Consistency Trade-offs for Learning-Augmented Online Algorithms [85.97516436641533]
We study the problem of improving the performance of online algorithms by incorporating machine-learned predictions.
The goal is to design algorithms that are both consistent and robust.
We provide the first set of non-trivial lower bounds for competitive analysis using machine-learned predictions.
arXiv Detail & Related papers (2020-10-22T04:51:01Z)
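The two notions in the entry above are commonly formalized as follows (standard definitions from the learning-augmented literature, paraphrased):

```latex
% Standard formalization (paraphrased): an online algorithm ALG with access
% to a prediction is alpha-consistent and beta-robust if
\mathrm{ALG} \le \alpha \cdot \mathrm{OPT} \quad \text{when the prediction is correct, and} \quad
\mathrm{ALG} \le \beta \cdot \mathrm{OPT} \quad \text{for arbitrary predictions.}
```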
- Towards Tractable Optimism in Model-Based Reinforcement Learning [37.51073590932658]
To be successful, an optimistic RL algorithm must over-estimate the true value function (optimism), but not by so much that it is inaccurate (estimation error).
We re-interpret these scalable optimistic model-based algorithms as solving a tractable noise augmented MDP.
We show that if this error is reduced, optimistic model-based RL algorithms can match state-of-the-art performance in continuous control problems.
arXiv Detail & Related papers (2020-06-21T20:53:19Z) - Upper Confidence Primal-Dual Reinforcement Learning for CMDP with
Adversarial Loss [145.54544979467872]
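As a rough illustration of the noise-augmentation idea in the entry above (the general principle, not the paper's construction), optimism can be induced by evaluating several perturbed copies of the learned model and keeping the most favorable estimate; every name below is hypothetical.

```python
# Rough illustration of optimism via noise augmentation (the general idea,
# not the paper's construction). All names below are hypothetical.
import numpy as np

def optimistic_value(value_of_model, model_mean, noise_scale, n_samples=8):
    """value_of_model(params) -> scalar value estimate for a fixed policy;
    model_mean is the learned model's parameter vector."""
    rng = np.random.default_rng()
    perturbed = model_mean + noise_scale * rng.standard_normal(
        (n_samples, model_mean.shape[0]))
    # Taking the max over noisy copies of the model plays the role of an
    # upper confidence bound, without explicit per-state bonus terms.
    return max(value_of_model(p) for p in perturbed)
```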
- Upper Confidence Primal-Dual Reinforcement Learning for CMDP with Adversarial Loss [145.54544979467872]
We consider online learning in episodic constrained Markov decision processes (CMDPs).
We propose a new upper confidence primal-dual algorithm, which only requires the trajectories sampled from the transition model.
Our analysis incorporates a new high-probability drift analysis of Lagrange multiplier processes into the celebrated regret analysis of upper confidence reinforcement learning.
arXiv Detail & Related papers (2020-03-02T05:02:23Z) - Adaptive Approximate Policy Iteration [22.915651391812187]
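The Lagrange-multiplier process referred to in the previous entry typically evolves by a projected gradient step of the following schematic form (my notation, not necessarily the paper's):

```latex
% Schematic projected-gradient update for the Lagrange multiplier
% (my notation, not necessarily the paper's):
\lambda_{t+1} \;=\; \bigl[\,\lambda_t + \eta\,(\widehat{C}_t(\pi_t) - \kappa)\,\bigr]_{+}
```

where $\widehat{C}_t(\pi_t)$ is the estimated constraint cost of the current policy, $\kappa$ the constraint budget, $\eta$ a step size, and $[\cdot]_+$ truncation at zero; the multiplier grows while the constraint is violated and relaxes otherwise.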
- Adaptive Approximate Policy Iteration [22.915651391812187]
We present a learning scheme which enjoys a $\tilde{O}(T^{2/3})$ regret bound for undiscounted, continuing learning in uniformly ergodic MDPs.
This is an improvement over the best existing bound of $\tilde{O}(T^{3/4})$ for the average-reward case with function approximation.
arXiv Detail & Related papers (2020-02-08T02:27:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.