Rate-matching the regret lower-bound in the linear quadratic regulator with unknown dynamics
- URL: http://arxiv.org/abs/2202.05799v1
- Date: Fri, 11 Feb 2022 17:50:14 GMT
- Title: Rate-matching the regret lower-bound in the linear quadratic regulator with unknown dynamics
- Authors: Feicheng Wang and Lucas Janson
- Abstract summary: This paper establishes a novel regret upper-bound of $O_p(\sqrt{T})$.
It simultaneously establishes an estimation error bound on the dynamics of $O_p(T^{-1/4})$.
- Score: 6.287145010885044
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The theory of reinforcement learning currently suffers from a mismatch
between its empirical performance and the theoretical characterization of its
performance, with consequences for, e.g., the understanding of sample
efficiency, safety, and robustness. The linear quadratic regulator with unknown
dynamics is a fundamental reinforcement learning setting with significant
structure in its dynamics and cost function, yet even in this setting there is
a gap between the best known regret lower-bound of $\Omega_p(\sqrt{T})$ and the
best known upper-bound of $O_p(\sqrt{T}\,\text{polylog}(T))$. The contribution
of this paper is to close that gap by establishing a novel regret upper-bound
of $O_p(\sqrt{T})$. Our proof is constructive in that it analyzes the regret of
a concrete algorithm, and simultaneously establishes an estimation error bound
on the dynamics of $O_p(T^{-1/4})$ which is also the first to match the rate of
a known lower-bound. The two keys to our improved proof technique are (1) a
more precise upper- and lower-bound on the system Gram matrix and (2) a
self-bounding argument for the expected estimation error of the optimal
controller.
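To make the setting concrete, the following is a minimal, self-contained sketch of LQR with unknown dynamics in one dimension: excite the system, estimate the dynamics $(a, b)$ by least squares, and then apply certainty-equivalence control with the estimated gain. This is an illustrative toy, not the paper's algorithm; the constants (`a_true`, `b_true`, noise scale, horizon) are arbitrary choices for demonstration.

```python
import random

def scalar_lqr_gain(a, b, q=1.0, r=1.0, iters=200):
    """Solve the scalar discrete-time Riccati equation by fixed-point
    iteration and return the optimal feedback gain K (u_t = -K x_t)."""
    p = q
    for _ in range(iters):
        p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
    return a * b * p / (r + b * b * p)

random.seed(0)
a_true, b_true = 0.9, 0.5            # illustrative "unknown" dynamics
K_star = scalar_lqr_gain(a_true, b_true)

# Excite the system with random inputs and record transitions
# x_{t+1} = a x_t + b u_t + w_t, then estimate (a, b) by least squares.
xs, us, ys = [], [], []
x = 0.0
for _ in range(500):
    u = random.gauss(0.0, 1.0)
    x_next = a_true * x + b_true * u + 0.1 * random.gauss(0.0, 1.0)
    xs.append(x); us.append(u); ys.append(x_next)
    x = x_next

# Solve the 2x2 normal equations for (a_hat, b_hat).
sxx = sum(v * v for v in xs)
suu = sum(v * v for v in us)
sxu = sum(xi * ui for xi, ui in zip(xs, us))
sxy = sum(xi * yi for xi, yi in zip(xs, ys))
suy = sum(ui * yi for ui, yi in zip(us, ys))
det = sxx * suu - sxu * sxu
a_hat = (suu * sxy - sxu * suy) / det
b_hat = (sxx * suy - sxu * sxy) / det

# Certainty-equivalence control: act as if the estimates were exact.
K_hat = scalar_lqr_gain(a_hat, b_hat)
est_err = max(abs(a_hat - a_true), abs(b_hat - b_true))
```

The paper's rates concern how the estimation error (here `est_err`) and the resulting excess cost of using `K_hat` instead of `K_star` shrink as the horizon grows; this sketch only shows the estimate-then-control pipeline those rates are about.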
Related papers
- Regularized Online RLHF with Generalized Bilinear Preferences [68.44113000390544]
We consider the problem of contextual online RLHF with general preferences. We adopt the Generalized Bilinear Preference Model to capture preferences via low-rank, skew-symmetric matrices. We prove that the dual gap of the greedy policy is bounded by the square of the estimation error.
arXiv Detail & Related papers (2026-02-26T15:27:53Z)
- Causal Direction from Convergence Time: Faster Training in the True Causal Direction [0.0]
We introduce Causal Computational Asymmetry (CCA), a principle for causal direction identification based on optimization dynamics. CCA operates in optimization-time space, distinguishing it from methods such as RESIT, IGCI, and SkewScore. We further embed CCA into a broader framework termed Causal Compression Learning (CCL), which integrates graph structure learning, causal information compression, and policy optimization.
arXiv Detail & Related papers (2026-02-24T21:34:57Z)
- Scaling Bidirectional Spans and Span Violations in Attention Mechanism [5.755498052202004]
The canonical $O(N^2)$ Transformer remains the empirical performance frontier in sequence modeling. We propose an optimization framework that leverages an asymmetric projection to decompose the backward-pass gradients into parallel spans. We demonstrate that selectively scaling these components, focusing primarily on $0$th-order bidirectional parallel spans, yields the most effective learning signal.
arXiv Detail & Related papers (2025-12-15T07:03:24Z)
- Binary perceptron computational gap -- a parametric fl RDT view [2.538209532048867]
The asymmetric binary perceptron (ABP) likely exhibits two phase-transitioning constraint density thresholds. We consider a particular parametric utilization of \emph{fully lifted random duality theory} (fl RDT) [85] and study its potential algorithmic implications for the ABP.
arXiv Detail & Related papers (2025-11-02T18:23:49Z)
- Neural Variance-aware Dueling Bandits with Deep Representation and Shallow Exploration [6.287267171078442]
We propose variance-aware algorithms that leverage neural networks to approximate nonlinear utility functions. We establish theoretical guarantees showing that our algorithms achieve sublinear cumulative average regret of order $\mathcal{O}\big(d\sqrt{\sum_{t=1}^{T}\sigma_t^2} + \sqrt{dT}\big)$ for sufficiently wide neural networks.
arXiv Detail & Related papers (2025-06-02T01:58:48Z)
- Supervised Optimism Correction: Be Confident When LLMs Are Sure [91.7459076316849]
We establish a novel theoretical connection between supervised fine-tuning and offline reinforcement learning.
We show that the widely used beam search method suffers from unacceptable over-optimism.
We propose Supervised Optimism Correction, which introduces a simple yet effective auxiliary loss for token-level $Q$-value estimations.
arXiv Detail & Related papers (2025-04-10T07:50:03Z)
- Benefits of Learning Rate Annealing for Tuning-Robustness in Stochastic Optimization [29.174036532175855]
The learning rate in gradient methods is a critical hyperparameter that is notoriously costly to tune via standard grid search.
We identify a theoretical advantage of learning rate annealing schemes that decay the learning rate to zero, such as the widely used cosine schedule.
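For reference, the cosine schedule mentioned above can be written in a few lines. This is a generic sketch of the standard cosine annealing formula; the `lr_max`/`lr_min` defaults are illustrative, not values from the paper.

```python
import math

def cosine_lr(t, T, lr_max=0.1, lr_min=0.0):
    """Cosine annealing: decays smoothly from lr_max at step t = 0
    down to lr_min at step t = T."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t / T))
```

At the halfway point the schedule sits exactly between the two endpoints, and the derivative vanishes at both ends, which is what makes the decay "gentle" near the start and finish.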
arXiv Detail & Related papers (2025-03-12T14:06:34Z)
- Convergence Rate Analysis of LION [54.28350823319057]
LION converges at the rate $\mathcal{O}(\sqrt{d}\,K^{-1/4})$ measured by the gradient Karush-Kuhn-Tucker (KKT) residual.
We show that LION can achieve lower loss and higher performance compared to standard SGD.
arXiv Detail & Related papers (2024-11-12T11:30:53Z)
- Sublinear Regret for a Class of Continuous-Time Linear--Quadratic Reinforcement Learning Problems [10.404992912881601]
We study reinforcement learning for a class of continuous-time linear-quadratic (LQ) control problems for diffusions.
We apply a model-free approach that relies neither on knowledge of model parameters nor on their estimations, and devise an actor-critic algorithm to learn the optimal policy parameter directly.
arXiv Detail & Related papers (2024-07-24T12:26:21Z)
- Contextual Dynamic Pricing: Algorithms, Optimality, and Local Differential Privacy Constraints [10.057344315478709]
We study the contextual dynamic pricing problem where a firm sells products to $T$ sequentially arriving consumers.
We first show that the optimal regret upper bound is of order $\sqrt{dT}$, up to a logarithmic factor.
A key insight of our theoretical result is an intrinsic connection between dynamic pricing and the contextual multi-armed bandit problem.
arXiv Detail & Related papers (2024-06-04T15:44:10Z)
- Sub-linear Regret in Adaptive Model Predictive Control [56.705978425244496]
We present STT-MPC (Self-Tuning Tube-based Model Predictive Control), an online algorithm that combines the certainty-equivalence principle with polytopic tubes.
We analyze the regret of the algorithm, when compared to an algorithm initially aware of the system dynamics.
arXiv Detail & Related papers (2023-10-07T15:07:10Z)
- Variance-Aware Regret Bounds for Stochastic Contextual Dueling Bandits [53.281230333364505]
This paper studies the problem of contextual dueling bandits, where the binary comparison of dueling arms is generated from a generalized linear model (GLM).
We propose a new SupLinUCB-type algorithm that enjoys computational efficiency and a variance-aware regret bound of $\tilde{O}\big(d\sqrt{\sum_{t=1}^{T}\sigma_t^2} + d\big)$.
Our regret bound naturally aligns with the intuitive expectation: in scenarios where the comparison is deterministic, the algorithm only suffers an $\tilde{O}(d)$ regret.
arXiv Detail & Related papers (2023-10-02T08:15:52Z)
- PROMISE: Preconditioned Stochastic Optimization Methods by Incorporating Scalable Curvature Estimates [17.777466668123886]
We introduce PROMISE ($\textbf{Pr}$econditioned $\textbf{O}$ptimization $\textbf{M}$ethods by $\textbf{I}$ncorporating $\textbf{S}$calable Curvature $\textbf{E}$stimates), a suite of sketching-based preconditioned gradient algorithms.
PROMISE includes preconditioned versions of SVRG, SAGA, and Katyusha.
arXiv Detail & Related papers (2023-09-05T07:49:10Z)
- Provable Robust Saliency-based Explanations [16.217374556142484]
Experiments with a wide spectrum of network architectures and data modalities demonstrate that R2ET attains higher explanation robustness under stealthy attacks while retaining model accuracy.
arXiv Detail & Related papers (2022-12-28T22:05:32Z)
- Estimating Principal Components under Adversarial Perturbations [25.778123431786653]
We study a natural model of robustness for high-dimensional statistical estimation problems.
Our model is motivated by emerging paradigms such as low precision machine learning and adversarial training.
arXiv Detail & Related papers (2020-05-31T20:27:19Z)
- Adaptive Control and Regret Minimization in Linear Quadratic Gaussian (LQG) Setting [91.43582419264763]
We propose LqgOpt, a novel reinforcement learning algorithm based on the principle of optimism in the face of uncertainty.
LqgOpt efficiently explores the system dynamics, estimates the model parameters up to their confidence interval, and deploys the controller of the most optimistic model.
arXiv Detail & Related papers (2020-03-12T19:56:38Z)
- Upper Confidence Primal-Dual Reinforcement Learning for CMDP with Adversarial Loss [145.54544979467872]
We consider online learning for episodically constrained Markov decision processes (CMDPs).
We propose a new \emph{upper confidence primal-dual} algorithm, which only requires the trajectories sampled from the transition model.
Our analysis incorporates a new high-probability drift analysis of Lagrange multiplier processes into the celebrated regret analysis of upper confidence reinforcement learning.
arXiv Detail & Related papers (2020-03-02T05:02:23Z)
- Improper Learning for Non-Stochastic Control [78.65807250350755]
We consider the problem of controlling a possibly unknown linear dynamical system with adversarial perturbations, adversarially chosen convex loss functions, and partially observed states.
Applying online gradient descent to this parametrization yields a new controller which attains sublinear regret against a large class of closed-loop policies.
Our bounds are the first in the non-stochastic control setting that compete with \emph{all} stabilizing linear dynamical controllers.
arXiv Detail & Related papers (2020-01-25T02:12:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.