Bandit Linear Control
- URL: http://arxiv.org/abs/2007.00759v1
- Date: Wed, 1 Jul 2020 21:12:19 GMT
- Title: Bandit Linear Control
- Authors: Asaf Cassel, Tomer Koren (School of Computer Science, Tel Aviv University)
- Abstract summary: We consider the problem of controlling a known linear dynamical system under noise, adversarially chosen costs, and bandit feedback.
We present a new and efficient algorithm that, for strongly convex and smooth costs, obtains regret that grows with the square root of the time horizon $T$.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the problem of controlling a known linear dynamical system under
stochastic noise, adversarially chosen costs, and bandit feedback. Unlike the
full feedback setting where the entire cost function is revealed after each
decision, here only the cost incurred by the learner is observed. We present a
new and efficient algorithm that, for strongly convex and smooth costs, obtains
regret that grows with the square root of the time horizon $T$. We also give
extensions of this result to general convex, possibly non-smooth costs, and to
non-stochastic system noise. A key component of our algorithm is a new
technique for addressing bandit optimization of loss functions with memory.
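A standard building block in this line of work is the one-point gradient estimator for bandit convex optimization: a single scalar cost observation at a randomly perturbed point yields an unbiased estimate of the smoothed loss gradient. The sketch below is a minimal, memoryless toy (a quadratic cost and all parameters are illustrative assumptions, not the paper's algorithm):

```python
import numpy as np

# Minimal sketch of one-point bandit gradient estimation.
# The quadratic cost and every parameter here are illustrative
# assumptions; the paper's actual algorithm additionally handles
# loss functions with memory.

rng = np.random.default_rng(0)
d = 3        # decision dimension (e.g., controller parameters)
delta = 0.2  # exploration radius
eta = 0.005  # step size
R = 2.0      # feasible-set radius (projection keeps iterates bounded)

def cost(u):
    # Strongly convex quadratic cost, unknown to the learner:
    # only the scalar value cost(u) is revealed (bandit feedback).
    return float(u @ u + u.sum())

x = np.zeros(d)
xbar = np.zeros(d)  # average of the second half of the iterates
T = 20000
for t in range(T):
    v = rng.normal(size=d)
    v /= np.linalg.norm(v)       # random direction on the unit sphere
    c = cost(x + delta * v)      # single scalar (bandit) observation
    g = (d / delta) * c * v      # one-point estimate of the smoothed gradient
    x = x - eta * g
    nx = np.linalg.norm(x)
    if nx > R:                   # project back onto the feasible ball
        x *= R / nx
    if t >= T // 2:
        xbar += x / (T - T // 2)

# The minimizer of u.u + sum(u) is -0.5 in every coordinate;
# the averaged iterate xbar approaches it despite scalar-only feedback.
```

For quadratic costs this estimator is exactly unbiased for the gradient; in general it estimates the gradient of a smoothed surrogate, which is where the exploration radius `delta` enters the regret analysis.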
Related papers
- Tight Rates for Bandit Control Beyond Quadratics [2.961909021941052]
We develop an algorithm that achieves $\tilde{O}(\sqrt{T})$ regret for bandit non-stochastic control with smooth perturbation functions.
Our main contribution is an algorithm that achieves $\tilde{O}(\sqrt{T})$ regret for Bandit Convex Optimization (BCO) with memory.
arXiv Detail & Related papers (2024-10-01T18:35:08Z) - Corruption-Robust Algorithms with Uncertainty Weighting for Nonlinear
Contextual Bandits and Markov Decision Processes [59.61248760134937]
We propose an efficient algorithm that achieves a regret of $\tilde{O}(\sqrt{T}+\zeta)$.
The proposed algorithm relies on the recently developed uncertainty-weighted least-squares regression from linear contextual bandits.
We generalize our algorithm to the episodic MDP setting and are the first to achieve an additive dependence on the corruption level $\zeta$.
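The uncertainty-weighting idea can be illustrated with a small regression sketch: observations suspected of corruption are down-weighted in the least-squares fit. The residual-based weighting rule and all names below are illustrative assumptions, not the paper's exact scheme:

```python
import numpy as np

# Illustrative sketch of uncertainty-weighted ridge regression.
# The residual-based weights are a stand-in for the paper's scheme;
# the data, corruption model, and parameters are assumptions.

rng = np.random.default_rng(1)
d, n, lam = 2, 500, 1.0
theta_true = np.array([1.0, -2.0])

X = rng.normal(size=(n, d))
y = X @ theta_true + 0.1 * rng.normal(size=n)
y[:25] += 10.0  # a few adversarially corrupted rewards

# Preliminary (unweighted) ridge fit, then down-weight points
# whose residuals look suspicious.
theta0 = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
resid = np.abs(y - X @ theta0)
w = 1.0 / (1.0 + resid**2)  # small weight on high-residual points

# Weighted ridge solve: corrupted rows barely influence the estimate.
A = (X * w[:, None]).T @ X + lam * np.eye(d)
b = (X * w[:, None]).T @ y
theta_w = np.linalg.solve(A, b)
```

The corrupted rows end up with weights near zero, so their contribution to the normal equations is negligible and the weighted estimate stays close to `theta_true` even though the unweighted fit is biased.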
arXiv Detail & Related papers (2022-12-12T15:04:56Z) - Rate-Optimal Online Convex Optimization in Adaptive Linear Control [0.0]
We consider the problem of controlling an unknown linear system under adversarially changing convex costs.
We present the first computationally efficient algorithm that attains an optimal $\sqrt{T}$ regret rate against the best linear controller in hindsight.
arXiv Detail & Related papers (2022-06-03T07:32:11Z) - Online Nonsubmodular Minimization with Delayed Costs: From Full
Information to Bandit Feedback [98.7678704343537]
We focus on a class of nonsubmodular functions with special structure, and prove regret guarantees for several variants of the online and approximate online bandit gradient descent algorithms.
We derive bounds for the agent's regret in the full information and bandit feedback setting, even if the delay between choosing a decision and receiving the incurred cost is unbounded.
arXiv Detail & Related papers (2022-05-15T08:27:12Z) - Efficient Online Linear Control with Stochastic Convex Costs and Unknown
Dynamics [0.0]
We present a computationally efficient algorithm that attains an optimal $\sqrt{T}$ regret rate against the best stabilizing linear controller.
In contrast to previous work, our algorithm is based on the Optimism in the Face of Uncertainty paradigm.
arXiv Detail & Related papers (2022-03-02T15:19:20Z) - A Robust Phased Elimination Algorithm for Corruption-Tolerant Gaussian
Process Bandits [118.22458816174144]
We propose a novel robust elimination-type algorithm that runs in epochs, combines exploration with infrequent switching to select a small subset of actions, and plays each action for multiple time instants.
Our algorithm, GP Robust Phased Elimination (RGP-PE), successfully balances robustness to corruptions with exploration and exploitation.
We perform the first empirical study of robustness in the corrupted GP bandit setting, and show that our algorithm is robust against a variety of adversarial attacks.
arXiv Detail & Related papers (2022-02-03T21:19:36Z) - Online Markov Decision Processes with Aggregate Bandit Feedback [74.85532145498742]
We study a novel variant of online finite-horizon Markov Decision Processes with adversarially changing loss functions and initially unknown dynamics.
In each episode, the learner suffers the loss accumulated along the trajectory realized by the policy chosen for the episode, and observes aggregate bandit feedback.
Our main result is a computationally efficient algorithm with $O(\sqrt{K})$ regret for this setting, where $K$ is the number of episodes.
arXiv Detail & Related papers (2021-01-31T16:49:07Z) - Non-Stochastic Control with Bandit Feedback [30.33117611898598]
We give an efficient sublinear regret algorithm for a known or unknown system.
The main algorithmic difficulty is the dependence of the loss on past controls.
We propose an efficient algorithm for the general setting of bandit convex optimization for loss functions with memory.
arXiv Detail & Related papers (2020-08-12T18:40:00Z) - Making Non-Stochastic Control (Almost) as Easy as Stochastic [27.736345095024276]
We show that the same regret rate is attainable even in the considerably more general non-stochastic control model.
We attain the optimal $\widetilde{\mathcal{O}}(\sqrt{T})$ regret when the dynamics are unknown to the learner.
arXiv Detail & Related papers (2020-06-10T16:00:14Z) - Logarithmic Regret for Adversarial Online Control [56.12283443161479]
We give the first algorithm with logarithmic regret for arbitrary adversarial disturbance sequences.
Our algorithm and analysis use a characterization for the offline control law to reduce the online control problem to (delayed) online learning.
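The reduction to delayed online learning can be illustrated with a minimal delayed online gradient descent loop, where the gradient for round $t$ only becomes available $h$ rounds later. The loss, delay, and parameters below are illustrative assumptions, not the paper's construction:

```python
import numpy as np
from collections import deque

# Sketch of delayed online gradient descent: feedback for round t
# arrives h rounds later, as in reductions from online control to
# delayed online learning. Loss, delay, and step size are assumptions.

h = 5       # feedback delay
eta = 0.08  # step size (small enough for stability under delay)
x = 0.0
pending = deque()  # gradients not yet delivered to the learner
total_loss = 0.0

def loss_and_grad(x, z):
    # Simple quadratic tracking loss (x - z)^2 toward a moving target z.
    return (x - z) ** 2, 2.0 * (x - z)

targets = np.sin(np.linspace(0, 2 * np.pi, 500))
for z in targets:
    l, g = loss_and_grad(x, z)
    total_loss += l
    pending.append(g)
    if len(pending) > h:           # gradient from h rounds ago arrives now
        x -= eta * pending.popleft()
```

Despite acting on stale gradients, the learner tracks the slowly moving target and incurs far less cumulative loss than the fixed decision $x = 0$; the delay only degrades the regret bound by a factor depending on $h$.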
arXiv Detail & Related papers (2020-02-29T06:29:19Z) - Improper Learning for Non-Stochastic Control [78.65807250350755]
We consider the problem of controlling a possibly unknown linear dynamical system with adversarial perturbations, adversarially chosen convex loss functions, and partially observed states.
Applying online gradient descent to a convex controller parametrization yields a new controller which attains sublinear regret against a large class of closed-loop policies.
Our bounds are the first in the non-stochastic control setting that compete with all stabilizing linear dynamical controllers.
arXiv Detail & Related papers (2020-01-25T02:12:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.