Online Markov Decision Processes with Aggregate Bandit Feedback
- URL: http://arxiv.org/abs/2102.00490v1
- Date: Sun, 31 Jan 2021 16:49:07 GMT
- Title: Online Markov Decision Processes with Aggregate Bandit Feedback
- Authors: Alon Cohen, Haim Kaplan, Tomer Koren, Yishay Mansour
- Abstract summary: We study a novel variant of online finite-horizon Markov Decision Processes with adversarially changing loss functions and initially unknown dynamics.
In each episode, the learner suffers the loss accumulated along the trajectory realized by the policy chosen for the episode, and observes aggregate bandit feedback.
Our main result is a computationally efficient algorithm with $O(sqrtK)$ regret for this setting, where $K$ is the number of episodes.
- Score: 74.85532145498742
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study a novel variant of online finite-horizon Markov Decision Processes
with adversarially changing loss functions and initially unknown dynamics. In
each episode, the learner suffers the loss accumulated along the trajectory
realized by the policy chosen for the episode, and observes aggregate bandit
feedback: the trajectory is revealed along with the cumulative loss suffered,
rather than the individual losses encountered along the trajectory. Our main
result is a computationally efficient algorithm with $O(\sqrt{K})$ regret for
this setting, where $K$ is the number of episodes.
We establish this result via an efficient reduction to a novel bandit
learning setting we call Distorted Linear Bandits (DLB), which is a variant of
bandit linear optimization where actions chosen by the learner are
adversarially distorted before they are committed. We then develop a
computationally-efficient online algorithm for DLB for which we prove an
$O(\sqrt{T})$ regret bound, where $T$ is the number of time steps. Our
algorithm is based on online mirror descent with a self-concordant barrier
regularization that employs a novel increasing learning rate schedule.
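As a rough, hypothetical illustration of the ingredients named above (online mirror descent with a self-concordant barrier regularizer, one-point bandit loss estimates, and a growing step size), here is a minimal SCRiBLe-style sketch for bandit linear optimization over the box $[-1,1]^d$ with a log-barrier. It is not the paper's DLB algorithm: the adversarial distortion step is omitted, and the geometric learning-rate schedule merely stands in for the authors' specific increasing schedule. All function names and constants are illustrative.

```python
import numpy as np

def barrier_grad(x):
    # Gradient of the log-barrier R(x) = -sum_i [log(1 - x_i) + log(1 + x_i)].
    return 1.0 / (1.0 - x) - 1.0 / (1.0 + x)

def barrier_grad_inv(z):
    # Invert barrier_grad coordinate-wise: solve 2x / (1 - x^2) = z with x in (-1, 1).
    out = np.zeros_like(z)
    nz = np.abs(z) > 1e-12
    out[nz] = (np.sqrt(1.0 + z[nz] ** 2) - 1.0) / z[nz]
    return out

def bandit_omd_log_barrier(loss_vectors, eta0=0.01, growth=1.0, seed=0):
    """SCRiBLe-style one-point bandit linear optimization over the box [-1, 1]^d.

    Illustrative sketch only: the log-barrier plays the role of the
    self-concordant regularizer, `growth` > 1 gives an increasing learning
    rate, and the adversarial distortion of DLB is omitted entirely.
    """
    T, d = loss_vectors.shape
    rng = np.random.default_rng(seed)
    x = np.zeros(d)                      # interior iterate of the box
    total_loss = 0.0
    for t in range(T):
        eta = eta0 * growth ** t         # (possibly increasing) learning rate
        hess = 1.0 / (1.0 - x) ** 2 + 1.0 / (1.0 + x) ** 2  # diagonal barrier Hessian
        i = rng.integers(d)              # sample a Dikin-ellipsoid axis ...
        s = rng.choice([-1.0, 1.0])      # ... and a direction along it
        y = x.copy()
        y[i] += s / np.sqrt(hess[i])     # play a point on the Dikin ellipsoid (stays feasible)
        loss = float(loss_vectors[t] @ y)    # bandit feedback: only this scalar is revealed
        total_loss += loss
        g = np.zeros(d)
        g[i] = d * loss * s * np.sqrt(hess[i])   # one-point estimate of the loss vector
        x = barrier_grad_inv(barrier_grad(x) - eta * g)   # mirror-descent step
    return total_loss

# Tiny usage example: a fixed adversary replaying the same small loss vector.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    c = rng.uniform(-0.2, 0.2, size=4)
    print(bandit_omd_log_barrier(np.tile(c, (300, 1)), eta0=0.01, growth=1.002))
```

Sampling on the Dikin ellipsoid of the barrier keeps the played point inside the feasible set, and scaling the observed scalar loss by the Hessian square root yields an (approximately) unbiased estimate of the unknown loss vector, which is all the mirror-descent update ever sees.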
Related papers
- In-Trajectory Inverse Reinforcement Learning: Learn Incrementally Before An Ongoing Trajectory Terminates [10.438810967483438]
Inverse reinforcement learning (IRL) aims to learn a reward function and a corresponding policy.
Current IRL works cannot learn incrementally from an ongoing trajectory because they have to wait to collect at least one complete trajectory to learn.
This paper considers the problem of learning a reward function and a corresponding policy while observing the initial state-action pair of an ongoing trajectory.
arXiv Detail & Related papers (2024-10-21T03:16:32Z)
- Distributed Online Bandit Nonconvex Optimization with One-Point Residual Feedback via Dynamic Regret [10.700891331004799]
This paper considers the distributed online bandit optimization problem with nonconvex loss functions over a time-varying digraph.
At each round, each player selects a decision, and then the adversary assigns an arbitrary nonconvex loss function to that player.
The expected regret of our algorithms is comparable to existing algorithms that use two-point deviation.
arXiv Detail & Related papers (2024-09-24T02:37:33Z)
- Online Reinforcement Learning in Markov Decision Process Using Linear Programming [1.0878040851638]
We consider online reinforcement learning in episodic Markov decision process (MDP) with unknown transition function and rewards drawn from some fixed but unknown distribution.
We devise a simple and efficient model-based algorithm that achieves $\widetilde{O}(LX\sqrt{TA})$ regret with high probability.
arXiv Detail & Related papers (2023-03-31T22:21:41Z)
- Variance-Dependent Regret Bounds for Linear Bandits and Reinforcement Learning: Adaptivity and Computational Efficiency [90.40062452292091]
We present the first computationally efficient algorithm for linear bandits with heteroscedastic noise.
Our algorithm is adaptive to the unknown variance of noise and achieves an $\tilde{O}(d\sqrt{\sum_{k=1}^{K} \sigma_k^2} + d)$ regret.
We also propose a variance-adaptive algorithm for linear mixture Markov decision processes (MDPs) in reinforcement learning.
arXiv Detail & Related papers (2023-02-21T00:17:24Z)
- Online Prediction in Sub-linear Space [15.773280101995676]
We provide the first sub-linear space and sub-linear regret algorithm for online learning with expert advice (against an oblivious adversary).
We also demonstrate a separation between oblivious and (strong) adaptive adversaries by proving a linear memory lower bound of any sub-linear regret algorithm against an adaptive adversary.
arXiv Detail & Related papers (2022-07-16T16:15:39Z)
- Efficient and Optimal Algorithms for Contextual Dueling Bandits under Realizability [59.81339109121384]
We study the $K$-armed contextual dueling bandit problem, a sequential decision making setting in which the learner uses contextual information to make two decisions, but only observes preference-based feedback suggesting that one decision was better than the other.
We provide a new algorithm that achieves the optimal regret rate for a new notion of best response regret, which is a strictly stronger performance measure than those considered in prior works.
arXiv Detail & Related papers (2021-11-24T07:14:57Z)
- Anti-Concentrated Confidence Bonuses for Scalable Exploration [57.91943847134011]
Intrinsic rewards play a central role in handling the exploration-exploitation trade-off.
We introduce anti-concentrated confidence bounds for efficiently approximating the elliptical bonus.
We develop a practical variant for deep reinforcement learning that is competitive with contemporary intrinsic rewards on Atari benchmarks.
arXiv Detail & Related papers (2021-10-21T15:25:15Z)
- Efficient Learning in Non-Stationary Linear Markov Decision Processes [17.296084954104415]
We study episodic reinforcement learning in non-stationary linear (a.k.a. low-rank) Markov Decision Processes (MDPs).
We propose OPT-WLSVI, an optimistic model-free algorithm based on weighted least-squares value iteration, which uses exponential weights to smoothly forget data that are far in the past (see the sketch of this forgetting mechanism after this list).
We show that our algorithm, when competing against the best policy at each time, achieves a regret that is upper bounded by $\widetilde{\mathcal{O}}(d^{5/4} H^2 \Delta^{1/4} K^{3/4})$, where $d$ is the feature dimension, $H$ the horizon, $K$ the number of episodes, and $\Delta$ a measure of the non-stationarity of the MDP.
arXiv Detail & Related papers (2020-10-24T11:02:45Z)
- Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z)
- Upper Confidence Primal-Dual Reinforcement Learning for CMDP with Adversarial Loss [145.54544979467872]
We consider online learning for episodic constrained Markov decision processes (CMDPs).
We propose a new upper confidence primal-dual algorithm, which only requires the trajectories sampled from the transition model.
Our analysis incorporates a new high-probability drift analysis of Lagrange multiplier processes into the celebrated regret analysis of upper confidence reinforcement learning.
arXiv Detail & Related papers (2020-03-02T05:02:23Z)
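The non-stationary linear MDP entry above attributes its forgetting to exponential weighting of past data. Below is a minimal, hypothetical sketch of that mechanism written as a discounted ridge regression; the function and parameter names are ours, not from OPT-WLSVI.

```python
import numpy as np

def discounted_ridge(features, targets, gamma=0.99, lam=1.0):
    """Exponentially weighted (discounted) ridge regression.

    A sample that is `age` steps old gets weight gamma**age, so stale data
    is smoothly forgotten. Illustrative sketch only, not the OPT-WLSVI update.
    """
    T, d = features.shape
    A = lam * np.eye(d)          # regularized design matrix
    b = np.zeros(d)
    for s in range(T):
        w = gamma ** (T - 1 - s)                     # older samples weigh less
        A += w * np.outer(features[s], features[s])
        b += w * targets[s] * features[s]
    return np.linalg.solve(A, b)                     # weighted least-squares fit
```

In a value-iteration context the targets would be backed-up value estimates; here they are plain regression targets, which is enough to show how the discount factor trades off adaptivity to non-stationarity against statistical efficiency.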
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.