Online Markov Decision Processes with Aggregate Bandit Feedback
- URL: http://arxiv.org/abs/2102.00490v1
- Date: Sun, 31 Jan 2021 16:49:07 GMT
- Title: Online Markov Decision Processes with Aggregate Bandit Feedback
- Authors: Alon Cohen, Haim Kaplan, Tomer Koren, Yishay Mansour
- Abstract summary: We study a novel variant of online finite-horizon Markov Decision Processes with adversarially changing loss functions and initially unknown dynamics.
In each episode, the learner suffers the loss accumulated along the trajectory realized by the policy chosen for the episode, and observes aggregate bandit feedback.
Our main result is a computationally efficient algorithm with $O(sqrtK)$ regret for this setting, where $K$ is the number of episodes.
- Score: 74.85532145498742
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study a novel variant of online finite-horizon Markov Decision Processes
with adversarially changing loss functions and initially unknown dynamics. In
each episode, the learner suffers the loss accumulated along the trajectory
realized by the policy chosen for the episode, and observes aggregate bandit
feedback: the trajectory is revealed along with the cumulative loss suffered,
rather than the individual losses encountered along the trajectory. Our main
result is a computationally efficient algorithm with $O(\sqrt{K})$ regret for
this setting, where $K$ is the number of episodes.
We establish this result via an efficient reduction to a novel bandit
learning setting we call Distorted Linear Bandits (DLB), which is a variant of
bandit linear optimization where actions chosen by the learner are
adversarially distorted before they are committed. We then develop a
computationally-efficient online algorithm for DLB for which we prove an
$O(\sqrt{T})$ regret bound, where $T$ is the number of time steps. Our
algorithm is based on online mirror descent with a self-concordant barrier
regularization that employs a novel increasing learning rate schedule.
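As a rough, hypothetical illustration of the ingredients named above (online mirror descent with a self-concordant barrier regularizer, one-point bandit loss estimates, and a growing step size), here is a minimal SCRiBLe-style sketch for bandit linear optimization over the box $[-1,1]^d$ with a log-barrier. It is not the paper's DLB algorithm: the adversarial distortion step is omitted, and the geometric learning-rate schedule merely stands in for the authors' specific increasing schedule. All function names and constants are illustrative.

```python
import numpy as np

def barrier_grad(x):
    # Gradient of the log-barrier R(x) = -sum_i [log(1 - x_i) + log(1 + x_i)].
    return 1.0 / (1.0 - x) - 1.0 / (1.0 + x)

def barrier_grad_inv(z):
    # Invert barrier_grad coordinate-wise: solve 2x / (1 - x^2) = z with x in (-1, 1).
    out = np.zeros_like(z)
    nz = np.abs(z) > 1e-12
    out[nz] = (np.sqrt(1.0 + z[nz] ** 2) - 1.0) / z[nz]
    return out

def bandit_omd_log_barrier(loss_vectors, eta0=0.01, growth=1.0, seed=0):
    """SCRiBLe-style one-point bandit linear optimization over the box [-1, 1]^d.

    Illustrative sketch only: the log-barrier plays the role of the
    self-concordant regularizer, `growth` > 1 gives an increasing learning
    rate, and the adversarial distortion of DLB is omitted entirely.
    """
    T, d = loss_vectors.shape
    rng = np.random.default_rng(seed)
    x = np.zeros(d)                      # interior iterate of the box
    total_loss = 0.0
    for t in range(T):
        eta = eta0 * growth ** t         # (possibly increasing) learning rate
        hess = 1.0 / (1.0 - x) ** 2 + 1.0 / (1.0 + x) ** 2  # diagonal barrier Hessian
        i = rng.integers(d)              # sample a Dikin-ellipsoid axis ...
        s = rng.choice([-1.0, 1.0])      # ... and a direction along it
        y = x.copy()
        y[i] += s / np.sqrt(hess[i])     # play a point on the Dikin ellipsoid (stays feasible)
        loss = float(loss_vectors[t] @ y)    # bandit feedback: only this scalar is revealed
        total_loss += loss
        g = np.zeros(d)
        g[i] = d * loss * s * np.sqrt(hess[i])   # one-point estimate of the loss vector
        x = barrier_grad_inv(barrier_grad(x) - eta * g)   # mirror-descent step
    return total_loss

# Tiny usage example: a fixed adversary replaying the same small loss vector.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    c = rng.uniform(-0.2, 0.2, size=4)
    print(bandit_omd_log_barrier(np.tile(c, (300, 1)), eta0=0.01, growth=1.002))
```

Sampling on the Dikin ellipsoid of the barrier keeps the played point inside the feasible set, and scaling the observed scalar loss by the Hessian square root yields an (approximately) unbiased estimate of the unknown loss vector, which is all the mirror-descent update ever sees.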
Related papers
- In-Trajectory Inverse Reinforcement Learning: Learn Incrementally Before An Ongoing Trajectory Terminates [10.438810967483438]
Inverse reinforcement learning (IRL) aims to learn a reward function and a corresponding policy.
Current IRL works cannot learn incrementally from an ongoing trajectory because they have to wait to collect at least one complete trajectory to learn.
This paper considers the problem of learning a reward function and a corresponding policy while observing the initial state-action pair of an ongoing trajectory.
arXiv Detail & Related papers (2024-10-21T03:16:32Z)
- Distributed Online Bandit Nonconvex Optimization with One-Point Residual Feedback via Dynamic Regret [10.700891331004799]
This paper considers the distributed online bandit optimization problem with nonconvex loss functions over a time-varying digraph.
At each round, each player selects a decision, and then the adversary assigns an arbitrary nonconvex loss function to that player.
The expected regret of our algorithms is comparable to existing algorithms that use two-point deviation.
arXiv Detail & Related papers (2024-09-24T02:37:33Z)
- Online Reinforcement Learning in Markov Decision Process Using Linear Programming [1.0878040851638]
We consider online reinforcement learning in episodic Markov decision process (MDP) with unknown transition function and rewards drawn from some fixed but unknown distribution.
We devise a simple and efficient model-based algorithm that achieves $\widetilde{O}(LX\sqrt{TA})$ regret with high probability.
arXiv Detail & Related papers (2023-03-31T22:21:41Z)
- Variance-Dependent Regret Bounds for Linear Bandits and Reinforcement Learning: Adaptivity and Computational Efficiency [90.40062452292091]
We present the first computationally efficient algorithm for linear bandits with heteroscedastic noise.
Our algorithm is adaptive to the unknown variance of noise and achieves an $\tilde{O}(d\sqrt{\sum_{k=1}^{K} \sigma_k^2} + d)$ regret.
We also propose a variance-adaptive algorithm for linear mixture Markov decision processes (MDPs) in reinforcement learning.
arXiv Detail & Related papers (2023-02-21T00:17:24Z)
- Online Prediction in Sub-linear Space [15.773280101995676]
We provide the first sub-linear space and sub-linear regret algorithm for online learning with expert advice (against an oblivious adversary).
We also demonstrate a separation between oblivious and (strong) adaptive adversaries by proving a linear memory lower bound of any sub-linear regret algorithm against an adaptive adversary.
arXiv Detail & Related papers (2022-07-16T16:15:39Z)
- Efficient and Optimal Algorithms for Contextual Dueling Bandits under Realizability [59.81339109121384]
We study the $K$-armed contextual dueling bandit problem, a sequential decision making setting in which the learner uses contextual information to make two decisions, but only observes preference-based feedback suggesting that one decision was better than the other.
We provide a new algorithm that achieves the optimal regret rate for a new notion of best response regret, which is a strictly stronger performance measure than those considered in prior works.
arXiv Detail & Related papers (2021-11-24T07:14:57Z)
- Anti-Concentrated Confidence Bonuses for Scalable Exploration [57.91943847134011]
Intrinsic rewards play a central role in handling the exploration-exploitation trade-off.
We introduce anti-concentrated confidence bounds for efficiently approximating the elliptical bonus.
We develop a practical variant for deep reinforcement learning that is competitive with contemporary intrinsic rewards on Atari benchmarks.
arXiv Detail & Related papers (2021-10-21T15:25:15Z)
- Efficient Learning in Non-Stationary Linear Markov Decision Processes [17.296084954104415]
We study episodic reinforcement learning in non-stationary linear (a.k.a. low-rank) Markov Decision Processes (MDPs).
We propose OPT-WLSVI, an optimistic model-free algorithm based on weighted least-squares value iteration, which uses exponential weights to smoothly forget data that are far in the past (see the sketch of this forgetting mechanism after this list).
We show that our algorithm, when competing against the best policy at each time, achieves a regret that is upper bounded by $\widetilde{\mathcal{O}}(d^{5/4} H^2 \Delta^{1/4} K^{3/4})$, where $d$ is the feature dimension, $H$ the horizon, $K$ the number of episodes, and $\Delta$ a measure of the non-stationarity of the MDP.
arXiv Detail & Related papers (2020-10-24T11:02:45Z)
- Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z)
- Upper Confidence Primal-Dual Reinforcement Learning for CMDP with Adversarial Loss [145.54544979467872]
We consider online learning for episodic constrained Markov decision processes (CMDPs).
We propose a new upper confidence primal-dual algorithm, which only requires the trajectories sampled from the transition model.
Our analysis incorporates a new high-probability drift analysis of Lagrange multiplier processes into the celebrated regret analysis of upper confidence reinforcement learning.
arXiv Detail & Related papers (2020-03-02T05:02:23Z)
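The non-stationary linear MDP entry above attributes its forgetting to exponential weighting of past data. Below is a minimal, hypothetical sketch of that mechanism written as a discounted ridge regression; the function and parameter names are ours, not from OPT-WLSVI.

```python
import numpy as np

def discounted_ridge(features, targets, gamma=0.99, lam=1.0):
    """Exponentially weighted (discounted) ridge regression.

    A sample that is `age` steps old gets weight gamma**age, so stale data
    is smoothly forgotten. Illustrative sketch only, not the OPT-WLSVI update.
    """
    T, d = features.shape
    A = lam * np.eye(d)          # regularized design matrix
    b = np.zeros(d)
    for s in range(T):
        w = gamma ** (T - 1 - s)                     # older samples weigh less
        A += w * np.outer(features[s], features[s])
        b += w * targets[s] * features[s]
    return np.linalg.solve(A, b)                     # weighted least-squares fit
```

In a value-iteration context the targets would be backed-up value estimates; here they are plain regression targets, which is enough to show how the discount factor trades off adaptivity to non-stationarity against statistical efficiency.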
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.