Reinforcement Learning with Delayed, Composite, and Partially Anonymous Reward
- URL: http://arxiv.org/abs/2305.02527v2
- Date: Mon, 28 Aug 2023 15:52:36 GMT
- Title: Reinforcement Learning with Delayed, Composite, and Partially Anonymous Reward
- Authors: Washim Uddin Mondal and Vaneet Aggarwal
- Abstract summary: We investigate an infinite-horizon average reward Markov Decision Process (MDP) with delayed, composite, and partially anonymous reward feedback.
The delay and compositeness of rewards mean that rewards generated as a result of taking an action at a given state are fragmented into different components and realized sequentially at delayed time instances.
- Score: 41.61653528766776
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We investigate an infinite-horizon average reward Markov Decision Process
(MDP) with delayed, composite, and partially anonymous reward feedback. The
delay and compositeness of rewards mean that rewards generated as a result of
taking an action at a given state are fragmented into different components, and
they are sequentially realized at delayed time instances. The partial anonymity
attribute implies that a learner, for each state, only observes the aggregate
of past reward components generated as a result of different actions taken at
that state, but realized at the observation instance. We propose an algorithm
named $\mathrm{DUCRL2}$ to obtain a near-optimal policy for this setting and
show that it achieves a regret bound of $\tilde{\mathcal{O}}\left(DS\sqrt{AT} +
d (SA)^3\right)$ where $S$ and $A$ are the sizes of the state and action
spaces, respectively, $D$ is the diameter of the MDP, $d$ is a parameter upper
bounded by the maximum reward delay, and $T$ denotes the time horizon. This
demonstrates that the bound is order-optimal in $T$ and that the delay has
only an additive impact.
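
To make the feedback model concrete, here is a minimal simulation sketch (illustrative only, not the authors' code): each action's reward is fragmented into components, each component is realized after a bounded delay, and the learner observes only the per-state aggregate realized at each step. The sizes, the Dirichlet splitting rule, and the uniform delay law are all assumptions for the toy.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, T, d_max = 4, 3, 200, 5   # toy sizes; d_max caps the reward delay

# pending[t, s] holds reward mass that will be *realized* at time t at state s.
# Partial anonymity: components are indexed by state only, so the learner
# cannot tell which past action (or time step) generated them.
pending = np.zeros((T + d_max + 1, S))

def emit_composite_reward(t, s, a):
    """Fragment the reward of taking action a in state s at time t into
    components, each realized after its own delay <= d_max (hypothetical law)."""
    total = rng.random()                      # stand-in for the true reward draw
    weights = rng.dirichlet(np.ones(3))       # composite: 3 random fragments
    for w in weights:
        delay = rng.integers(0, d_max + 1)
        pending[t + delay, s] += w * total    # aggregated per state: anonymous

state = 0
for t in range(T):
    action = rng.integers(A)                  # placeholder for DUCRL2's policy
    emit_composite_reward(t, state, action)
    observed = pending[t, :].copy()           # what the learner actually sees
    state = rng.integers(S)                   # placeholder transition kernel
```

The key point of the sketch is that `observed` mixes fragments from many earlier (state, action) pairs at the same state, which is exactly the confound the $\mathrm{DUCRL2}$ confidence sets must absorb.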
Related papers
- Minimax Optimal Strategy for Delayed Observations in Online Reinforcement Learning [8.140056861479176]
We study reinforcement learning with delayed state observation, where the agent observes the current state after some random number of time steps. We propose an algorithm that combines the augmentation method and the upper confidence bound approach.
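The augmentation idea behind such algorithms can be sketched as follows (a simplification assuming a fixed, known delay $k$ and a generic `env.step` interface; the paper treats random delays and adds upper-confidence exploration on top):

```python
from collections import deque

class DelayedObservationWrapper:
    """Augmentation sketch for a fixed observation delay k: the agent acts on
    the k-step-old observation plus the k actions taken since, a tuple that
    restores the Markov property of the decision problem."""
    def __init__(self, env, k, initial_obs):
        self.env, self.k = env, k
        self.obs_queue = deque([initial_obs] * (k + 1), maxlen=k + 1)
        self.actions_since = deque(maxlen=k)

    def step(self, action):
        obs, reward, done = self.env.step(action)   # true state, not yet visible
        self.obs_queue.append(obs)                  # leftmost entry is k steps old
        self.actions_since.append(action)
        aug_state = (self.obs_queue[0], tuple(self.actions_since))
        return aug_state, reward, done
```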
arXiv Detail & Related papers (2026-03-03T19:52:24Z)
- Reinforcement Learning from Adversarial Preferences in Tabular MDPs [62.73758165845971]
We introduce a new framework of episodic Markov decision processes (MDPs) with adversarial preferences (PbMDPs). Unlike standard episodic MDPs with adversarial losses, in PbMDPs the learner instead observes preferences between two candidate arms. We develop algorithms that achieve a regret bound of order $T^{2/3}$ under known transitions.
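In code, the preference feedback protocol looks roughly like this (a toy sketch; the Bradley-Terry-style logistic link and the hidden values are assumptions, and the paper's adversarial preferences need not follow any such model):

```python
import numpy as np

rng = np.random.default_rng(1)

def preference_feedback(value_a, value_b, temperature=1.0):
    """Return 1 if arm A is preferred to arm B, else 0.
    Hypothetical logistic comparison model; the learner never sees the values."""
    p_a = 1.0 / (1.0 + np.exp(-(value_a - value_b) / temperature))
    return int(rng.random() < p_a)

# The learner proposes two candidate arms and observes only the comparison.
feedback = preference_feedback(value_a=0.7, value_b=0.4)
```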
arXiv Detail & Related papers (2025-07-15T20:19:32Z)
- Near-optimal Regret Using Policy Optimization in Online MDPs with Aggregate Bandit Feedback [49.84060509296641]
We study online finite-horizon Markov Decision Processes with adversarially changing loss and aggregate bandit feedback (a.k.a. full-bandit).
Under this type of feedback, the agent observes only the total loss incurred over the entire trajectory, rather than the individual losses at each intermediate step within the trajectory.
We introduce the first Policy Optimization algorithms for this setting.
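A minimal sketch of what aggregate bandit feedback means for a rollout (the `env` interface and `policy` signature are assumptions):

```python
def run_episode_aggregate_feedback(env, policy, horizon):
    """Roll out one episode but expose only the trajectory's total loss:
    per-step losses stay hidden, which is the aggregate-bandit-feedback model."""
    state = env.reset()
    total_loss = 0.0
    for h in range(horizon):
        action = policy(state, h)
        state, loss = env.step(action)   # assumed interface: (next_state, loss)
        total_loss += loss               # accumulated internally, never revealed
    return total_loss                    # the only signal the learner receives
```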
arXiv Detail & Related papers (2025-02-06T12:03:24Z)
- Communication-Constrained Bandits under Additive Gaussian Noise [111.06688156723018]
We study a distributed multi-armed bandit where a client supplies the learner with communication-constrained feedback.
We propose a multi-phase bandit algorithm, $\mathtt{UE\text{-}UCB++}$, which matches the corresponding lower bound up to a minor additive factor.
arXiv Detail & Related papers (2023-04-25T09:31:20Z)
- Reinforcement Learning in a Birth and Death Process: Breaking the Dependence on the State Space [0.0]
We revisit the regret of undiscounted reinforcement learning in MDPs with a birth and death structure.
In our main result, we show that the regret of a slightly-tweaked version of the classical learning algorithm $\mathrm{UCRL2}$ is in fact upper bounded by $\tilde{\mathcal{O}}(\sqrt{EAT})$, where $E$ is related to the weighted second moment of the stationary measure of a reference policy.
arXiv Detail & Related papers (2023-02-21T13:28:37Z)
- Reward-Mixing MDPs with a Few Latent Contexts are Learnable [75.17357040707347]
We consider episodic reinforcement learning in reward-mixing Markov decision processes (RMMDPs).
Our goal is to learn a near-optimal policy that nearly maximizes the $H$ time-step cumulative rewards in such a model.
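The generative model is easy to state in code (an illustrative sketch: a latent context drawn once per episode selects one of several reward functions, while the dynamics are shared; all interfaces are assumed):

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_rmmdp_episode(env, policy, reward_fns, mixing_weights, horizon):
    """Reward-mixing MDP sketch: one latent context per episode selects the
    reward function; the learner observes rewards but never the context."""
    context = rng.choice(len(reward_fns), p=mixing_weights)  # latent, hidden
    reward = reward_fns[context]
    state, total = env.reset(), 0.0
    for h in range(horizon):
        action = policy(state, h)
        next_state = env.step(action)        # assumed shared dynamics
        total += reward(state, action)       # context-dependent reward
        state = next_state
    return total                             # H time-step cumulative reward
```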
arXiv Detail & Related papers (2022-10-05T22:52:00Z)
- Provably Efficient Offline Reinforcement Learning with Trajectory-Wise Reward [66.81579829897392]
We propose a novel offline reinforcement learning algorithm called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED).
PARTED decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy rewards.
To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient in general MDPs with trajectory-wise reward.
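The redistribution step can be sketched as a ridge regression that forces per-step proxy rewards to sum to each observed trajectory return (an illustrative linear version; the feature map, regularizer, and function names are assumptions, and PARTED's pessimism bonus is omitted):

```python
import numpy as np

def redistribute_returns(trajectories, feature_map, dim, lam=1.0):
    """Least-squares reward redistribution sketch: fit theta so that the sum of
    per-step proxy rewards <phi(s_h, a_h), theta> matches each trajectory's
    observed total return, then score individual steps with theta."""
    A = lam * np.eye(dim)                     # ridge regularizer
    b = np.zeros(dim)
    for steps, total_return in trajectories:  # steps: list of (state, action)
        x = sum(feature_map(s, a) for s, a in steps)  # summed step features
        A += np.outer(x, x)
        b += total_return * x
    theta = np.linalg.solve(A, b)
    # Proxy reward for any step: its feature inner product with theta.
    return lambda s, a: feature_map(s, a) @ theta
```

A pessimistic variant would additionally subtract an uncertainty bonus built from the inverse of the regularized Gram matrix `A`; that part is omitted here.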
arXiv Detail & Related papers (2022-06-13T19:11:22Z)
- Fine-Grained Gap-Dependent Bounds for Tabular MDPs via Adaptive Multi-Step Bootstrap [84.66885506098724]
This paper presents a new model-free algorithm for episodic finite-horizon Markov Decision Processes (MDPs), Adaptive Multi-step Bootstrap (AMB).
We show AMB achieves a gap-dependent regret bound that only scales with the sum of the inverse of the sub-optimality gaps.
We also show AMB suffers an additional $\frac{|Z_{mul}|}{\Delta_{min}}$ regret, where $Z_{mul}$ is the set of state-action pairs $(s,a)$ such that $a$ is a non-unique optimal action for $s$.
arXiv Detail & Related papers (2021-02-09T07:46:34Z)
- Nearly Minimax Optimal Reward-free Reinforcement Learning [88.75843804630772]
We study the reward-free reinforcement learning framework, which is particularly suitable for batch reinforcement learning and scenarios where one needs policies for multiple reward functions.
We give a new efficient algorithm, Staged Sampling + Truncated Planning (SSTP), which interacts with the environment for at most $O\left(\frac{S^2A}{\epsilon^2}\,\text{polylog}\left(\frac{SAH}{\epsilon^2}\right)\right)$ episodes.
arXiv Detail & Related papers (2020-10-12T17:51:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.