Strategy Complexity of Mean Payoff, Total Payoff and Point Payoff
Objectives in Countable MDPs
- URL: http://arxiv.org/abs/2107.03287v1
- Date: Thu, 1 Jul 2021 21:13:39 GMT
- Title: Strategy Complexity of Mean Payoff, Total Payoff and Point Payoff
Objectives in Countable MDPs
- Authors: Richard Mayr, Eric Munday
- Abstract summary: We study countably infinite Markov decision processes (MDPs) with real-valued transition rewards.
For each payoff type, the objective is to maximize the probability that the $\liminf$ is non-negative.
Some cases can be won with memoryless deterministic strategies, while others require a step counter, a reward counter, or both.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study countably infinite Markov decision processes (MDPs) with real-valued
transition rewards. Every infinite run induces the following sequences of
payoffs: 1. Point payoff (the sequence of directly seen transition rewards), 2.
Total payoff (the sequence of the sums of all rewards so far), and 3. Mean
payoff. For each payoff type, the objective is to maximize the probability that
the $\liminf$ is non-negative. We establish the complete picture of the
strategy complexity of these objectives, i.e., how much memory is necessary and
sufficient for $\varepsilon$-optimal (resp. optimal) strategies. Some cases can
be won with memoryless deterministic strategies, while others require a step
counter, a reward counter, or both.
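As a small illustrative sketch (not taken from the paper), the following Python snippet computes finite prefixes of the three payoff sequences from a prefix of transition rewards, reading the mean payoff as the running average of the rewards seen so far; the objective itself concerns the $\liminf$ over an infinite run, which no finite prefix can decide.

```python
from itertools import accumulate

def payoff_sequences(rewards):
    """Given a finite prefix of transition rewards r_1, ..., r_n, return the
    matching prefixes of the three payoff sequences:
      point payoff:  r_k                  (the k-th reward itself)
      total payoff:  r_1 + ... + r_k      (partial sums)
      mean payoff:   (r_1 + ... + r_k)/k  (running averages)
    """
    point = list(rewards)
    total = list(accumulate(point))
    mean = [s / k for k, s in enumerate(total, start=1)]
    return point, total, mean

if __name__ == "__main__":
    rewards = [1, -1, 2, -1, 0, 1, -1, 1]        # example reward prefix
    point, total, mean = payoff_sequences(rewards)
    print("point:", point)                        # [1, -1, 2, -1, 0, 1, -1, 1]
    print("total:", total)                        # [1, 0, 2, 1, 1, 2, 1, 2]
    print("mean: ", [round(m, 2) for m in mean])
```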
Related papers
- Reinforcement Learning from Adversarial Preferences in Tabular MDPs [62.73758165845971]
We introduce a new framework of episodic Markov decision processes (MDPs) with adversarial preferences.
Unlike standard episodic MDPs with adversarial losses, in preference-based MDPs (PbMDPs) the learner instead observes preferences between two candidate arms.
We develop algorithms that achieve a regret bound of order $T^{2/3}$ under known transitions.
arXiv Detail & Related papers (2025-07-15T20:19:32Z) - Taming Infinity one Chunk at a Time: Concisely Represented Strategies in One-Counter MDPs [2.7262923206583136]
We study a class of infinite MDPs: one-counter MDPs (OC-MDPs).
We consider two characteristic objectives: reaching a target state (state-reachability) and reaching a target state with counter value zero.
We introduce two natural classes of concisely represented strategies based on a (possibly infinite) partition of counter values in intervals.
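As a minimal sketch of what such a concisely represented strategy could look like (the class and field names are hypothetical, not from the paper): the counter values are split into intervals by a finite list of cut points, and the chosen action depends only on the current state and on the interval containing the current counter value.

```python
from bisect import bisect_right

class IntervalStrategy:
    """Counter values 0, 1, 2, ... are partitioned by `cutpoints`
    [c_1 < c_2 < ... < c_k] into intervals [0, c_1), [c_1, c_2), ..., [c_k, oo).
    `table[state][i]` is the action played when the counter lies in interval i."""

    def __init__(self, cutpoints, table):
        self.cutpoints = cutpoints
        self.table = table

    def action(self, state, counter):
        i = bisect_right(self.cutpoints, counter)   # index of the interval holding `counter`
        return self.table[state][i]

# Example: in state "s", play "safe" while the counter is below 10 and "risky" from 10 on.
sigma = IntervalStrategy(cutpoints=[10], table={"s": ["safe", "risky"]})
assert sigma.action("s", 3) == "safe"
assert sigma.action("s", 42) == "risky"
```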
arXiv Detail & Related papers (2025-03-02T08:32:17Z) - Mixing Any Cocktail with Limited Ingredients: On the Structure of Payoff Sets in Multi-Objective MDPs and its Impact on Randomised Strategies [0.0]
We study the structure of the set of expected payoff vectors of all strategies given a multi-dimensional payoff function.
For any payoff function whose expected payoff is finite under every strategy, any expected payoff vector in this set can be obtained exactly by mixing finitely many strategies.
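As a toy illustration of this mixing statement (the numbers are made up, not taken from the paper): with a two-dimensional payoff function, if strategy $\sigma_1$ has expected payoff vector $(2,0)$ and strategy $\sigma_2$ has expected payoff vector $(0,2)$, then the strategy that initially randomises, following $\sigma_1$ with probability $3/4$ and $\sigma_2$ with probability $1/4$, has expected payoff $\frac{3}{4}(2,0) + \frac{1}{4}(0,2) = (\frac{3}{2}, \frac{1}{2})$, so every point on the segment between the two vectors is reached by mixing just these two strategies.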
arXiv Detail & Related papers (2025-02-25T15:33:59Z) - Minimax-Optimal Multi-Agent RL in Zero-Sum Markov Games With a
Generative Model [50.38446482252857]
Two-player zero-sum Markov games are arguably the most basic setting in multi-agent reinforcement learning.
We develop a learning algorithm that learns an $\varepsilon$-approximate Markov NE policy with a sample complexity that is minimax-optimal up to logarithmic factors.
We derive a refined regret bound for FTRL that makes explicit the role of variance-type quantities.
arXiv Detail & Related papers (2022-08-22T17:24:55Z) - Provably Efficient Offline Reinforcement Learning with Trajectory-Wise
Reward [66.81579829897392]
We propose a novel offline reinforcement learning algorithm called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED).
PARTED decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy rewards.
To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient in general MDP with trajectory-wise reward.
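A rough sketch of the least-squares reward-redistribution step (assumed details and names, not the authors' implementation): per-step proxy rewards are modelled as a linear function of per-step features and fit so that, along each trajectory, they sum to the observed trajectory return.

```python
import numpy as np

def redistribute_rewards(trajectories, returns, ridge=1e-3):
    """trajectories: list of (H, d) arrays of per-step features phi(s_h, a_h).
    returns: one scalar return per trajectory.
    Fits theta by ridge regression so that sum_h phi(s_h, a_h) @ theta
    approximates each trajectory's return, then yields the per-step
    proxy rewards phi(s_h, a_h) @ theta."""
    X = np.stack([traj.sum(axis=0) for traj in trajectories])  # (N, d): per-trajectory summed features
    y = np.asarray(returns, dtype=float)                        # (N,): trajectory returns
    d = X.shape[1]
    theta = np.linalg.solve(X.T @ X + ridge * np.eye(d), X.T @ y)
    return [traj @ theta for traj in trajectories], theta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    trajs = [rng.normal(size=(5, 3)) for _ in range(20)]        # 20 trajectories, horizon 5
    true_theta = np.array([1.0, -0.5, 2.0])
    rets = [traj.sum(axis=0) @ true_theta for traj in trajs]    # noiseless returns
    proxies, theta_hat = redistribute_rewards(trajs, rets)
    print("recovered theta:", np.round(theta_hat, 3))           # close to true_theta
```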
arXiv Detail & Related papers (2022-06-13T19:11:22Z) - Provably Efficient Offline Multi-agent Reinforcement Learning via
Strategy-wise Bonus [48.34563955829649]
We propose a strategy-wise concentration principle which builds a confidence interval for the joint strategy.
For two-player zero-sum Markov games, by exploiting the convexity of the strategy-wise bonus, we propose an efficient algorithm.
All of our algorithms can naturally take a pre-specified strategy class $\Pi$ as input and output a strategy close to the best strategy in $\Pi$.
arXiv Detail & Related papers (2022-06-01T00:18:15Z) - Strategy Complexity of Point Payoff, Mean Payoff and Total Payoff
Objectives in Countable MDPs [0.0]
We study countably infinite Markov decision processes (MDPs) with real-valued transition rewards.
For each payoff type, the objective is to maximize the probability that the $\liminf$ is non-negative.
Some cases can be won with memoryless deterministic strategies, while others require a step counter, a reward counter, or both.
arXiv Detail & Related papers (2022-03-10T19:17:05Z) - Under-Approximating Expected Total Rewards in POMDPs [68.8204255655161]
We consider the optimal expected total reward to reach a goal state in a partially observable Markov decision process (POMDP).
We use mixed-integer linear programming (MILP) to find such minimal probability shifts and experimentally show that our techniques scale quite well.
arXiv Detail & Related papers (2022-01-21T16:43:03Z) - A Self-Penalizing Objective Function for Scalable Interaction Detection [2.208242292882514]
We tackle the problem of nonparametric variable selection with a focus on discovering interactions between variables.
The trick is to maximize parametrized nonparametric dependence measures which we call metric learning objectives.
arXiv Detail & Related papers (2020-11-24T17:07:49Z) - Finite Continuum-Armed Bandits [0.0]
We consider a situation where an agent has $T$ resources to be allocated to a larger number of actions.
The goal of the agent is to maximize her cumulative reward.
arXiv Detail & Related papers (2020-10-23T08:48:45Z) - Nearly Minimax Optimal Reward-free Reinforcement Learning [88.75843804630772]
We study the reward-free reinforcement learning framework, which is particularly suitable for batch reinforcement learning and scenarios where one needs policies for multiple reward functions.
We give a new efficient algorithm, Staged Sampling + Truncated Planning, which interacts with the environment in at most $O\left(\frac{S^2A}{\epsilon^2}\,\mathrm{polylog}\left(\frac{SAH}{\epsilon^2}\right)\right)$ episodes of exploration.
arXiv Detail & Related papers (2020-10-12T17:51:19Z) - Reward-Free Exploration for Reinforcement Learning [82.3300753751066]
We propose a new "reward-free RL" framework to isolate the challenges of exploration.
We give an efficient algorithm that conducts $\tilde{\mathcal{O}}(S^2A\,\mathrm{poly}(H)/\epsilon^2)$ episodes of exploration.
We also give a nearly-matching $\Omega(S^2AH^2/\epsilon^2)$ lower bound, demonstrating the near-optimality of our algorithm in this setting.
arXiv Detail & Related papers (2020-02-07T14:03:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.