Strategy Complexity of Point Payoff, Mean Payoff and Total Payoff
Objectives in Countable MDPs
- URL: http://arxiv.org/abs/2203.07079v1
- Date: Thu, 10 Mar 2022 19:17:05 GMT
- Title: Strategy Complexity of Point Payoff, Mean Payoff and Total Payoff
Objectives in Countable MDPs
- Authors: Richard Mayr and Eric Munday
- Abstract summary: We study countably infinite Markov decision processes (MDPs) with real-valued transition rewards.
For each payoff type, the objective is to maximize the probability that the $\liminf$ is non-negative.
Some cases can be won with memoryless deterministic strategies, while others require a step counter, a reward counter, or both.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study countably infinite Markov decision processes (MDPs) with real-valued
transition rewards. Every infinite run induces the following sequences of
payoffs: 1. Point payoff (the sequence of directly seen transition rewards), 2.
Mean payoff (the sequence of the sums of all rewards so far, divided by the
number of steps), and 3. Total payoff (the sequence of the sums of all rewards
so far). For each payoff type, the objective is to maximize the probability
that the $\liminf$ is non-negative. We establish the complete picture of the
strategy complexity of these objectives, i.e., how much memory is necessary and
sufficient for $\varepsilon$-optimal (resp. optimal) strategies. Some cases can
be won with memoryless deterministic strategies, while others require a step
counter, a reward counter, or both.
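As a minimal illustration (not taken from the paper), the sketch below computes the point, total and mean payoff prefixes for a finite run of transition rewards; the objective then asks whether the $\liminf$ of the chosen sequence is non-negative.

```python
from itertools import accumulate

def payoff_sequences(rewards):
    """Point, total and mean payoff prefixes for a finite run prefix of rewards."""
    point = list(rewards)                               # r_n (directly seen rewards)
    total = list(accumulate(point))                     # sum of rewards up to step n
    mean = [s / (n + 1) for n, s in enumerate(total)]   # running sum divided by step count
    return point, total, mean

# Example: alternating rewards -1, +2; all three sequences stay bounded below here.
point, total, mean = payoff_sequences([-1, 2] * 4)
print(point)  # [-1, 2, -1, 2, -1, 2, -1, 2]
print(total)  # [-1, 1, 0, 2, 1, 3, 2, 4]
print(mean)   # [-1.0, 0.5, 0.0, 0.5, 0.2, 0.5, ...]
```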
Related papers
- Reinforcement Learning from Adversarial Preferences in Tabular MDPs [62.73758165845971]
We introduce a new framework of episodic Markov decision processes (MDPs) with adversarial preferences. Unlike standard episodic MDPs with adversarial losses, in PbMDPs the learner instead observes preferences between two candidate arms. We develop algorithms that achieve a regret bound of order $T^{2/3}$ under known transitions.
arXiv Detail & Related papers (2025-07-15T20:19:32Z) - Reasoning without Regret [4.07926531936425]
We introduce Backwards Adaptive Reward Shaping (BARS), a no-regret framework that converts sparse outcome-based rewards into effective procedure-based signals.
Our analysis, based on generic chaining, continuous scaling limits, and non-linear Feynman-Kac bounds, connects recent outcome-based methods' empirical successes with the benefits of intermediate supervision.
arXiv Detail & Related papers (2025-04-14T00:34:20Z) - Better Rates for Random Task Orderings in Continual Linear Models [50.11453013647086]
We analyze the forgetting, i.e., loss on previously seen tasks, after $k$ iterations.
We develop novel last-iterate bounds in the realizable least squares setup, and apply them to derive new results for continual learning.
We prove for the first time that randomization alone, with no task repetition, can prevent catastrophic forgetting in sufficiently long task sequences.
arXiv Detail & Related papers (2025-04-06T18:39:45Z) - Taming Infinity one Chunk at a Time: Concisely Represented Strategies in One-Counter MDPs [2.7262923206583136]
We study a class of infinite MDPs: one-counter MDPs (OC-MDPs).
We consider two characteristic objectives: reaching a target state (state-reachability) and reaching a target state with counter value zero.
We introduce two natural classes of concisely represented strategies based on a (possibly infinite) partition of counter values into intervals.
arXiv Detail & Related papers (2025-03-02T08:32:17Z) - Mixing Any Cocktail with Limited Ingredients: On the Structure of Payoff Sets in Multi-Objective MDPs and its Impact on Randomised Strategies [0.0]
We study the structure of the set of expected payoff vectors of all strategies given a multi-dimensional payoff function.
For any payoff function whose expectation is finite under all strategies, every achievable expected payoff vector can be obtained exactly by mixing finitely many strategies.
arXiv Detail & Related papers (2025-02-25T15:33:59Z) - Near-optimal Regret Using Policy Optimization in Online MDPs with Aggregate Bandit Feedback [49.84060509296641]
We study online finite-horizon Markov Decision Processes with adversarially changing loss and aggregate bandit feedback (a.k.a. full-bandit). Under this type of feedback, the agent observes only the total loss incurred over the entire trajectory, rather than the individual losses at each intermediate step within the trajectory. We introduce the first Policy Optimization algorithms for this setting.
arXiv Detail & Related papers (2025-02-06T12:03:24Z) - Provably Efficient Offline Reinforcement Learning with Trajectory-Wise
Reward [66.81579829897392]
We propose a novel offline reinforcement learning algorithm called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED).
PARTED decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy reward.
To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient in general MDPs with trajectory-wise reward.
arXiv Detail & Related papers (2022-06-13T19:11:22Z) - Provably Efficient Offline Multi-agent Reinforcement Learning via
Strategy-wise Bonus [48.34563955829649]
We propose a strategy-wise concentration principle which builds a confidence interval for the joint strategy.
For two-player zero-sum Markov games, by exploiting the convexity of the strategy-wise bonus, we propose an efficient algorithm.
All of our algorithms can naturally take a pre-specified strategy class $\Pi$ as input and output a strategy close to the best strategy in $\Pi$.
arXiv Detail & Related papers (2022-06-01T00:18:15Z) - Under-Approximating Expected Total Rewards in POMDPs [68.8204255655161]
We consider the optimal expected total reward to reach a goal state in a partially observable Markov decision process (POMDP).
We use mixed-integer linear programming (MILP) to find such minimal probability shifts and experimentally show that our techniques scale quite well.
arXiv Detail & Related papers (2022-01-21T16:43:03Z) - Strategy Complexity of Mean Payoff, Total Payoff and Point Payoff
Objectives in Countable MDPs [0.0]
We study countably infinite Markov decision processes (MDPs) with real-valued transition rewards.
For each payoff type, the objective is to maximize the probability that the $\liminf$ is non-negative.
Some cases can be won with memoryless deterministic strategies, while others require a step counter, a reward counter, or both.
arXiv Detail & Related papers (2021-07-01T21:13:39Z) - Finite Continuum-Armed Bandits [0.0]
We consider a situation where an agent has $T$ resources to be allocated to a larger number of actions.
The goal of the agent is to maximize her cumulative reward.
arXiv Detail & Related papers (2020-10-23T08:48:45Z) - Nearly Minimax Optimal Reward-free Reinforcement Learning [88.75843804630772]
We study the reward-free reinforcement learning framework, which is particularly suitable for batch reinforcement learning and scenarios where one needs policies for multiple reward functions.
We give a new efficient algorithm, Staged Sampling + Truncated Planning, which interacts with the environment for at most $O\left(\frac{S^2 A}{\epsilon^2}\,\mathrm{polylog}\left(\frac{SAH}{\epsilon^2}\right)\right)$ episodes.
arXiv Detail & Related papers (2020-10-12T17:51:19Z) - On Reward-Free Reinforcement Learning with Linear Function Approximation [144.4210285338698]
Reward-free reinforcement learning (RL) is a framework which is suitable for both the batch RL setting and the setting where there are many reward functions of interest.
In this work, we give both positive and negative results for reward-free RL with linear function approximation.
arXiv Detail & Related papers (2020-06-19T17:59:36Z) - Reward-Free Exploration for Reinforcement Learning [82.3300753751066]
We propose a new "reward-free RL" framework to isolate the challenges of exploration.
We give an efficient algorithm that conducts $\tilde{\mathcal{O}}(S^2 A\,\mathrm{poly}(H)/\epsilon^2)$ episodes of exploration.
We also give a nearly-matching $\Omega(S^2 A H^2/\epsilon^2)$ lower bound, demonstrating the near-optimality of our algorithm in this setting.
arXiv Detail & Related papers (2020-02-07T14:03:38Z)