Reward is enough for convex MDPs
- URL: http://arxiv.org/abs/2106.00661v4
- Date: Fri, 2 Jun 2023 12:04:47 GMT
- Title: Reward is enough for convex MDPs
- Authors: Tom Zahavy, Brendan O'Donoghue, Guillaume Desjardins and Satinder
Singh
- Abstract summary: We study convex MDPs in which goals are expressed as convex functions of the stationary distribution.
We propose a meta-algorithm for solving this problem and show that it unifies many existing algorithms in the literature.
- Score: 30.478950691312715
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Maximising a cumulative reward function that is Markov and stationary, i.e.,
defined over state-action pairs and independent of time, is sufficient to
capture many kinds of goals in a Markov decision process (MDP). However, not
all goals can be captured in this manner. In this paper we study convex MDPs in
which goals are expressed as convex functions of the stationary distribution
and show that they cannot be formulated using stationary reward functions.
Convex MDPs generalize the standard reinforcement learning (RL) problem
formulation to a larger framework that includes many supervised and
unsupervised RL problems, such as apprenticeship learning, constrained MDPs,
and so-called `pure exploration'. Our approach is to reformulate the convex MDP
problem as a min-max game involving policy and cost (negative reward)
`players', using Fenchel duality. We propose a meta-algorithm for solving this
problem and show that it unifies many existing algorithms in the literature.
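To make the min-max formulation concrete, below is a minimal tabular sketch of the policy-vs-cost alternation described in the abstract, specialised to a pure-exploration objective (maximising the entropy of the discounted state-action occupancy). The toy MDP, the function names, and the choice of best-response players on both sides are illustrative assumptions rather than the paper's reference implementation; the point is only to show a cost player that differentiates the convex objective and a policy player that solves a standard RL problem against the negated cost.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, iters = 5, 3, 0.9, 200

# Toy MDP with known random dynamics P[s, a, s'] and a uniform start distribution.
P = rng.dirichlet(np.ones(S), size=(S, A))
mu0 = np.ones(S) / S

def occupancy(pi):
    """Discounted state-action occupancy d(s, a) of a stationary policy pi[s, a]."""
    P_pi = np.einsum('sab,sa->sb', P, pi)                  # state transitions under pi
    d_s = np.linalg.solve(np.eye(S) - gamma * P_pi.T, (1 - gamma) * mu0)
    return d_s[:, None] * pi                               # d(s, a) = d(s) * pi(a | s)

def best_response_policy(r):
    """Policy player: greedy policy for reward r(s, a), computed by value iteration."""
    q = np.zeros((S, A))
    for _ in range(500):
        q = r + gamma * np.einsum('sab,b->sa', P, q.max(axis=1))
    pi = np.zeros((S, A))
    pi[np.arange(S), q.argmax(axis=1)] = 1.0
    return pi

# Convex objective f(d) = sum_{s,a} d log d (negative entropy, i.e. pure exploration).
d_avg = occupancy(np.ones((S, A)) / A)                     # start from the uniform policy
for k in range(1, iters + 1):
    cost = np.log(d_avg + 1e-12) + 1.0                     # cost player: gradient of f at d_avg
    pi_k = best_response_policy(-cost)                     # policy player: standard RL on reward -cost
    d_avg += (occupancy(pi_k) - d_avg) / (k + 1)           # average occupancy of the policy mixture

print("entropy of the average occupancy:", float(-(d_avg * np.log(d_avg)).sum()))
```

With best responses on both sides this reduces to a Frank-Wolfe-style scheme, in the spirit of the existing algorithms the abstract says the meta-algorithm unifies.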
Related papers
- Local Linearity: the Key for No-regret Reinforcement Learning in Continuous MDPs [56.237917407785545]
Existing solutions either work under very specific assumptions or achieve bounds that are vacuous in some regimes.
Many structural assumptions are known to suffer from a provably unavoidable exponential dependence on the time horizon $H$ in the regret.
We show how any mildly smooth MDP can be represented as a Locally Linearizable MDP by an appropriate choice of representation.
arXiv Detail & Related papers (2024-10-31T16:07:22Z)
- Tackling Decision Processes with Non-Cumulative Objectives using Reinforcement Learning [0.0]
We introduce a general mapping of non-cumulative Markov decision processes (NCMDPs) to standard MDPs.
This allows all techniques developed to find optimal policies for MDPs to be directly applied to the larger class of NCMDPs (a toy illustration of such a mapping is sketched after this list).
We show applications in a diverse set of tasks, including classical control, portfolio optimization in finance, and discrete optimization problems.
arXiv Detail & Related papers (2024-05-22T13:01:37Z)
- On Practical Robust Reinforcement Learning: Practical Uncertainty Set and Double-Agent Algorithm [11.748284119769039]
Robust reinforcement learning (RRL) aims at seeking a robust policy to optimize the worst-case performance over an uncertainty set of Markov decision processes (MDPs).
arXiv Detail & Related papers (2023-05-11T08:52:09Z)
- Under-Approximating Expected Total Rewards in POMDPs [68.8204255655161]
We consider the optimal expected total reward to reach a goal state in a partially observable Markov decision process (POMDP).
We use mixed-integer linear programming (MILP) to find such minimal probability shifts and experimentally show that our techniques scale quite well.
arXiv Detail & Related papers (2022-01-21T16:43:03Z)
- Sample Efficient Reinforcement Learning In Continuous State Spaces: A Perspective Beyond Linearity [50.38337893712897]
We introduce the Effective Planning Window (EPW) condition, a structural condition on MDPs that makes no linearity assumptions.
We demonstrate that the EPW condition permits sample efficient RL, by providing an algorithm which provably solves MDPs satisfying this condition.
We additionally show the necessity of conditions like EPW, by demonstrating that simple MDPs with slight nonlinearities cannot be solved sample efficiently.
arXiv Detail & Related papers (2021-06-15T00:06:59Z)
- Parallel Stochastic Mirror Descent for MDPs [72.75921150912556]
We consider the problem of learning the optimal policy for infinite-horizon Markov decision processes (MDPs).
A variant of Mirror Descent is proposed for convex programming problems with Lipschitz-continuous functionals.
We analyze this algorithm in a general case and obtain an estimate of the convergence rate that does not accumulate errors during the operation of the method.
arXiv Detail & Related papers (2021-02-27T19:28:39Z)
- RL for Latent MDPs: Regret Guarantees and a Lower Bound [74.41782017817808]
We consider the regret problem for reinforcement learning in latent Markov Decision Processes (LMDPs).
In an LMDP, an MDP is randomly drawn from a set of $M$ possible MDPs at the beginning of the interaction, but the identity of the chosen MDP is not revealed to the agent.
We show that the key link is a notion of separation between the MDP system dynamics.
arXiv Detail & Related papers (2021-02-09T16:49:58Z)
- A Relation Analysis of Markov Decision Process Frameworks [26.308541799686505]
We study the relation between different Markov Decision Process (MDP) frameworks in the machine learning and econometrics literature.
We show that the entropy-regularized MDP is equivalent to an MDP model and is strictly subsumed by the general regularized MDP.
arXiv Detail & Related papers (2020-08-18T09:27:26Z)
- Exploration-Exploitation in Constrained MDPs [79.23623305214275]
We investigate the exploration-exploitation dilemma in Constrained Markov Decision Processes (CMDPs).
While learning in an unknown CMDP, an agent must trade off exploration, to discover new information about the MDP, against exploitation of its current knowledge to maximize reward.
While the agent will eventually learn a good or optimal policy, we do not want the agent to violate the constraints too often during the learning process.
arXiv Detail & Related papers (2020-03-04T17:03:56Z)
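As a companion to the entry on non-cumulative objectives above, here is a hedged toy sketch of one such mapping for the special case where the objective is the maximum of the per-step rewards: the state is augmented with the running maximum, and the shaped reward is the increment of that maximum, so the shaped rewards telescope to the original objective. The wrapper, the dummy environment, and the assumed reward lower bound `r_min` are my own illustration, not the cited paper's general construction.

```python
class MaxRewardWrapper:
    """Augment the state with the running maximum m and emit the increment
    max(m, r) - m as the reward, so that summed rewards equal max_t r_t."""

    def __init__(self, env, r_min=0.0):
        self.env = env
        self.r_min = r_min          # assumed lower bound on per-step rewards

    def reset(self):
        self.m = self.r_min
        return (self.env.reset(), self.m)

    def step(self, action):
        s, r, done = self.env.step(action)
        shaped = max(self.m, r) - self.m
        self.m = max(self.m, r)
        return (s, self.m), shaped, done


class FixedRewardEnv:
    """Toy episodic environment with a fixed reward sequence, for checking the wrapper."""

    rewards = [1.0, 5.0, 2.0]

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        r = self.rewards[self.t]
        self.t += 1
        return self.t, r, self.t == len(self.rewards)


env = MaxRewardWrapper(FixedRewardEnv())
state, total, done = env.reset(), 0.0, False
while not done:
    state, shaped, done = env.step(None)
    total += shaped
print(total)  # 5.0, the maximum of the original per-step rewards
```

Any standard RL algorithm for cumulative rewards can then be run unchanged on the wrapped environment.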
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.