Tackling Decision Processes with Non-Cumulative Objectives using Reinforcement Learning
- URL: http://arxiv.org/abs/2405.13609v1
- Date: Wed, 22 May 2024 13:01:37 GMT
- Title: Tackling Decision Processes with Non-Cumulative Objectives using Reinforcement Learning
- Authors: Maximilian Nägele, Jan Olle, Thomas Fösel, Remmy Zen, Florian Marquardt,
- Abstract summary: We introduce a general mapping of non-cumulative Markov decision processes to standard MDPs.
This allows all techniques developed to find optimal policies for MDPs to be directly applied to the larger class of NCMDPs.
We show applications in a diverse set of tasks, including classical control, portfolio optimization in finance, and discrete optimization problems.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Markov decision processes (MDPs) are used to model a wide variety of applications ranging from game playing over robotics to finance. Their optimal policy typically maximizes the expected sum of rewards given at each step of the decision process. However, a large class of problems does not fit straightforwardly into this framework: Non-cumulative Markov decision processes (NCMDPs), where instead of the expected sum of rewards, the expected value of an arbitrary function of the rewards is maximized. Example functions include the maximum of the rewards or their mean divided by their standard deviation. In this work, we introduce a general mapping of NCMDPs to standard MDPs. This allows all techniques developed to find optimal policies for MDPs, such as reinforcement learning or dynamic programming, to be directly applied to the larger class of NCMDPs. Focusing on reinforcement learning, we show applications in a diverse set of tasks, including classical control, portfolio optimization in finance, and discrete optimization problems. Given our approach, we can improve both final performance and training time compared to relying on standard MDPs.
Related papers
- Aligning CodeLLMs with Direct Preference Optimization [44.34483822102872]
This work first identifies that the commonly used PPO algorithm may be suboptimal for the alignment of CodeLLM.
Based on only preference data pairs, DPO can render the model rank data automatically, giving rise to a fine-grained rewarding pattern.
Studies show that our method significantly improves the performance of existing CodeLLMs on benchmarks such as MBPP and HumanEval.
arXiv Detail & Related papers (2024-10-24T09:36:13Z) - Stochastic Bilevel Optimization with Lower-Level Contextual Markov Decision Processes [42.22085862132403]
We introduce Bilevel Optimization with Contextual Markov Decision Processes (BO-CMDP), a bilevel decision-making model.
BO-CMDP can be viewed as a Stackelberg Game where the leader and a random context beyond the leader's control together decide the setup of (many) MDPs.
We propose a Hyper Policy Descent (HPGD) algorithm to solve BO-CMDP, and demonstrate its convergence.
arXiv Detail & Related papers (2024-06-03T17:54:39Z) - Beyond Average Return in Markov Decision Processes [49.157108194438635]
We prove that only generalized means can be optimized exactly, even in the more general framework of Distributional Reinforcement Learning (DistRL).
We provide error bounds on the resulting estimators, and discuss the potential of this approach as well as its limitations.
arXiv Detail & Related papers (2023-10-31T08:36:41Z) - Let's reward step by step: Step-Level reward model as the Navigators for
Reasoning [64.27898739929734]
Process-Supervised Reward Model (PRM) furnishes LLMs with step-by-step feedback during the training phase.
We propose a greedy search algorithm that employs the step-level feedback from PRM to optimize the reasoning pathways explored by LLMs.
To explore the versatility of our approach, we develop a novel method to automatically generate step-level reward dataset for coding tasks and observed similar improved performance in the code generation tasks.
arXiv Detail & Related papers (2023-10-16T05:21:50Z) - Sample-Efficient Multi-Objective Learning via Generalized Policy
Improvement Prioritization [8.836422771217084]
Multi-objective reinforcement learning (MORL) algorithms tackle sequential decision problems where agents may have different preferences.
We introduce a novel algorithm that uses Generalized Policy Improvement (GPI) to define principled, formally-derived prioritization schemes.
We empirically show that our method outperforms state-of-the-art MORL algorithms in challenging multi-objective tasks.
arXiv Detail & Related papers (2023-01-18T20:54:40Z) - Multi-Objective Policy Gradients with Topological Constraints [108.10241442630289]
We present a new algorithm for a policy gradient in TMDPs by a simple extension of the proximal policy optimization (PPO) algorithm.
We demonstrate this on a real-world multiple-objective navigation problem with an arbitrary ordering of objectives both in simulation and on a real robot.
arXiv Detail & Related papers (2022-09-15T07:22:58Z) - Efficient Policy Iteration for Robust Markov Decision Processes via
Regularization [49.05403412954533]
Robust decision processes (MDPs) provide a framework to model decision problems where the system dynamics are changing or only partially known.
Recent work established the equivalence between texttts rectangular $L_p$ robust MDPs and regularized MDPs, and derived a regularized policy iteration scheme that enjoys the same level of efficiency as standard MDPs.
In this work, we focus on the policy improvement step and derive concrete forms for the greedy policy and the optimal robust Bellman operators.
arXiv Detail & Related papers (2022-05-28T04:05:20Z) - Sequential Information Design: Markov Persuasion Process and Its
Efficient Reinforcement Learning [156.5667417159582]
This paper proposes a novel model of sequential information design, namely the Markov persuasion processes (MPPs)
Planning in MPPs faces the unique challenge in finding a signaling policy that is simultaneously persuasive to the myopic receivers and inducing the optimal long-term cumulative utilities of the sender.
We design a provably efficient no-regret learning algorithm, the Optimism-Pessimism Principle for Persuasion Process (OP4), which features a novel combination of both optimism and pessimism principles.
arXiv Detail & Related papers (2022-02-22T05:41:43Z) - Learning MDPs from Features: Predict-Then-Optimize for Sequential
Decision Problems by Reinforcement Learning [52.74071439183113]
We study the predict-then-optimize framework in the context of sequential decision problems (formulated as MDPs) solved via reinforcement learning.
Two significant computational challenges arise in applying decision-focused learning to MDPs.
arXiv Detail & Related papers (2021-06-06T23:53:31Z) - Reward is enough for convex MDPs [30.478950691312715]
We study convex MDPs in which goals are expressed as convex functions of the stationary distribution.
We propose a meta-algorithm for solving this problem and show that it unifies many existing algorithms in the literature.
arXiv Detail & Related papers (2021-06-01T17:46:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.