Stochastic Bilevel Optimization with Lower-Level Contextual Markov Decision Processes
- URL: http://arxiv.org/abs/2406.01575v1
- Date: Mon, 3 Jun 2024 17:54:39 GMT
- Title: Stochastic Bilevel Optimization with Lower-Level Contextual Markov Decision Processes
- Authors: Vinzenz Thoma, Barna Pasztor, Andreas Krause, Giorgia Ramponi, Yifan Hu
- Abstract summary: We introduce Bilevel Optimization with Contextual Markov Decision Processes (BO-CMDP), a bilevel decision-making model.
BO-CMDP can be viewed as a Stackelberg Game where the leader and a random context beyond the leader's control together decide the setup of (many) MDPs.
We propose a Hyper Policy Gradient Descent (HPGD) algorithm to solve BO-CMDP, and demonstrate its convergence.
- Score: 42.22085862132403
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In various applications, the optimal policy in a strategic decision-making problem depends both on the environmental configuration and exogenous events. For these settings, we introduce Bilevel Optimization with Contextual Markov Decision Processes (BO-CMDP), a stochastic bilevel decision-making model, where the lower level consists of solving a contextual Markov Decision Process (CMDP). BO-CMDP can be viewed as a Stackelberg Game where the leader and a random context beyond the leader's control together decide the setup of (many) MDPs that (potentially multiple) followers best respond to. This framework extends beyond traditional bilevel optimization and finds relevance in diverse fields such as model design for MDPs, tax design, reward shaping and dynamic mechanism design. We propose a stochastic Hyper Policy Gradient Descent (HPGD) algorithm to solve BO-CMDP, and demonstrate its convergence. Notably, HPGD only utilizes observations of the followers' trajectories. Therefore, it allows followers to use any training procedure and the leader to be agnostic of the specific algorithm used, which aligns with various real-world scenarios. We further consider the setting when the leader can influence the training of followers and propose an accelerated algorithm. We empirically demonstrate the performance of our algorithm.
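To make the algorithmic structure concrete, below is a minimal, hypothetical Python sketch of an HPGD-style update loop. All function names (sample_context, follower_best_response, rollout, estimate_hypergradient) and the toy objective are placeholders assumed for illustration, not the paper's actual implementation; the sketch only mirrors the interaction pattern stated in the abstract, namely that the leader updates its variable using stochastic gradient estimates built solely from observed follower trajectories.

```python
import numpy as np

# Minimal, hypothetical sketch of a hyper policy gradient descent loop for
# BO-CMDP. All helper functions below are illustrative placeholders.

rng = np.random.default_rng(0)

def sample_context():
    """Draw a random context beyond the leader's control (toy placeholder)."""
    return rng.normal(size=2)

def follower_best_response(x, xi):
    """Stand-in for the follower solving the contextual MDP induced by (x, xi)
    with *any* training procedure; HPGD only needs the resulting trajectories."""
    return x + xi  # dummy 'policy parameters' for illustration

def rollout(policy, xi, horizon=10):
    """Collect an observed trajectory of the follower's policy (toy dynamics)."""
    return [policy + rng.normal(scale=0.1, size=policy.shape) for _ in range(horizon)]

def estimate_hypergradient(x, trajectory):
    """Placeholder stochastic estimate of the gradient of the leader's objective
    w.r.t. x, computed only from the observed trajectory. The toy upper-level
    objective here is 0.5 * ||x - mean(states)||^2."""
    mean_state = np.mean(trajectory, axis=0)
    return x - mean_state

x = np.zeros(2)      # leader's variable, e.g. an MDP configuration or shaping reward
step_size = 0.05

for t in range(1000):
    xi = sample_context()                    # exogenous context
    policy = follower_best_response(x, xi)   # lower level: follower best-responds
    tau = rollout(policy, xi)                # leader only observes trajectories
    g = estimate_hypergradient(x, tau)       # stochastic hypergradient estimate
    x = x - step_size * g                    # leader's descent step
```

The design point mirrored here is that follower_best_response is a black box: the follower may use any training procedure, and the leader remains agnostic to it, which is the property the abstract highlights for HPGD.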
Related papers
- Balancing Optimality and Diversity: Human-Centered Decision Making through Generative Curation [6.980546503227467]
We introduce a novel framework called generative curation, which optimizes the true desirability of decision options by integrating both quantitative and qualitative aspects.
We propose two implementation approaches: a generative neural network architecture that produces a distribution $\pi$ to efficiently sample a diverse set of near-optimal actions, and a sequential optimization method to iteratively generate solutions.
We validate our approach with extensive datasets, demonstrating its effectiveness in enhancing decision-making processes across a range of complex environments.
arXiv Detail & Related papers (2024-09-17T20:13:32Z) - Tackling Decision Processes with Non-Cumulative Objectives using Reinforcement Learning [0.0]
We introduce a general mapping of non-cumulative Markov decision processes (NCMDPs) to standard MDPs.
This allows all techniques developed to find optimal policies for MDPs to be directly applied to the larger class of NCMDPs.
We show applications in a diverse set of tasks, including classical control, portfolio optimization in finance, and discrete optimization problems.
arXiv Detail & Related papers (2024-05-22T13:01:37Z) - Learning Logic Specifications for Policy Guidance in POMDPs: an
Inductive Logic Programming Approach [57.788675205519986]
We learn high-quality traces from POMDP executions generated by any solver.
We exploit data- and time-efficient Inductive Logic Programming (ILP) to generate interpretable belief-based policy specifications.
We show that learned specifications expressed in Answer Set Programming (ASP) yield performance superior to neural networks and similar to optimal handcrafted task-specific heuristics, within lower computational time.
arXiv Detail & Related papers (2024-02-29T15:36:01Z) - Contextual Stochastic Bilevel Optimization [50.36775806399861]
We introduce contextual stochastic bilevel optimization (CSBO) -- a bilevel optimization framework with the lower-level problem minimizing an expectation conditioned on some contextual information and the upper-level variable.
It is important for applications such as meta-learning, personalized learning, end-to-end learning, and Wasserstein distributionally robust optimization with side information (WDRO-SI).
arXiv Detail & Related papers (2023-10-27T23:24:37Z) - Policy Gradient With Serial Markov Chain Reasoning [10.152838128195468]
We introduce a new framework that performs decision-making in reinforcement learning as an iterative reasoning process.
We show our framework has several useful properties that are inherently missing from traditional RL.
Our resulting algorithm achieves state-of-the-art performance on popular MuJoCo and DeepMind Control benchmarks.
arXiv Detail & Related papers (2022-10-13T06:15:29Z) - Multi-Objective Policy Gradients with Topological Constraints [108.10241442630289]
We present a new policy gradient algorithm for topological MDPs (TMDPs), obtained through a simple extension of the proximal policy optimization (PPO) algorithm.
We demonstrate this on a real-world multiple-objective navigation problem with an arbitrary ordering of objectives both in simulation and on a real robot.
arXiv Detail & Related papers (2022-09-15T07:22:58Z) - Towards Global Optimality in Cooperative MARL with the Transformation
And Distillation Framework [26.612749327414335]
Decentralized execution is one core demand in cooperative multi-agent reinforcement learning (MARL).
In this paper, we theoretically analyze two common classes of algorithms with decentralized policies -- multi-agent policy gradient methods and value-decomposition methods.
We show that TAD-PPO can theoretically perform optimal policy learning in finite multi-agent MDPs and significantly outperforms baselines on a large set of cooperative multi-agent tasks.
arXiv Detail & Related papers (2022-07-12T06:59:13Z) - Anchor-Changing Regularized Natural Policy Gradient for Multi-Objective
Reinforcement Learning [17.916366827429034]
We study policy optimization for Markov decision processes (MDPs) with multiple reward value functions.
We propose an Anchor-changing Regularized Natural Policy Gradient framework, which can incorporate ideas from well-performing first-order methods.
arXiv Detail & Related papers (2022-06-10T21:09:44Z) - Learning MDPs from Features: Predict-Then-Optimize for Sequential
Decision Problems by Reinforcement Learning [52.74071439183113]
We study the predict-then-optimize framework in the context of sequential decision problems (formulated as MDPs) solved via reinforcement learning.
Two significant computational challenges arise in applying decision-focused learning to MDPs.
arXiv Detail & Related papers (2021-06-06T23:53:31Z) - Planning in Markov Decision Processes with Gap-Dependent Sample
Complexity [48.98199700043158]
We propose MDP-GapE, a new trajectory-based Monte-Carlo Tree Search algorithm for planning in a Markov Decision Process.
We prove an upper bound on the number of calls to the generative model needed for MDP-GapE to identify a near-optimal action with high probability.
arXiv Detail & Related papers (2020-06-10T15:05:51Z)