Monte Carlo Tree Search Algorithms for Risk-Aware and Multi-Objective
Reinforcement Learning
- URL: http://arxiv.org/abs/2211.13032v1
- Date: Wed, 23 Nov 2022 15:33:19 GMT
- Title: Monte Carlo Tree Search Algorithms for Risk-Aware and Multi-Objective
Reinforcement Learning
- Authors: Conor F. Hayes and Mathieu Reymond and Diederik M. Roijers and Enda
Howley and Patrick Mannion
- Abstract summary: In many risk-aware and multi-objective reinforcement learning settings, the utility of the user is derived from a single execution of a policy.
We propose two novel Monte Carlo tree search algorithms.
- Score: 2.3449131636069898
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In many risk-aware and multi-objective reinforcement learning settings, the
utility of the user is derived from a single execution of a policy. In these
settings, making decisions based on the average future returns is not suitable.
For example, in a medical setting a patient may only have one opportunity to
treat their illness. Making decisions using just the expected future returns --
known in reinforcement learning as the value -- cannot account for the
potential range of adverse or positive outcomes a decision may have. Therefore,
we should use the distribution over expected future returns differently to
represent the critical information that the agent requires at decision time by
taking both the future and accrued returns into consideration. In this paper,
we propose two novel Monte Carlo tree search algorithms. Firstly, we present a
Monte Carlo tree search algorithm that can compute policies for nonlinear
utility functions (NLU-MCTS) by optimising the utility of the different
possible returns attainable from individual policy executions, resulting in
good policies for both risk-aware and multi-objective settings. Secondly, we
propose a distributional Monte Carlo tree search algorithm (DMCTS) which
extends NLU-MCTS. DMCTS computes an approximate posterior distribution over the
utility of the returns, and utilises Thompson sampling during planning to
compute policies in risk-aware and multi-objective settings. Both algorithms
outperform the state-of-the-art in multi-objective reinforcement learning for
the expected utility of the returns.
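
The abstract's key distinction is between optimising the utility of the expected returns and optimising the expected utility of the returns obtained from a single policy execution. Below is a hedged, self-contained Python sketch of how a tree policy can operate on the utility of whole-episode returns rather than on expected rewards, in the spirit of NLU-MCTS and DMCTS. The `utility` function, the `Node` layout, and the Normal posterior used for Thompson sampling are illustrative assumptions, not the authors' implementation.

```python
import math
import random

# Hypothetical nonlinear utility over a (possibly multi-objective) return vector.
# Here: a risk-averse concave utility on the first objective; an assumption for illustration.
def utility(returns):
    return math.sqrt(max(returns[0], 0.0))

class Node:
    """One tree node holding utility statistics over full-episode returns."""
    def __init__(self):
        self.visits = 0
        self.utilities = []   # utilities of complete returns observed through this node
        self.children = {}    # action -> Node

    def mean_utility(self):
        return sum(self.utilities) / len(self.utilities) if self.utilities else 0.0

def ucb_select(node, c=1.0):
    """NLU-MCTS-style selection: UCB over the mean utility of returns (not mean reward)."""
    def score(child):
        if child.visits == 0:
            return float("inf")
        return child.mean_utility() + c * math.sqrt(math.log(node.visits + 1) / child.visits)
    return max(node.children, key=lambda a: score(node.children[a]))

def thompson_select(node, prior_mean=0.0, prior_var=1.0, obs_var=1.0):
    """DMCTS-style selection: sample each child's utility from an approximate Normal
    posterior and pick the argmax (Thompson sampling). The conjugate Normal model is
    an assumption, not the paper's exact posterior."""
    def sample(child):
        n = len(child.utilities)
        if n == 0:
            return random.gauss(prior_mean, math.sqrt(prior_var))
        post_var = 1.0 / (1.0 / prior_var + n / obs_var)
        post_mean = post_var * (prior_mean / prior_var + sum(child.utilities) / obs_var)
        return random.gauss(post_mean, math.sqrt(post_var))
    return max(node.children, key=lambda a: sample(node.children[a]))

def backup(path, accrued, future):
    """Back up the utility of accrued + future returns along the visited path."""
    total = [a + f for a, f in zip(accrued, future)]
    u = utility(total)
    for node in path:
        node.visits += 1
        node.utilities.append(u)
```

The point of the sketch is that `backup` scores nodes by the utility of a complete episode's return (accrued plus future), so a nonlinear utility is applied before averaging; swapping `ucb_select` for `thompson_select` turns the NLU-MCTS-style tree policy into the DMCTS-style one described in the abstract.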
Related papers
- Beyond Expected Return: Accounting for Policy Reproducibility when
Evaluating Reinforcement Learning Algorithms [9.649114720478872]
Many applications in Reinforcement Learning (RL) have noise or stochasticity present in the environment.
These uncertainties lead the exact same policy to perform differently from one roll-out to another.
Common evaluation procedures in RL summarise the consequent return distributions using solely the expected return, which does not account for the spread of the distribution.
Our work defines this spread as policy reproducibility: the ability of a policy to obtain similar performance when rolled out many times, a crucial property in some real-world applications.
arXiv Detail & Related papers (2023-12-12T11:22:31Z)
- Rollout Heuristics for Online Stochastic Contingent Planning [6.185979230964809]
Partially Observable Monte-Carlo Planning (POMCP) is an online algorithm for deciding on the next action to perform.
POMCP is highly dependent on the rollout policy to compute good estimates.
In this paper, we model POMDPs as contingent planning problems.
arXiv Detail & Related papers (2023-10-03T18:24:47Z)
- Value-Distributional Model-Based Reinforcement Learning [59.758009422067]
Quantifying uncertainty about a policy's long-term performance is important to solve sequential decision-making tasks.
We study the problem from a model-based Bayesian reinforcement learning perspective.
We propose Epistemic Quantile-Regression (EQR), a model-based algorithm that learns a value distribution function.
arXiv Detail & Related papers (2023-08-12T14:59:19Z)
- Truncating Trajectories in Monte Carlo Reinforcement Learning [48.97155920826079]
In Reinforcement Learning (RL), an agent acts in an unknown environment to maximize the expected cumulative discounted sum of an external reward signal.
We propose an a-priori budget allocation strategy that leads to the collection of trajectories of different lengths.
We show that an appropriate truncation of the trajectories can succeed in improving performance.
arXiv Detail & Related papers (2023-05-07T19:41:57Z)
- Multivariate Systemic Risk Measures and Computation by Deep Learning Algorithms [63.03966552670014]
We discuss the key related theoretical aspects, with a particular focus on the fairness properties of primal optima and associated risk allocations.
The algorithms we provide allow for learning primals, optima for the dual representation and corresponding fair risk allocations.
arXiv Detail & Related papers (2023-02-02T22:16:49Z)
- Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z)
- Nearly Optimal Latent State Decoding in Block MDPs [74.51224067640717]
In episodic Block MDPs, the decision maker has access to rich observations or contexts generated from a small number of latent states.
We are first interested in estimating the latent state decoding function based on data generated under a fixed behavior policy.
We then study the problem of learning near-optimal policies in the reward-free framework.
arXiv Detail & Related papers (2022-08-17T18:49:53Z)
- Multi-Objective Coordination Graphs for the Expected Scalarised Returns with Generative Flow Models [2.7648976108201815]
Key to solving real-world problems is to exploit sparse dependency structures between agents.
In wind farm control a trade-off exists between maximising power and minimising stress on the systems components.
We model such sparse dependencies between agents as a multi-objective coordination graph (MO-CoG).
arXiv Detail & Related papers (2022-07-01T12:10:15Z)
- Expected Scalarised Returns Dominance: A New Solution Concept for Multi-Objective Decision Making [4.117597517886004]
In many real-world scenarios, the utility of a user is derived from the single execution of a policy.
To apply multi-objective reinforcement learning, the expected utility of the returns must be optimised.
We propose first-order stochastic dominance as a criterion to build solution sets to maximise expected utility.
We then define a new solution concept called the ESR set, which is a set of policies that are ESR dominant; a hedged formal sketch of ESR dominance follows after this list.
arXiv Detail & Related papers (2021-06-02T09:42:42Z)
- Risk Aware and Multi-Objective Decision Making with Distributional Monte Carlo Tree Search [3.487620847066216]
We propose an algorithm that learns a posterior distribution over the utility of the different possible returns attainable from individual policy executions.
Our algorithm outperforms the state-of-the-art in multi-objective reinforcement learning for the expected utility of the returns.
arXiv Detail & Related papers (2021-02-01T16:47:39Z)
- Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
arXiv Detail & Related papers (2020-12-28T05:02:26Z)
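
The "Expected Scalarised Returns Dominance" entry above supplies the solution concept the main paper builds on. The following is a hedged formal sketch of the ESR criterion and of ESR dominance via first-order stochastic dominance; the notation (with $\mathbf{Z}^{\pi}$ denoting the random return vector of a single execution of $\pi$) and the exact dominance condition are simplifications for illustration, not verbatim definitions from the paper.

```latex
% ESR criterion: optimise the expected utility of the returns of a single
% execution, rather than the utility of the expected returns (SER).
\[
  V^{\pi}_{\mathrm{ESR}}
    = \mathbb{E}\!\left[ u\!\left( \textstyle\sum_{t} \gamma^{t}\, \mathbf{r}_{t} \right) \,\middle|\, \pi \right]
  \qquad \text{vs.} \qquad
  V^{\pi}_{\mathrm{SER}}
    = u\!\left( \mathbb{E}\!\left[ \textstyle\sum_{t} \gamma^{t}\, \mathbf{r}_{t} \,\middle|\, \pi \right] \right)
\]
% A policy \pi ESR-dominates \pi' when its return distribution first-order
% stochastically dominates that of \pi', i.e. every user with a monotonically
% increasing utility function u weakly prefers \pi:
\[
  \pi \succ_{\mathrm{ESR}} \pi'
  \;\iff\;
  \mathbb{E}\big[ u(\mathbf{Z}^{\pi}) \big] \ge \mathbb{E}\big[ u(\mathbf{Z}^{\pi'}) \big]
  \;\;\text{for all monotonically increasing } u,
  \;\;\text{with strict inequality for at least one } u.
\]
```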