Related papers: Tail-Risk-Safe Monte Carlo Tree Search under PAC-Level Guarantees

Tail-Risk-Safe Monte Carlo Tree Search under PAC-Level Guarantees

URL: http://arxiv.org/abs/2508.05441v1
Date: Thu, 07 Aug 2025 14:31:22 GMT
Title: Tail-Risk-Safe Monte Carlo Tree Search under PAC-Level Guarantees
Authors: Zuyuan Zhang, Arnob Ghosh, Tian Lan,
Abstract summary: Making decisions with respect to just the expected returns in Monte Carlo Tree Search cannot account for the potential range of high-risk, adverse outcomes associated with a decision.<n>These approaches fail to provide rigorous tail-safety guarantees with respect to extreme or high-risk outcomes.<n>This paper addresses the problem by developing two novel solutions.
Score: 22.591358065741357
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Making decisions with respect to just the expected returns in Monte Carlo Tree Search (MCTS) cannot account for the potential range of high-risk, adverse outcomes associated with a decision. To this end, safety-aware MCTS often consider some constrained variants -- by introducing some form of mean risk measures or hard cost thresholds. These approaches fail to provide rigorous tail-safety guarantees with respect to extreme or high-risk outcomes (denoted as tail-risk), potentially resulting in serious consequence in high-stake scenarios. This paper addresses the problem by developing two novel solutions. We first propose CVaR-MCTS, which embeds a coherent tail risk measure, Conditional Value-at-Risk (CVaR), into MCTS. Our CVaR-MCTS with parameter $\alpha$ achieves explicit tail-risk control over the expected loss in the "worst $(1-\alpha)\%$ scenarios." Second, we further address the estimation bias of tail-risk due to limited samples. We propose Wasserstein-MCTS (or W-MCTS) by introducing a first-order Wasserstein ambiguity set $\mathcal{P}_{\varepsilon_{s}}(s,a)$ with radius $\varepsilon_{s}$ to characterize the uncertainty in tail-risk estimates. We prove PAC tail-safety guarantees for both CVaR-MCTS and W-MCTS and establish their regret. Evaluations on diverse simulated environments demonstrate that our proposed methods outperform existing baselines, effectively achieving robust tail-risk guarantees with improved rewards and stability.

Related papers

Model-Based Epistemic Variance of Values for Risk-Aware Policy Optimization [59.758009422067]
We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning. We propose a new uncertainty Bellman equation (UBE) whose solution converges to the true posterior variance over values. We introduce a general-purpose policy optimization algorithm, Q-Uncertainty Soft Actor-Critic (QU-SAC) that can be applied for either risk-seeking or risk-averse policy optimization.
arXiv Detail & Related papers (2023-12-07T15:55:58Z)
Risk Estimation in a Markov Cost Process: Lower and Upper Bounds [3.1484174280822845]
We tackle the problem of estimating risk measures of the infinite-horizon discounted cost within a Markov cost process. The risk measures we study include variance, Value-at-Risk (VaR), and Conditional Value-at-Risk (CVaR)
arXiv Detail & Related papers (2023-10-17T16:35:39Z)
Risk-Averse Reinforcement Learning via Dynamic Time-Consistent Risk Measures [10.221369785560785]
In this paper, we consider the problem of maximizing dynamic risk of a sequence of rewards in Markov Decision Processes (MDPs) Using a convex combination of expectation and conditional value-at-risk (CVaR) as a special one-step conditional risk measure, we reformulate the risk-averse MDP as a risk-neutral counterpart with augmented action space and manipulation on the immediate rewards. Our numerical studies show that the risk-averse setting can reduce the variance and enhance robustness of the results.
arXiv Detail & Related papers (2023-01-14T21:43:18Z)
A Simple and Optimal Policy Design with Safety against Heavy-Tailed Risk for Stochastic Bandits [27.058087400790555]
We study the multi-armed bandit problem and design new policies that enjoy both worst-case optimality for expected regret and light-tailed risk for regret distribution. We find that from a managerial perspective, our new policy design yields better tail distributions and is preferable than celebrated policies.
arXiv Detail & Related papers (2022-06-07T02:10:30Z)
Efficient Risk-Averse Reinforcement Learning [79.61412643761034]
In risk-averse reinforcement learning (RL), the goal is to optimize some risk measure of the returns. We prove that under certain conditions this inevitably leads to a local-optimum barrier, and propose a soft risk mechanism to bypass it. We demonstrate improved risk aversion in maze navigation, autonomous driving, and resource allocation benchmarks.
arXiv Detail & Related papers (2022-05-10T19:40:52Z)
Risk-Averse Stochastic Shortest Path Planning [25.987787625028204]
We show that optimal, stationary, Markovian policies exist and can be found via a special Bellman's equation. A rover navigation MDP is used to illustrate the proposed methodology with conditional-value-at-risk (CVaR) and entropic-value-at-risk (EVaR) coherent risk measures.
arXiv Detail & Related papers (2021-03-26T20:49:14Z)
Risk-Constrained Thompson Sampling for CVaR Bandits [82.47796318548306]
We consider a popular risk measure in quantitative finance known as the Conditional Value at Risk (CVaR) We explore the performance of a Thompson Sampling-based algorithm CVaR-TS under this risk measure.
arXiv Detail & Related papers (2020-11-16T15:53:22Z)
Large-Scale Methods for Distributionally Robust Optimization [53.98643772533416]
We prove that our algorithms require a number of evaluations gradient independent of training set size and number of parameters. Experiments on MNIST and ImageNet confirm the theoretical scaling of our algorithms, which are 9--36 times more efficient than full-batch methods.
arXiv Detail & Related papers (2020-10-12T17:41:44Z)
Optimal Best-Arm Identification Methods for Tail-Risk Measures [9.128264779870538]
Conditional value-at-risk (CVaR) and value-at-risk (VaR) are popular tail-risk measures in finance and insurance industries. We identify the smallest CVaR, VaR, or sum of CVaR and mean from amongst finitely that has smallest CVaR, VaR, or sum of CVaR and mean.
arXiv Detail & Related papers (2020-08-17T20:23:24Z)
Risk-Sensitive Reinforcement Learning: Near-Optimal Risk-Sample Tradeoff in Regret [115.85354306623368]
We study risk-sensitive reinforcement learning in episodic Markov decision processes with unknown transition kernels. We propose two provably efficient model-free algorithms, Risk-Sensitive Value Iteration (RSVI) and Risk-Sensitive Q-learning (RSQ) We prove that RSVI attains an $tildeObig(lambda(|beta| H2) cdot sqrtH3 S2AT big)$ regret, while RSQ attains an $tildeObig(lambda
arXiv Detail & Related papers (2020-06-22T19:28:26Z)
Learning Bounds for Risk-sensitive Learning [86.50262971918276]
In risk-sensitive learning, one aims to find a hypothesis that minimizes a risk-averse (or risk-seeking) measure of loss. We study the generalization properties of risk-sensitive learning schemes whose optimand is described via optimized certainty equivalents.
arXiv Detail & Related papers (2020-06-15T05:25:02Z)

This list is automatically generated from the titles and abstracts of the papers in this site.