Polynomial Regret Concentration of UCB for Non-Deterministic State Transitions
- URL: http://arxiv.org/abs/2502.06900v1
- Date: Sun, 09 Feb 2025 15:46:42 GMT
- Title: Polynomial Regret Concentration of UCB for Non-Deterministic State Transitions
- Authors: Can Cömer, Jannis Blüml, Cedric Derstroff, Kristian Kersting
- Abstract summary: This paper extends the theoretical framework of Monte Carlo Tree Search (MCTS) to non-deterministic state transitions.
Our primary contribution is proving that these bounds also apply to non-deterministic environments, ensuring robust performance in stochastic settings.
This broadens the applicability of MCTS to real-world decision-making problems with probabilistic outcomes, such as in autonomous systems and financial decision-making.
- Score: 18.347534843178355
- License:
- Abstract: Monte Carlo Tree Search (MCTS) has proven effective in solving decision-making problems in perfect information settings. However, its application to stochastic and imperfect information domains remains limited. This paper extends the theoretical framework of MCTS to stochastic domains by addressing non-deterministic state transitions, where actions lead to probabilistic outcomes. Specifically, building on the work of Shah et al. (2020), we derive polynomial regret concentration bounds for the Upper Confidence Bound algorithm in multi-armed bandit problems with stochastic transitions, offering improved theoretical guarantees. Our primary contribution is proving that these bounds also apply to non-deterministic environments, ensuring robust performance in stochastic settings. This broadens the applicability of MCTS to real-world decision-making problems with probabilistic outcomes, such as in autonomous systems and financial decision-making.
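To make the main claim concrete, here is a hedged sketch of the two objects the abstract refers to: the UCB index used for arm selection, and polynomial (rather than exponential) regret concentration in the style of Shah et al. (2020). The notation below (the exploration constant c and the constants beta, xi, eta) is assumed for illustration only; the exact constants and exponents are the ones derived in the paper.

```latex
% Hedged sketch, not the paper's precise statement.
% At round t, UCB pulls the arm with the highest optimistic index, where
% \bar{X}_{a,N_a(t)} is the empirical mean reward of arm a after N_a(t) pulls:
\[
  a_t \;=\; \arg\max_a \Bigl( \bar{X}_{a,N_a(t)}
            \;+\; c\,\sqrt{\tfrac{\log t}{N_a(t)}} \Bigr).
\]
% "Polynomial regret concentration" means tail bounds on the empirical mean
% that decay polynomially in the deviation: for assumed constants
% \beta > 0, \xi > 1 and an exponent \eta \in [1/2, 1),
\[
  \mathbb{P}\bigl( n\bar{X}_n - n\mu \;\ge\; n^{\eta} x \bigr)
  \;\le\; \frac{\beta}{x^{\xi}}
  \qquad \text{for all } x \ge 1,
\]
% together with a symmetric lower-tail bound. The paper's contribution is
% showing that bounds of this form continue to hold when the next state s'
% is drawn from a transition kernel P(. | s, a) instead of being fixed.
```

Roughly, such polynomial tails are weaker than the exponential tails assumed in classical UCB analyses, but they are the property that can be propagated recursively from leaves to the root of the MCTS tree, which is what the extension to stochastic transitions needs at each level.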
Related papers
- SAUP: Situation Awareness Uncertainty Propagation on LLM Agent [52.444674213316574]
Large language models (LLMs) integrated into multistep agent systems enable complex decision-making processes across various applications.
Existing uncertainty estimation methods primarily focus on final-step outputs, which fail to account for cumulative uncertainty over the multistep decision-making process and the dynamic interactions between agents and their environments.
We propose SAUP, a novel framework that propagates uncertainty through each step of an LLM-based agent's reasoning process.
arXiv Detail & Related papers (2024-12-02T01:31:13Z) - Monte Carlo Planning for Stochastic Control on Constrained Markov Decision Processes [1.445706856497821]
This work defines an MDP framework, the SD-MDP, where we disentangle the causal structure of MDPs' transition and reward dynamics.
We derive theoretical guarantees on the estimation error of the value function under an optimal policy by allowing independent value estimation from Monte Carlo sampling.
arXiv Detail & Related papers (2024-06-23T16:22:40Z) - Model-Based Epistemic Variance of Values for Risk-Aware Policy Optimization [59.758009422067]
We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning.
We propose a new uncertainty Bellman equation (UBE) whose solution converges to the true posterior variance over values.
We introduce a general-purpose policy optimization algorithm, Q-Uncertainty Soft Actor-Critic (QU-SAC) that can be applied for either risk-seeking or risk-averse policy optimization.
arXiv Detail & Related papers (2023-12-07T15:55:58Z) - Q-Learning for Stochastic Control under General Information Structures
and Non-Markovian Environments [1.90365714903665]
We present a convergence theorem for stochastic iterations, and in particular Q-learning iterates, under a general, possibly non-Markovian, stochastic environment.
We discuss the implications and applications of this theorem to a variety of control problems with non-Markovian environments.
arXiv Detail & Related papers (2023-10-31T19:53:16Z) - Fully Stochastic Trust-Region Sequential Quadratic Programming for
Equality-Constrained Optimization Problems [62.83783246648714]
We propose a trust-region stochastic sequential quadratic programming algorithm (TR-StoSQP) to solve nonlinear optimization problems with stochastic objectives and deterministic equality constraints.
The algorithm adaptively selects the trust-region radius and, compared to the existing line-search StoSQP schemes, allows us to utilize indefinite Hessian matrices.
arXiv Detail & Related papers (2022-11-29T05:52:17Z) - Learning to Optimize with Stochastic Dominance Constraints [103.26714928625582]
In this paper, we develop a simple yet efficient approach for the problem of comparing uncertain quantities.
We recast the inner optimization in the Lagrangian as a learning problem for surrogate approximation, which bypasses the apparent intractability.
The proposed light-SD demonstrates superior performance on several representative problems ranging from finance to supply chain management.
arXiv Detail & Related papers (2022-11-14T21:54:31Z) - Robustness Guarantees for Credal Bayesian Networks via Constraint
Relaxation over Probabilistic Circuits [16.997060715857987]
We develop a method to quantify the robustness of decision functions with respect to credal Bayesian networks.
We show how to obtain a guaranteed upper bound on MARmax in linear time in the size of the circuit.
arXiv Detail & Related papers (2022-05-11T22:37:07Z) - Identification of Unexpected Decisions in Partially Observable
Monte-Carlo Planning: a Rule-Based Approach [78.05638156687343]
We propose a methodology for analyzing POMCP policies by inspecting their traces.
The proposed method explores local properties of policy behavior to identify unexpected decisions.
We evaluate our approach on Tiger, a standard benchmark for POMDPs, and a real-world problem related to mobile robot navigation.
arXiv Detail & Related papers (2020-12-23T15:09:28Z) - Stein Variational Model Predictive Control [130.60527864489168]
Decision making under uncertainty is critical to real-world, autonomous systems.
Model Predictive Control (MPC) methods have demonstrated favorable performance in practice, but remain limited when dealing with complex distributions.
We show that this framework leads to successful planning in challenging, non-convex optimal control problems.
arXiv Detail & Related papers (2020-11-15T22:36:59Z) - Optimal Inspection and Maintenance Planning for Deteriorating Structural
Components through Dynamic Bayesian Networks and Markov Decision Processes [0.0]
Partially Observable Markov Decision Processes (POMDPs) provide a mathematical methodology for optimal control under uncertain action outcomes and observations.
We provide the formulation for developing both infinite and finite horizon POMDPs in a structural reliability context.
Results show that POMDPs achieve substantially lower costs as compared to their counterparts, even for traditional problem settings.
arXiv Detail & Related papers (2020-09-09T20:03:42Z)