Investigating Scale Independent UCT Exploration Factor Strategies
- URL: http://arxiv.org/abs/2510.21275v1
- Date: Fri, 24 Oct 2025 09:19:14 GMT
- Title: Investigating Scale Independent UCT Exploration Factor Strategies
- Authors: Robin Schmöcker, Christoph Schnell, Alexander Dockhorn
- Abstract summary: The Upper Confidence Bounds For Trees algorithm is not agnostic to the reward scale of the game it is applied to. In this paper, we evaluate various strategies for adaptively choosing the exploration factor $\lambda$, called $\lambda$-strategies, that are agnostic to the game's reward scale.
- Score: 42.13843953705695
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Upper Confidence Bounds For Trees (UCT) algorithm is not agnostic to the reward scale of the game it is applied to. For zero-sum games with the sparse rewards of $\{-1,0,1\}$ at the end of the game, this is not a problem, but many games often feature dense rewards with hand-picked reward scales, causing a node's Q-value to span different magnitudes across different games. In this paper, we evaluate various strategies for adaptively choosing the UCT exploration constant $\lambda$, called $\lambda$-strategies, that are agnostic to the game's reward scale. These $\lambda$-strategies include those proposed in the literature as well as five new strategies. Given our experimental results, we recommend using one of our newly suggested $\lambda$-strategies, which is to choose $\lambda$ as $2 \cdot \sigma$ where $\sigma$ is the empirical standard deviation of all state-action pairs' Q-values of the search tree. This method outperforms existing $\lambda$-strategies across a wide range of tasks both in terms of a single parameter value and the peak performances obtained by optimizing all available parameters.
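To make the recommended strategy concrete, the following is a minimal sketch of UCT action selection with $\lambda = 2 \cdot \sigma$, where $\sigma$ is the empirical standard deviation of the Q-values of all state-action pairs in the search tree. The tree structure, field names, and the fallback before enough statistics exist are illustrative assumptions, not the authors' implementation:

```python
import math

# Sketch of UCT selection with the scale-independent lambda-strategy
# lambda = 2 * sigma (sigma = empirical std of all Q-values in the tree).
# Node layout and fallback constant are assumptions for illustration.

class Node:
    def __init__(self):
        self.children = {}  # action -> child Node
        self.q = {}         # action -> mean reward estimate (Q-value)
        self.n = {}         # action -> visit count
        self.n_total = 0    # total visits to this node

def tree_q_std(root):
    """Empirical standard deviation over the Q-values of every
    state-action pair in the search tree."""
    qs, stack = [], [root]
    while stack:
        node = stack.pop()
        qs.extend(node.q.values())
        stack.extend(node.children.values())
    if len(qs) < 2:
        return 1.0  # fallback before enough statistics exist (assumption)
    mean = sum(qs) / len(qs)
    return math.sqrt(sum((q - mean) ** 2 for q in qs) / len(qs))

def uct_select(node, root):
    """Pick the action maximizing Q + lambda * sqrt(ln(N) / n)."""
    lam = 2.0 * tree_q_std(root)  # scale-independent exploration factor
    def score(action):
        if node.n[action] == 0:
            return float("inf")  # expand unvisited actions first
        return node.q[action] + lam * math.sqrt(
            math.log(node.n_total) / node.n[action])
    return max(node.q, key=score)
```

A practical implementation would maintain the mean and variance of the tree's Q-values incrementally (e.g., with Welford's algorithm) rather than re-traversing the tree at every selection step.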
Related papers
- Scale-Invariant Fast Convergence in Games [67.02769061793619]
We develop learning dynamics that achieve fast convergence while being both scale-free and scale-invariant. For two-player zero-sum games, we obtain scale-free and scale-invariant dynamics with external regret bounded by $\tilde{O}(A_\mathrm{diff})$. For multiplayer general-sum games, scale-free learning is enabled also by a technique called doubling clipping, which clips observed gradients based on past observations.
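The one-sentence summary above suggests a threshold that grows geometrically with the observed gradient magnitudes; the following is a speculative sketch of that reading, not the paper's actual definition of doubling clipping:

```python
# Speculative sketch: clip each gradient to a threshold derived from past
# observations, doubling the threshold whenever a new observation exceeds
# it. The initial threshold and the update rule are assumptions.

def doubling_clip(grads, init_threshold=1.0):
    threshold = init_threshold
    for g in grads:
        # clip based on what has been seen so far
        yield max(-threshold, min(g, threshold))
        # then grow the threshold for future observations
        while abs(g) > threshold:
            threshold *= 2.0
```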
arXiv Detail & Related papers (2026-02-12T11:57:20Z) - Grouped Satisficing Paths in Pure Strategy Games: a Topological Perspective [15.76917401735207]
A widely adopted principle in MARL algorithms is "win-stay, lose-shift", which dictates that an agent retains its current strategy if it achieves the best response. This paper establishes a sufficient condition for such a property, and demonstrates that any finite-state Markov game, as well as any $N$-player game, guarantees the existence of a finite-length satisficing path.
arXiv Detail & Related papers (2025-09-27T07:07:27Z) - Near-optimal Regret Using Policy Optimization in Online MDPs with Aggregate Bandit Feedback [49.84060509296641]
We study online finite-horizon Markov Decision Processes with adversarially changing loss and aggregate bandit feedback (a.k.a. full-bandit feedback). Under this type of feedback, the agent observes only the total loss incurred over the entire trajectory, rather than the individual losses at each intermediate step within the trajectory. We introduce the first Policy Optimization algorithms for this setting.
arXiv Detail & Related papers (2025-02-06T12:03:24Z) - Horizon-free Reinforcement Learning in Adversarial Linear Mixture MDPs [72.40181882916089]
We show that our algorithm achieves an $\tilde{O}\big((d+\log(|\mathcal{S}|^2|\mathcal{A}|))\sqrt{K}\big)$ regret with full-information feedback, where $d$ is the dimension of a known feature mapping linearly parametrizing the unknown transition kernel of the MDP, $K$ is the number of episodes, and $|\mathcal{S}|$ and $|\mathcal{A}|$ are the cardinalities of the state and action spaces.
arXiv Detail & Related papers (2023-05-15T05:37:32Z) - Provably Efficient Offline Multi-agent Reinforcement Learning via Strategy-wise Bonus [48.34563955829649]
We propose a strategy-wise concentration principle which builds a confidence interval for the joint strategy.
For two-player zero-sum Markov games, by exploiting the convexity of the strategy-wise bonus, we propose an efficient algorithm.
All of our algorithms can naturally take a pre-specified strategy class $\Pi$ as input and output a strategy close to the best strategy in $\Pi$.
arXiv Detail & Related papers (2022-06-01T00:18:15Z) - Gap-Dependent Unsupervised Exploration for Reinforcement Learning [40.990467706237396]
We present an efficient algorithm for task-agnostic reinforcement learning.
The algorithm takes only $\widetilde{\mathcal{O}}(1/\epsilon \cdot (H^3SA/\rho + H^4S^2A))$ episodes of exploration.
We show that, information-theoretically, this bound is nearly tight for $\rho \geq \Theta(1/(HS))$ and $H>1$.
arXiv Detail & Related papers (2021-08-11T20:42:46Z) - Randomized Exploration for Reinforcement Learning with General Value Function Approximation [122.70803181751135]
We propose a model-free reinforcement learning algorithm inspired by the popular randomized least squares value iteration (RLSVI) algorithm.
Our algorithm drives exploration by simply perturbing the training data with judiciously chosen i.i.d. scalar noises.
We complement the theory with an empirical evaluation across known difficult exploration tasks.
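As an illustration of the perturbation idea described above, here is a minimal RLSVI-style sketch: add i.i.d. scalar noise to the regression targets before each value-function fit, so the fitted values are optimistic often enough to drive exploration. The linear model, ridge regularizer, and noise scale are illustrative assumptions, not the paper's exact algorithm:

```python
import numpy as np

# Sketch of exploration via perturbed regression targets (RLSVI-style).
# Model choice and noise scale are assumptions for illustration.

def perturbed_value_regression(features, targets, noise_std, reg=1.0):
    """Ridge regression on targets perturbed with i.i.d. Gaussian noise."""
    rng = np.random.default_rng()
    noisy_targets = targets + rng.normal(0.0, noise_std, size=targets.shape)
    d = features.shape[1]
    A = features.T @ features + reg * np.eye(d)   # regularized Gram matrix
    b = features.T @ noisy_targets
    return np.linalg.solve(A, b)                  # value-estimate weights
```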
arXiv Detail & Related papers (2021-06-15T02:23:07Z) - Near-Optimal Reinforcement Learning with Self-Play [50.29853537456737]
We focus on self-play algorithms, which learn the optimal policy by playing against themselves without any direct supervision.
We propose an optimistic variant of the Nash Q-learning algorithm with sample complexity $\tilde{\mathcal{O}}(SAB)$, and a new Nash V-learning algorithm with sample complexity $\tilde{\mathcal{O}}(S(A+B))$.
arXiv Detail & Related papers (2020-06-22T05:00:13Z)