Policy Gradient with Tree Search: Avoiding Local Optimas through Lookahead
- URL: http://arxiv.org/abs/2506.07054v1
- Date: Sun, 08 Jun 2025 09:28:11 GMT
- Title: Policy Gradient with Tree Search: Avoiding Local Optimas through Lookahead
- Authors: Uri Koren, Navdeep Kumar, Uri Gadot, Giorgia Ramponi, Kfir Yehuda Levy, Shie Mannor
- Abstract summary: Policy Gradient with Tree Search (PGTS) is an approach that integrates an $m$-step lookahead mechanism to enhance policy optimization. We provide theoretical analysis demonstrating that increasing the tree search depth $m$ monotonically reduces the set of undesirable stationary points. Empirical evaluations on diverse MDP structures, including Ladder, Tightrope, and Gridworld environments, illustrate PGTS's ability to exhibit "farsightedness".
- Score: 45.63877278757336
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Classical policy gradient (PG) methods in reinforcement learning frequently converge to suboptimal local optima, a challenge exacerbated in large or complex environments. This work investigates Policy Gradient with Tree Search (PGTS), an approach that integrates an $m$-step lookahead mechanism to enhance policy optimization. We provide theoretical analysis demonstrating that increasing the tree search depth $m$ monotonically reduces the set of undesirable stationary points and, consequently, improves the worst-case performance of any resulting stationary policy. Critically, our analysis accommodates practical scenarios where policy updates are restricted to states visited by the current policy, rather than requiring updates across the entire state space. Empirical evaluations on diverse MDP structures, including Ladder, Tightrope, and Gridworld environments, illustrate PGTS's ability to exhibit "farsightedness," navigate challenging reward landscapes, escape local traps where standard PG fails, and achieve superior solutions.
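The abstract describes PGTS only at a high level, so the following is a minimal, illustrative Python sketch of the general idea rather than the paper's exact algorithm: a softmax tabular policy is pushed toward $m$-step lookahead Q-values computed by exhaustive tree search (bootstrapped with the current policy's value function), with updates applied only at states visited by the current policy. The function and variable names, the softmax parameterization, and the specific update rule are all assumptions made for this example.

```python
import numpy as np

# Minimal sketch of an m-step lookahead policy update on a tabular MDP.
# This illustrates the general idea in the abstract, not the paper's exact
# PGTS algorithm; names and the update rule are assumptions for this example.

def policy_from_logits(theta):
    """Softmax policy over actions for each state; theta has shape (S, A)."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def policy_value(P, R, pi, gamma):
    """Exact value of a policy on a tabular MDP via a linear solve."""
    S, A, _ = P.shape
    P_pi = np.einsum("sa,sat->st", pi, P)   # state-to-state transitions under pi
    r_pi = np.einsum("sa,sa->s", pi, R)     # expected one-step reward under pi
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

def lookahead_q(P, R, v, s, a, m, gamma):
    """m-step lookahead Q-value: expand the tree to depth m, bootstrap with v."""
    if m == 1:
        return R[s, a] + gamma * P[s, a] @ v
    next_vals = np.array([
        max(lookahead_q(P, R, v, s2, a2, m - 1, gamma) for a2 in range(R.shape[1]))
        for s2 in range(R.shape[0])
    ])
    return R[s, a] + gamma * P[s, a] @ next_vals

def pgts_step(theta, P, R, gamma, m, lr, visited_states):
    """One sketched PGTS update: push the softmax policy toward m-step lookahead
    Q-values, only at states visited by the current policy."""
    pi = policy_from_logits(theta)
    v = policy_value(P, R, pi, gamma)
    for s in visited_states:
        q_m = np.array([lookahead_q(P, R, v, s, a, m, gamma) for a in range(R.shape[1])])
        adv = q_m - pi[s] @ q_m              # lookahead advantage
        theta[s] += lr * pi[s] * adv         # softmax policy-gradient-style step
    return theta
```

With m = 1 this reduces to an ordinary one-step policy-gradient-style update; increasing m makes the target Q-values look further ahead, which is the mechanism the paper credits for escaping local traps.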
Related papers
- Convergence and Sample Complexity of First-Order Methods for Agnostic Reinforcement Learning [66.4260157478436]
We study reinforcement learning in the policy learning setting. The goal is to find a policy whose performance is competitive with the best policy in a given class of interest.
arXiv Detail & Related papers (2025-07-06T14:40:05Z) - Provable Zero-Shot Generalization in Offline Reinforcement Learning [55.169228792596805]
We study offline reinforcement learning with the zero-shot generalization property (ZSG). Existing work showed that classical offline RL fails to generalize to new, unseen environments. We show that both PERM and PPPO are capable of finding a near-optimal policy with ZSG.
arXiv Detail & Related papers (2025-03-11T02:44:32Z) - Landscape of Policy Optimization for Finite Horizon MDPs with General State and Action [10.219627570276689]
We develop a framework for a class of Markov Decision Processes with general state and action spaces.
We show that gradient methods converge to the globally optimal policy at a nonasymptotic rate.
Our result establishes the first complexity guarantee for multi-period inventory systems.
arXiv Detail & Related papers (2024-09-25T17:56:02Z) - Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning [62.81324245896717]
We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-iterate convergence guarantees under (weak) gradient domination assumptions.
We numerically validate our algorithms on constrained control problems, and compare them with state-of-the-art baselines.
arXiv Detail & Related papers (2024-07-15T14:54:57Z) - Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z) - Gradient Informed Proximal Policy Optimization [35.22712034665224]
We introduce a novel policy learning method that integrates analytical gradients from differentiable environments with the Proximal Policy Optimization (PPO) algorithm.
By adaptively modifying the alpha value, we can effectively manage the influence of analytical policy gradients during learning.
Our proposed approach outperforms baseline algorithms in various scenarios, such as function optimization, physics simulations, and traffic control environments.
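The entry above describes mixing analytical environment gradients with PPO's gradient through an adaptively chosen alpha. A minimal sketch of such an interpolation is below; the variable names and the variance-based heuristic for choosing alpha are assumptions made for illustration, not the paper's exact rule.

```python
import numpy as np

def blended_gradient(analytic_grad, ppo_grad, alpha):
    """Interpolate analytical and PPO policy gradients with weight alpha in [0, 1]."""
    return alpha * analytic_grad + (1.0 - alpha) * ppo_grad

# Hypothetical adaptive rule: trust the analytical gradient more when its
# variance across a batch is low relative to the PPO gradient's variance.
def adaptive_alpha(analytic_grads, ppo_grads, eps=1e-8):
    va = np.var(analytic_grads, axis=0).mean()
    vp = np.var(ppo_grads, axis=0).mean()
    return vp / (va + vp + eps)
```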
arXiv Detail & Related papers (2023-12-14T07:50:21Z) - Fractal Landscapes in Policy Optimization [18.676605033973136]
Policy gradient lies at the core of deep reinforcement learning (RL) in continuous domains.
We propose a framework for understanding one inherent limitation of the policy gradient approach.
The optimization landscape in the policy space can be extremely non-smooth or fractal for certain classes of MDPs.
arXiv Detail & Related papers (2023-10-24T00:22:19Z) - Policy Optimization in a Noisy Neighborhood: On Return Landscapes in Continuous Control [24.470904615201736]
We study the return landscape: the mapping between a policy and a return.
We find that popular algorithms traverse noisy neighborhoods of this landscape, in which a single update to the policy parameters leads to a wide range of returns.
We show that the landscape exhibits surprising structure by finding simple paths in parameter space which improve the stability of a policy.
arXiv Detail & Related papers (2023-09-26T01:03:54Z) - Near-Optimal Provable Uniform Convergence in Offline Policy Evaluation for Reinforcement Learning [43.61029925616256]
Offline policy evaluation (OPE) in Reinforcement Learning (RL) is a critical step towards applying RL in real-life applications.
We address this problem by simultaneously evaluating all policies in a policy class $\Pi$ -- uniform convergence in OPE.
Our results imply that model-based planning achieves an optimal episode complexity of $\widetilde{O}(H^3/d_m\epsilon^2)$ in identifying an $\epsilon$-optimal policy.
arXiv Detail & Related papers (2020-07-07T19:44:14Z) - Deep Reinforcement Learning with Robust and Smooth Policy [90.78795857181727]
We propose to learn a smooth policy that behaves smoothly with respect to states.
We develop a new framework -- Smooth Regularized Reinforcement Learning (SR$^2$L) -- where the policy is trained with smoothness-inducing regularization.
Such regularization effectively constrains the search space, and enforces smoothness in the learned policy.
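As a rough illustration of what a smoothness-inducing regularizer of this kind can look like, the sketch below penalizes the change in a policy's output under small random perturbations of the input state; the penalty form, names, and hyperparameters are assumptions for illustration, not the paper's exact SR$^2$L objective.

```python
import numpy as np

def smoothness_penalty(policy_fn, states, epsilon=0.01, n_perturb=8, rng=None):
    """Average squared change in the policy's action output under small
    random perturbations of the input states (illustrative regularizer)."""
    rng = np.random.default_rng() if rng is None else rng
    base = policy_fn(states)                      # shape (batch, action_dim)
    total = 0.0
    for _ in range(n_perturb):
        noise = rng.uniform(-epsilon, epsilon, size=states.shape)
        total += np.mean((policy_fn(states + noise) - base) ** 2)
    return total / n_perturb

# Sketched training objective: maximize return minus lam * smoothness_penalty(...),
# where lam trades off return against policy smoothness.
```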
arXiv Detail & Related papers (2020-03-21T00:10:29Z)