Efficient On-Policy Reinforcement Learning via Exploration of Sparse Parameter Space
- URL: http://arxiv.org/abs/2509.25876v1
- Date: Tue, 30 Sep 2025 07:13:55 GMT
- Title: Efficient On-Policy Reinforcement Learning via Exploration of Sparse Parameter Space
- Authors: Xinyu Zhang, Aishik Deb, Klaus Mueller
- Abstract summary: Policy-gradient methods such as PPO are updated along a single gradient direction, leaving the rich local structure of the parameter space unexplored. Previous work has shown that the surrogate gradient is often poorly correlated with the true reward landscape. We introduce ExploRLer, a pluggable pipeline that seamlessly integrates with on-policy algorithms such as PPO and TRPO.
- Score: 15.65017469378437
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Policy-gradient methods such as Proximal Policy Optimization (PPO) are typically updated along a single stochastic gradient direction, leaving the rich local structure of the parameter space unexplored. Previous work has shown that the surrogate gradient is often poorly correlated with the true reward landscape. Building on this insight, we visualize the parameter space spanned by policy checkpoints within an iteration and reveal that higher performing solutions often lie in nearby unexplored regions. To exploit this opportunity, we introduce ExploRLer, a pluggable pipeline that seamlessly integrates with on-policy algorithms such as PPO and TRPO, systematically probing the unexplored neighborhoods of surrogate on-policy gradient updates. Without increasing the number of gradient updates, ExploRLer achieves significant improvements over baselines in complex continuous control environments. Our results demonstrate that iteration-level exploration provides a practical and effective way to strengthen on-policy reinforcement learning and offer a fresh perspective on the limitations of the surrogate objective.
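The abstract gives no pseudocode, but the core mechanism it describes, probing the unexplored neighborhood of each surrogate-gradient update at the iteration level, can be illustrated. Below is a minimal sketch in Python, assuming a flat parameter vector and a hypothetical `evaluate_return` callable; it is not the authors' ExploRLer pipeline, which plugs into full PPO/TRPO training loops.

```python
import numpy as np

def explore_neighborhood(theta, evaluate_return, n_candidates=8, sigma=0.01,
                         rng=None):
    """Probe the neighborhood of a post-update parameter vector and keep
    the best candidate found (possibly theta itself)."""
    if rng is None:
        rng = np.random.default_rng(0)
    best_theta, best_return = theta, evaluate_return(theta)
    for _ in range(n_candidates):
        candidate = theta + sigma * rng.standard_normal(theta.shape)
        candidate_return = evaluate_return(candidate)
        if candidate_return > best_return:
            best_theta, best_return = candidate, candidate_return
    return best_theta

# Toy usage: a quadratic "return" surface peaked at the origin.
theta_after_ppo_step = np.ones(4)
theta = explore_neighborhood(theta_after_ppo_step, lambda t: -np.sum(t ** 2))
```

Note that the sketch adds no extra gradient computations, matching the abstract's claim that exploration comes without increasing the number of gradient updates; the real cost is the additional return evaluations.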
Related papers
- IterResearch: Rethinking Long-Horizon Agents via Markovian State Reconstruction [107.49922328855025]
IterResearch is a novel iterative deep-research paradigm that reformulates long-horizon research as a Markov Decision Process. It achieves substantial improvements over existing open-source agents, with an average gain of +14.5pp across six benchmarks. It also serves as an effective prompting strategy, improving frontier models by up to 19.2pp over ReAct on long-horizon tasks.
arXiv Detail & Related papers (2025-11-10T17:30:08Z)
- GRPO-Guard: Mitigating Implicit Over-Optimization in Flow Matching via Regulated Clipping [63.33669214116784]
GRPO-Guard is a simple yet effective enhancement to existing GRPO frameworks. It restores a balanced and step-consistent importance ratio, ensuring that PPO clipping properly constrains harmful updates. It substantially mitigates implicit over-optimization without relying on heavy KL regularization.
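The summary presupposes PPO's clipping mechanism: clipping only constrains harmful updates when the importance ratio is computed on a consistent, well-scaled basis, which is what GRPO-Guard is said to restore. As a reference point, here is a minimal sketch of the standard clipped surrogate (not GRPO-Guard's regulation rule, which the summary does not specify):

```python
import numpy as np

def ppo_clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Standard PPO clipped objective (to be maximized).

    If the ratio is systematically mis-scaled, clipping no longer
    bounds the update, which is the failure mode GRPO-Guard targets.
    """
    ratio = np.exp(logp_new - logp_old)          # importance ratio
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return np.minimum(unclipped, clipped).mean()
```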
arXiv Detail & Related papers (2025-10-25T14:51:17Z)
- Polychromic Objectives for Reinforcement Learning [63.37185057794815]
Reinforcement learning fine-tuning (RLFT) is a dominant paradigm for improving pretrained policies for downstream tasks. We introduce an objective for policy methods that explicitly enforces the exploration and refinement of diverse generations. We show how proximal policy optimization (PPO) can be adapted to optimize this objective.
arXiv Detail & Related papers (2025-09-29T19:32:11Z)
- Stochastic Path Planning in Correlated Obstacle Fields [1.8184089804625951]
We introduce the Stochastic Correlated Obstacle Scene (SCOS) problem, a navigation setting with spatially correlated obstacles of uncertain status. We develop Bayesian belief updates that refine blockage probabilities, and use the posteriors to reduce the search space for efficiency. This framework addresses navigation challenges in environments with adversarial interruptions or clustered natural hazards.
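The belief update itself is only named here. As a rough, hypothetical illustration, the sketch below applies Bayes' rule to a single obstacle's blockage probability given one noisy sensor reading; the sensor model (`true_positive`, `false_positive`) is an assumption, and the paper's correlated updates across neighboring obstacles go beyond this single-obstacle case.

```python
def update_blockage_belief(prior_blocked, observed_blocked,
                           true_positive=0.9, false_positive=0.1):
    """Posterior P(blocked | one noisy reading) via Bayes' rule."""
    if observed_blocked:
        lik_blocked, lik_free = true_positive, false_positive
    else:
        lik_blocked, lik_free = 1.0 - true_positive, 1.0 - false_positive
    evidence = lik_blocked * prior_blocked + lik_free * (1.0 - prior_blocked)
    return lik_blocked * prior_blocked / evidence

# Example: a 30% prior belief, sensor reports "blocked".
posterior = update_blockage_belief(0.3, True)   # ~0.794
```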
arXiv Detail & Related papers (2025-09-23T20:30:35Z)
- Policy Gradient with Tree Search: Avoiding Local Optimas through Lookahead [45.63877278757336]
Policy Gradient with Tree Search (PGTS) is an approach that integrates an $m$-step lookahead mechanism to enhance policy optimization. We provide theoretical analysis demonstrating that increasing the tree search depth $m$ monotonically reduces the set of undesirable stationary points. Empirical evaluations on diverse MDP structures, including Ladder, Tightrope, and Gridworld environments, illustrate PGTS's ability to exhibit "farsightedness."
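How the lookahead couples to the policy-gradient update is not detailed in this summary. The sketch below shows only the $m$-step lookahead backup itself on a small tabular MDP; the transition tensor `P`, reward matrix `R`, and bootstrap values `V` are assumed inputs, not the paper's setup.

```python
import numpy as np

def lookahead_value(P, R, V, s, m, gamma=0.99):
    """Exact m-step lookahead: apply the Bellman optimality backup m times
    from state s, bootstrapping with the value estimate V at the horizon.

    P : (n_states, n_actions, n_states) transition probabilities
    R : (n_states, n_actions) rewards
    V : (n_states,) value estimates used at depth m
    """
    if m == 0:
        return V[s]
    n_actions, n_states = P.shape[1], P.shape[2]
    q_values = [
        R[s, a] + gamma * sum(
            P[s, a, s2] * lookahead_value(P, R, V, s2, m - 1, gamma)
            for s2 in range(n_states)
        )
        for a in range(n_actions)
    ]
    return max(q_values)
```

Deeper lookahead sees past locally attractive but globally poor actions, which is the intuition behind the monotonic reduction of undesirable stationary points.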
arXiv Detail & Related papers (2025-06-08T09:28:11Z)
- Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning [62.81324245896717]
We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-iterate convergence guarantees under (weak) gradient domination assumptions.
We numerically validate our algorithms on constrained control problems, and compare them with state-of-the-art baselines.
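The summary does not spell out C-PG's update. For orientation only, here is a generic Lagrangian primal-dual step for "maximize return subject to a cost budget," a standard template for constrained policy search; it should not be read as the paper's algorithm.

```python
def primal_dual_step(theta, lam, grad_return, grad_cost, cost_value,
                     budget, lr_theta=1e-2, lr_lam=1e-2):
    """One generic primal-dual update for constrained policy search.

    Primal: ascend the Lagrangian in the policy parameters theta.
    Dual:   adjust the multiplier lam toward constraint satisfaction.
    """
    theta = theta + lr_theta * (grad_return(theta) - lam * grad_cost(theta))
    lam = max(0.0, lam + lr_lam * (cost_value(theta) - budget))
    return theta, lam
```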
arXiv Detail & Related papers (2024-07-15T14:54:57Z)
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches for dealing with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used during learning to optimize the trade-off between sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
- Identifying Policy Gradient Subspaces [42.75990181248372]
Policy gradient methods hold great potential for solving complex continuous control tasks.
Recent work indicates that supervised learning can be accelerated by leveraging the fact that gradients lie in a low-dimensional and slowly-changing subspace.
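That claim is straightforward to probe empirically: stack a window of recent flattened gradients and measure how much of their energy the top few singular directions capture. The sketch below does this with an SVD; the window size and rank `k` are illustrative choices, not the paper's protocol.

```python
import numpy as np

def gradient_subspace(grads, k=5):
    """Estimate a k-dimensional gradient subspace from recent gradients.

    grads : (n_steps, n_params) array of flattened gradient vectors
    Returns the top-k right singular vectors (a basis for the subspace)
    and the fraction of total gradient energy they capture.
    """
    _, singular_values, vt = np.linalg.svd(grads, full_matrices=False)
    energy = np.sum(singular_values[:k] ** 2) / np.sum(singular_values ** 2)
    return vt[:k], energy

# If energy is close to 1.0, recent gradients mostly lie in the
# k-dimensional subspace spanned by the returned basis.
basis, energy = gradient_subspace(np.random.default_rng(0).normal(size=(32, 100)))
```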
arXiv Detail & Related papers (2024-01-12T14:40:55Z)
- Policy Optimization in a Noisy Neighborhood: On Return Landscapes in Continuous Control [24.470904615201736]
We study the return landscape: the mapping between a policy and a return.
We find that popular algorithms traverse noisy neighborhoods of this landscape, in which a single update to the policy parameters leads to a wide range of returns.
We show that the landscape exhibits surprising structure by finding simple paths in parameter space which improve the stability of a policy.
arXiv Detail & Related papers (2023-09-26T01:03:54Z)
- Reparameterized Policy Learning for Multimodal Trajectory Optimization [61.13228961771765]
We investigate the challenge of parametrizing policies for reinforcement learning in high-dimensional continuous action spaces.
We propose a principled framework that models the continuous RL policy as a generative model of optimal trajectories.
We present a practical model-based RL method, which leverages the multimodal policy parameterization and learned world model.
arXiv Detail & Related papers (2023-07-20T09:05:46Z)
- Entropy Augmented Reinforcement Learning [0.0]
We propose a shifted Markov decision process (MDP) to encourage exploration and reinforce the agent's ability to escape local optima.
Our experiments test augmented TRPO and PPO on MuJoCo benchmark tasks, indicating that the agent is steered toward higher-reward regions.
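The "shifted MDP" is described only at a high level here. One common concrete instance is shifting the reward by a policy-entropy bonus, sketched below; the coefficient `alpha` and this exact form are assumptions, not necessarily the paper's shift.

```python
import numpy as np

def entropy_shifted_reward(reward, action_probs, alpha=0.01):
    """Reward plus an entropy bonus: higher-entropy (more exploratory)
    policies are rewarded, discouraging premature convergence."""
    entropy = -np.sum(action_probs * np.log(action_probs + 1e-8))
    return reward + alpha * entropy

# Example: a uniform 4-action policy earns the maximal bonus.
shifted = entropy_shifted_reward(1.0, np.full(4, 0.25))
```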
arXiv Detail & Related papers (2022-08-19T13:09:32Z)
- Deep Reinforcement Learning with Robust and Smooth Policy [90.78795857181727]
We propose to learn a policy that behaves smoothly with respect to states.
We develop a new framework, Smooth Regularized Reinforcement Learning (SR$^2$L), where the policy is trained with smoothness-inducing regularization.
Such regularization effectively constrains the search space, and enforces smoothness in the learned policy.
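As a minimal sketch of smoothness-inducing regularization (under assumptions, not necessarily SR$^2$L's exact form): penalize the distance between the policy's action at a state and at a nearby perturbed state, then add the penalty to the usual policy loss.

```python
import numpy as np

def smoothness_penalty(policy, state, epsilon=0.01, rng=None):
    """Squared distance between actions at a state and a perturbed state.

    policy : callable mapping a state vector to an action vector
    """
    if rng is None:
        rng = np.random.default_rng(0)
    perturbed_state = state + epsilon * rng.standard_normal(state.shape)
    return float(np.sum((policy(state) - policy(perturbed_state)) ** 2))

# Training would minimize: policy_loss + coeff * smoothness_penalty(...)
penalty = smoothness_penalty(lambda s: np.tanh(s), np.ones(3))
```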
arXiv Detail & Related papers (2020-03-21T00:10:29Z)