Massively Scaling Explicit Policy-conditioned Value Functions
- URL: http://arxiv.org/abs/2502.11949v1
- Date: Mon, 17 Feb 2025 16:02:54 GMT
- Title: Massively Scaling Explicit Policy-conditioned Value Functions
- Authors: Nico Bohlinger, Jan Peters
- Abstract summary: We introduce a scaling strategy for Explicit Policy-Conditioned Value Functions (EPVFs). EPVFs learn a value function V(theta) that is explicitly conditioned on the policy parameters, enabling direct gradient-based updates to the parameters of any policy. We show that EPVFs can be scaled to solve complex tasks, such as a custom Ant environment, and can compete with state-of-the-art Deep Reinforcement Learning (DRL) baselines.
- Score: 16.387595437722613
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a scaling strategy for Explicit Policy-Conditioned Value Functions (EPVFs) that significantly improves performance on challenging continuous-control tasks. EPVFs learn a value function V(theta) that is explicitly conditioned on the policy parameters, enabling direct gradient-based updates to the parameters of any policy. However, EPVFs at scale struggle with unrestricted parameter growth and efficient exploration in the policy parameter space. To address these issues, we utilize massive parallelization with GPU-based simulators, big batch sizes, weight clipping and scaled perturbations. Our results show that EPVFs can be scaled to solve complex tasks, such as a custom Ant environment, and can compete with state-of-the-art Deep Reinforcement Learning (DRL) baselines like Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC). We further explore action-based policy parameter representations from previous work and specialized neural network architectures to efficiently handle weight-space features, which have not been used in the context of DRL before.
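The loop described in the abstract (perturb the policy parameters, evaluate many perturbed policies in parallel, fit V(theta), ascend V with respect to theta, clip the weights) can be illustrated with a toy sketch. This is not the authors' implementation: the return function, the quadratic least-squares value model, and every name and constant below (`TARGET`, `CLIP`, `SIGMA`, `BATCH`, `LR`, `rollout_return`, `fit_value`, `value_grad`, `train`) are hypothetical stand-ins chosen so the example runs with NumPy alone, whereas the paper uses neural networks and GPU-based simulators.

```python
import numpy as np

rng = np.random.default_rng(0)

TARGET = np.array([0.5, -0.3, 0.2])  # hypothetical optimum of the toy task
CLIP = 1.0    # weight-clipping bound on policy parameters (assumed value)
SIGMA = 0.3   # scale of the parameter-space perturbations (assumed value)
BATCH = 64    # perturbed policies evaluated per iteration ("parallel envs")
LR = 0.25     # step size for gradient ascent on theta (assumed value)


def rollout_return(theta):
    """Toy stand-in for an episode return; highest when theta == TARGET."""
    return -np.sum((theta - TARGET) ** 2, axis=-1)


def features(theta):
    """Quadratic features [1, theta, theta^2] of the policy parameters."""
    theta = np.atleast_2d(theta)
    ones = np.ones((theta.shape[0], 1))
    return np.concatenate([ones, theta, theta ** 2], axis=1)


def fit_value(thetas, returns):
    """Least-squares fit of V(theta) = features(theta) @ w."""
    w, *_ = np.linalg.lstsq(features(thetas), returns, rcond=None)
    return w


def value_grad(w, theta):
    """Analytic gradient dV/dtheta of the quadratic value model."""
    d = theta.shape[0]
    return w[1:1 + d] + 2.0 * w[1 + d:] * theta


def train(iterations=30):
    theta = np.zeros(3)
    for _ in range(iterations):
        # Explore in parameter space with scaled Gaussian perturbations.
        noise = SIGMA * rng.standard_normal((BATCH, theta.shape[0]))
        thetas = np.clip(theta + noise, -CLIP, CLIP)
        returns = rollout_return(thetas)
        # Fit the policy-conditioned value function on (theta, return) pairs.
        w = fit_value(thetas, returns)
        # Ascend V directly with respect to theta, then clip the weights.
        theta = np.clip(theta + LR * value_grad(w, theta), -CLIP, CLIP)
    return theta


theta_final = train()
print(theta_final)
```

Because the toy return is exactly quadratic, the fitted model recovers it and the parameters move toward `TARGET`; with a learned neural value function and real rollouts, neither property is guaranteed, which is where the clipping and perturbation scaling discussed in the abstract become relevant.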
Related papers
- Learning Policy Representations for Steerable Behavior Synthesis [80.4542176039074]
Given a Markov decision process (MDP), we seek to learn representations for a range of policies to facilitate behavior steering at test time. We show that these representations can be approximated uniformly for a range of policies using a set-based architecture. We use a variational generative approach to induce a smooth latent space, and further shape it with contrastive learning so that latent distances align with differences in value functions.
arXiv Detail & Related papers (2026-01-29T21:52:06Z) - Latent Spherical Flow Policy for Reinforcement Learning with Combinatorial Actions [31.697208397735395]
Existing approaches embed task-specific value functions into constrained optimization programs or learn deterministic structured policies, sacrificing generality and policy expressiveness. We propose a solver-induced latent spherical flow policy that brings the expressiveness of modern generative policies to RL while guaranteeing feasibility by design. Our approach outperforms state-of-the-art baselines by an average of 20.6% across a range of challenging RL tasks.
arXiv Detail & Related papers (2026-01-29T18:49:07Z) - Relative Entropy Pathwise Policy Optimization [56.86405621176669]
We show how to construct a value-gradient driven, on-policy algorithm that allows training Q-value models purely from on-policy data. We propose Relative Entropy Pathwise Policy Optimization (REPPO), an efficient on-policy algorithm that combines the sample-efficiency of pathwise policy gradients with the simplicity and minimal memory footprint of standard on-policy learning.
arXiv Detail & Related papers (2025-07-15T06:24:07Z) - Diffusion Policy Policy Optimization [37.04382170999901]
Diffusion Policy Policy Optimization (DPPO) is an algorithmic framework for fine-tuning diffusion-based policies. DPPO achieves the strongest overall performance and efficiency for fine-tuning in common benchmarks. We show that DPPO takes advantage of unique synergies between RL fine-tuning and the diffusion parameterization.
arXiv Detail & Related papers (2024-09-01T02:47:50Z) - SAPG: Split and Aggregate Policy Gradients [37.433915947580076]
We propose a new on-policy RL algorithm that can effectively leverage large-scale environments by splitting them into chunks and fusing them back together via importance sampling.
Our algorithm, termed SAPG, shows significantly higher performance across a variety of challenging environments where vanilla PPO and other strong baselines fail to achieve high performance.
arXiv Detail & Related papers (2024-07-29T17:59:50Z) - Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z) - Reparameterized Policy Learning for Multimodal Trajectory Optimization [61.13228961771765]
We investigate the challenge of parametrizing policies for reinforcement learning in high-dimensional continuous action spaces.
We propose a principled framework that models the continuous RL policy as a generative model of optimal trajectories.
We present a practical model-based RL method, which leverages the multimodal policy parameterization and learned world model.
arXiv Detail & Related papers (2023-07-20T09:05:46Z) - Diverse Policy Optimization for Structured Action Space [59.361076277997704]
We propose Diverse Policy Optimization (DPO) to model policies in structured action spaces as energy-based models (EBMs).
A novel and powerful generative model, GFlowNet, is introduced as the efficient, diverse EBM-based policy sampler.
Experiments on ATSC and Battle benchmarks demonstrate that DPO can efficiently discover surprisingly diverse policies.
arXiv Detail & Related papers (2023-02-23T10:48:09Z) - Safe Policy Improvement for POMDPs via Finite-State Controllers [6.022036788651133]
We study safe policy improvement (SPI) for partially observable Markov decision processes (POMDPs).
SPI methods neither require access to a model nor the environment itself, and aim to reliably improve the behavior policy in an offline manner.
We show that this new policy, converted into a new FSC for the (unknown) POMDP, outperforms the behavior policy with high probability.
arXiv Detail & Related papers (2023-01-12T11:22:54Z) - Improved Policy Optimization for Online Imitation Learning [17.450401609682544]
We consider online imitation learning (OIL), where the task is to find a policy that imitates the behavior of an expert via active interaction with the environment.
arXiv Detail & Related papers (2022-07-29T22:02:14Z) - A general class of surrogate functions for stable and efficient reinforcement learning [45.31904153659212]
We propose a general framework based on functional mirror ascent that gives rise to an entire family of surrogate functions.
We construct surrogate functions that enable policy improvement guarantees, a property not shared by most existing surrogate functions.
The proposed framework also suggests an improved variant of PPO, whose robustness and efficiency we empirically demonstrate.
arXiv Detail & Related papers (2021-08-12T16:19:19Z) - Policy Information Capacity: Information-Theoretic Measure for Task Complexity in Deep Reinforcement Learning [83.66080019570461]
We propose two environment-agnostic, algorithm-agnostic quantitative metrics for task difficulty.
We show that these metrics have higher correlations with normalized task solvability scores than a variety of alternatives.
These metrics can also be used for fast and compute-efficient optimizations of key design parameters.
arXiv Detail & Related papers (2021-03-23T17:49:50Z) - Iterative Amortized Policy Optimization [147.63129234446197]
Policy networks are a central feature of deep reinforcement learning (RL) algorithms for continuous control.
From the variational inference perspective, policy networks are a form of amortized optimization, optimizing network parameters rather than the policy distributions directly.
We demonstrate that iterative amortized policy optimization yields performance improvements over direct amortization on benchmark continuous control tasks.
arXiv Detail & Related papers (2020-10-20T23:25:42Z) - Deep Reinforcement Learning with Robust and Smooth Policy [90.78795857181727]
We propose to learn a smooth policy that behaves smoothly with respect to states.
We develop a new framework, Smooth Regularized Reinforcement Learning (SR$^2$L), where the policy is trained with smoothness-inducing regularization.
Such regularization effectively constrains the search space, and enforces smoothness in the learned policy.
arXiv Detail & Related papers (2020-03-21T00:10:29Z) - Kalman meets Bellman: Improving Policy Evaluation through Value Tracking [59.691919635037216]
Policy evaluation is a key process in Reinforcement Learning (RL).
We devise an optimization method called Kalman Optimization for Value Approximation (KOVA).
KOVA minimizes a regularized objective function that concerns both parameter and noisy return uncertainties.
arXiv Detail & Related papers (2020-02-17T13:30:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.