Related papers: Overcoming Non-stationary Dynamics with Evidential Proximal Policy Optimization

URL: http://arxiv.org/abs/2503.01468v2
Date: Fri, 23 May 2025 13:45:49 GMT
Title: Overcoming Non-stationary Dynamics with Evidential Proximal Policy Optimization
Authors: Abdullah Akgül, Gulcin Baykal, Manuel Haußmann, Melih Kandemir
Abstract summary: Continuous control of non-stationary environments is a major challenge for deep reinforcement learning algorithms. We show that performing on-policy reinforcement learning with an evidential critic provides both of these properties. We name the resulting algorithm $\textit{Evidential Proximal Policy Optimization (EPPO)}$ due to the integral role of evidential uncertainty in both policy evaluation and policy improvement stages.
Score: 11.642505299142956
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Continuous control of non-stationary environments is a major challenge for deep reinforcement learning algorithms. The time-dependency of the state transition dynamics aggravates the notorious stability problems of model-free deep actor-critic architectures. We posit that two properties will play a key role in overcoming non-stationarity in transition dynamics: (i) preserving the plasticity of the critic network, (ii) directed exploration for rapid adaptation to the changing dynamics. We show that performing on-policy reinforcement learning with an evidential critic provides both of these properties. The evidential design ensures a fast and sufficiently accurate approximation to the uncertainty around the state-value, which maintains the plasticity of the critic network by detecting the distributional shifts caused by the change in dynamics. The probabilistic critic also makes the actor training objective a random variable, enabling the use of directed exploration approaches as a by-product. We name the resulting algorithm $\textit{Evidential Proximal Policy Optimization (EPPO)}$ due to the integral role of evidential uncertainty quantification in both policy evaluation and policy improvement stages. Through experiments on non-stationary continuous control tasks, where the environment dynamics change at regular intervals, we demonstrate that our algorithm outperforms state-of-the-art on-policy reinforcement learning variants in both task-specific and overall return.

Related papers

Decentralized Nonconvex Composite Federated Learning with Gradient Tracking and Momentum [78.27945336558987]
Decentralized federated learning (DFL) eliminates reliance on the server-client architecture. Non-smooth regularization is often incorporated into machine learning tasks. We propose a novel DNCFL algorithm to solve these problems.
arXiv Detail & Related papers (2025-04-17T08:32:25Z)
Exploration and Adaptation in Non-Stationary Tasks with Diffusion Policies [0.0]
This paper investigates the application of Diffusion Policy in non-stationary, vision-based RL settings, specifically targeting environments where task dynamics and objectives evolve over time. We apply Diffusion Policy -- which leverages iterative denoising to refine latent action representations -- to benchmark environments including Procgen and PointMaze. Our experiments demonstrate that, despite increased computational demands, Diffusion Policy consistently outperforms standard RL methods such as PPO and DQN, achieving higher mean and maximum rewards with reduced variability.
arXiv Detail & Related papers (2025-03-31T23:00:07Z)
Contractive Dynamical Imitation Policies for Efficient Out-of-Sample Recovery [3.549243565065057]
Imitation learning is a data-driven approach to learning policies from expert behavior. It is prone to unreliable outcomes in out-of-sample (OOS) regions. We propose a framework for learning policies modeled by contractive dynamical systems.
arXiv Detail & Related papers (2024-12-10T14:28:18Z)
Accelerating Proximal Policy Optimization Learning Using Task Prediction for Solving Environments with Delayed Rewards [8.455772877963792]
We introduce two key enhancements to PPO: a hybrid policy architecture that combines an offline policy with an online PPO policy, and a reward shaping mechanism using Time Window Temporal Logic (TWTL). We demonstrate the effectiveness of our approach through extensive experiments on inverted pendulum and lunar lander environments.
arXiv Detail & Related papers (2024-11-26T20:22:31Z)
Reinforcement Learning under Latent Dynamics: Toward Statistical and Algorithmic Modularity [51.40558987254471]
Real-world applications of reinforcement learning often involve environments where agents operate on complex, high-dimensional observations. This paper addresses the question of reinforcement learning under $\textit{general}$ latent dynamics from a statistical and algorithmic perspective.
arXiv Detail & Related papers (2024-10-23T14:22:49Z)
Rich-Observation Reinforcement Learning with Continuous Latent Dynamics [43.84391209459658]
We introduce a new theoretical framework, RichCLD (Rich-Observation RL with Continuous Latent Dynamics), in which the agent performs control based on high-dimensional observations. Our main contribution is a new algorithm for this setting that is provably statistically and computationally efficient.
arXiv Detail & Related papers (2024-05-29T17:02:49Z)
Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems. In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version. We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
Statistically Efficient Variance Reduction with Double Policy Estimation for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning [53.97273491846883]
We propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation. We validate our method in multiple tasks of OpenAI Gym with D4RL benchmarks.
arXiv Detail & Related papers (2023-08-28T20:46:07Z)
Improved Regret for Efficient Online Reinforcement Learning with Linear Function Approximation [69.0695698566235]
We study reinforcement learning with linear function approximation and adversarially changing cost functions. We present a computationally efficient policy optimization algorithm for the challenging general setting of unknown dynamics and bandit feedback.
arXiv Detail & Related papers (2023-01-30T17:26:39Z)
The Role of Baselines in Policy Gradient Optimization [83.42050606055822]
We show that the state value baseline allows on-policy natural policy gradient (NPG) to converge to a globally optimal policy at an $O(1/t)$ rate. We find that the primary effect of the value baseline is to reduce the aggressiveness of the updates rather than their variance.
arXiv Detail & Related papers (2023-01-16T06:28:00Z)
Robust Policy Optimization in Deep Reinforcement Learning [16.999444076456268]
In continuous action domains, a parameterized action distribution allows easy control of exploration. In particular, we propose an algorithm called Robust Policy Optimization (RPO), which leverages a perturbed distribution. We evaluated our methods on various continuous control tasks from DeepMind Control, OpenAI Gym, PyBullet, and IsaacGym.
arXiv Detail & Related papers (2022-12-14T22:43:56Z)
Model-Based Offline Reinforcement Learning with Pessimism-Modulated Dynamics Belief [3.0036519884678894]
Model-based offline reinforcement learning (RL) aims to find a highly rewarding policy by leveraging a previously collected static dataset and a dynamics model. In this work, we maintain a belief distribution over dynamics, and evaluate/optimize the policy through biased sampling from the belief. We show that the biased sampling naturally induces an updated dynamics belief with a policy-dependent reweighting factor, termed Pessimism-Modulated Dynamics Belief.
arXiv Detail & Related papers (2022-10-13T03:14:36Z)
Learning Dynamics and Generalization in Reinforcement Learning [59.530058000689884]
We show theoretically that temporal difference learning encourages agents to fit non-smooth components of the value function early in training. We show that neural networks trained using temporal difference algorithms on dense reward tasks exhibit weaker generalization between states than randomly initialized networks and networks trained with policy gradient methods.
arXiv Detail & Related papers (2022-06-05T08:49:16Z)
Learning Robust Policy against Disturbance in Transition Dynamics via State-Conservative Policy Optimization [63.75188254377202]
Deep reinforcement learning algorithms can perform poorly in real-world tasks due to the discrepancy between source and target environments. We propose a novel model-free actor-critic algorithm, State-Conservative Policy Optimization (SCPO), to learn robust policies without modeling the disturbance in advance. Experiments in several robot control tasks demonstrate that SCPO learns robust policies against the disturbance in transition dynamics.
arXiv Detail & Related papers (2021-12-20T13:13:05Z)
Policy Smoothing for Provably Robust Reinforcement Learning [109.90239627115336]
We study the provable robustness of reinforcement learning against norm-bounded adversarial perturbations of the inputs. We generate certificates that guarantee that the total reward obtained by the smoothed policy will not fall below a certain threshold under a norm-bounded adversarial perturbation of the input.
arXiv Detail & Related papers (2021-06-21T21:42:08Z)
Robust Value Iteration for Continuous Control Tasks [99.00362538261972]
When transferring a control policy from simulation to a physical system, the policy needs to be robust to variations in the dynamics to perform well. We present Robust Fitted Value Iteration, which uses dynamic programming to compute the optimal value function on the compact state domain. We show that Robust Fitted Value Iteration is more robust than deep reinforcement learning algorithms and the non-robust version of the algorithm.
arXiv Detail & Related papers (2021-05-25T19:48:35Z)
Decoupling Value and Policy for Generalization in Reinforcement Learning [20.08992844616678]
We argue that more information is needed to accurately estimate the value function than to learn the optimal policy. We propose two approaches which are combined to create IDAAC: Invariant Decoupled Advantage Actor-Critic. IDAAC shows good generalization to unseen environments, achieving a new state-of-the-art on the Procgen benchmark and outperforming popular methods on DeepMind Control tasks with distractors.
arXiv Detail & Related papers (2021-02-20T12:40:11Z)
Dynamic Regret of Policy Optimization in Non-stationary Environments [120.01408308460095]
We propose two model-free policy optimization algorithms, POWER and POWER++, and establish guarantees for their dynamic regret. We show that POWER++ improves over POWER on the second component of the dynamic regret by actively adapting to non-stationarity through prediction. To the best of our knowledge, our work is the first dynamic regret analysis of model-free RL algorithms in non-stationary environments.
arXiv Detail & Related papers (2020-06-30T23:34:37Z)
Adaptive Approximate Policy Iteration [22.915651391812187]
We present a learning scheme which enjoys a $\tilde{O}(T^{2/3})$ regret bound for undiscounted, continuing learning in uniformly ergodic MDPs. This is an improvement over the best existing bound of $\tilde{O}(T^{3/4})$ for the average-reward case with function approximation.
arXiv Detail & Related papers (2020-02-08T02:27:03Z)
Neural Proximal/Trust Region Policy Optimization Attains Globally Optimal Policy [119.12515258771302]
We show that a variant of PPO equipped with over-parametrized neural networks converges to the globally optimal policy. The key to our analysis is the convergence of infinite-dimensional mirror descent under a notion of one-point monotonicity, where the gradient and iterate are instantiated by neural networks.
arXiv Detail & Related papers (2019-06-25T03:20:04Z)

This list is automatically generated from the titles and abstracts of the papers on this site.