Soft policy optimization using dual-track advantage estimator
- URL: http://arxiv.org/abs/2009.06858v1
- Date: Tue, 15 Sep 2020 04:09:29 GMT
- Title: Soft policy optimization using dual-track advantage estimator
- Authors: Yubo Huang, Xuechun Wang, Luobao Zou, Zhiwei Zhuang, Weidong Zhang
- Abstract summary: This paper softens proximal policy optimization by introducing an entropy term and dynamically setting the temperature coefficient to balance exploration and exploitation.
We propose the dual-track advantage estimator (DTAE) to accelerate the convergence of value functions and further enhance the performance of the algorithm.
Compared with other on-policy RL algorithms on the MuJoCo benchmark, the proposed method achieves state-of-the-art results in cumulative return.
- Score: 5.4020749513539235
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In reinforcement learning (RL), we expect the agent to explore as many states as possible in the initial stage of training and to exploit the gathered information in the subsequent stage to discover the trajectory with the highest return. Based on this principle, in this paper we soften proximal policy optimization by introducing an entropy term and dynamically setting the temperature coefficient to balance exploration and exploitation. While maximizing the expected reward, the agent also seeks other trajectories so as to avoid converging to a locally optimal policy. Nevertheless, the added randomness induced by the entropy term reduces the training speed in the early stage. By integrating the temporal-difference (TD) method and the generalized advantage estimator (GAE), we propose the dual-track advantage estimator (DTAE) to accelerate the convergence of the value functions and further enhance the performance of the algorithm. Compared with other on-policy RL algorithms on the MuJoCo benchmark, the proposed method not only significantly speeds up training but also achieves state-of-the-art cumulative returns.
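The abstract names two ingredients: a softened PPO objective, in which the reward is augmented with a policy-entropy term whose temperature coefficient is set dynamically, and a dual-track advantage estimator (DTAE) that integrates one-step TD estimates with GAE. The abstract does not give the exact combination rule or the temperature schedule, so the Python sketch below is only one plausible reading: the names soft_rewards and dual_track_advantages, the mixing weight beta, and the fixed alpha input are hypothetical placeholders rather than the paper's definitions.

```python
import numpy as np

def soft_rewards(rewards, log_probs, alpha):
    # Entropy-augmented ("soft") rewards: r_t - alpha * log pi(a_t | s_t).
    # alpha is the temperature coefficient; the paper sets it dynamically,
    # but the schedule is not given in the abstract, so it is a plain input here.
    return rewards - alpha * log_probs

def advantages(rewards, values, gamma=0.99, lam=0.95):
    # Standard GAE(lambda); with lam=0 this reduces to the one-step TD advantage.
    # values holds V(s_0..s_{T-1}); the terminal value is taken to be 0.
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        next_v = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_v - values[t]  # one-step TD error
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

def dual_track_advantages(rewards, values, log_probs, alpha,
                          gamma=0.99, lam=0.95, beta=0.5):
    # Hypothetical DTAE sketch: run a TD track (lam=0) and a GAE track (lam>0)
    # on the entropy-augmented rewards and blend them with a weight beta.
    # The actual DTAE combination rule is defined in the paper, not here.
    r_soft = soft_rewards(np.asarray(rewards, dtype=float),
                          np.asarray(log_probs, dtype=float), alpha)
    adv_td = advantages(r_soft, values, gamma, lam=0.0)   # TD track
    adv_gae = advantages(r_soft, values, gamma, lam=lam)  # GAE track
    return beta * adv_td + (1.0 - beta) * adv_gae
```

Under this reading, the TD track supplies low-variance one-step estimates that help the value function converge quickly despite the entropy-induced randomness, while the GAE track supplies longer-horizon estimates; the abstract only states that the two are integrated.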
Related papers
- Accelerated Preference Optimization for Large Language Model Alignment [60.22606527763201]
Reinforcement Learning from Human Feedback (RLHF) has emerged as a pivotal tool for aligning large language models (LLMs) with human preferences.
Direct Preference Optimization (DPO) formulates RLHF as a policy optimization problem without explicitly estimating the reward function.
We propose a general Accelerated Preference Optimization (APO) framework, which unifies many existing preference optimization algorithms.
arXiv Detail & Related papers (2024-10-08T18:51:01Z)
- Fast Two-Time-Scale Stochastic Gradient Method with Applications in Reinforcement Learning [5.325297567945828]
We propose a new method for two-time-scale optimization that achieves significantly faster convergence than the prior arts.
We characterize the proposed algorithm under various conditions and show how it specializes to online sample-based methods.
arXiv Detail & Related papers (2024-05-15T19:03:08Z)
- How does Your RL Agent Explore? An Optimal Transport Analysis of Occupancy Measure Trajectories [8.429001045596687]
We represent the learning process of an RL algorithm as a sequence of policies generated during training.
We then study the policy trajectory induced in the manifold of state-action occupancy measures.
arXiv Detail & Related papers (2024-02-14T11:55:50Z)
- Posterior Sampling with Delayed Feedback for Reinforcement Learning with Linear Function Approximation [62.969796245827006]
Delayed-PSVI is an optimistic value-based algorithm that explores the value function space via noise perturbation with posterior sampling.
We show our algorithm achieves $\widetilde{O}(\sqrt{d^3 H^3 T} + d^2 H^2 \mathbb{E}[\tau])$ worst-case regret in the presence of unknown delays.
We incorporate a gradient-based approximate sampling scheme via Langevin dynamics for Delayed-LPSVI.
arXiv Detail & Related papers (2023-10-29T06:12:43Z)
- Adversarial Style Transfer for Robust Policy Optimization in Deep Reinforcement Learning [13.652106087606471]
This paper proposes an algorithm that aims to improve generalization for reinforcement learning agents by removing overfitting to confounding features.
A policy network updates its parameters to minimize the effect of such perturbations, thus staying robust while maximizing the expected future reward.
We evaluate our approach on Procgen and Distracting Control Suite for generalization and sample efficiency.
arXiv Detail & Related papers (2023-08-29T18:17:35Z)
- Acceleration in Policy Optimization [50.323182853069184]
We work towards a unifying paradigm for accelerating policy optimization methods in reinforcement learning (RL) by integrating foresight in the policy improvement step via optimistic and adaptive updates.
We define optimism as predictive modelling of the future behavior of a policy, and adaptivity as taking immediate and anticipatory corrective actions to mitigate errors from overshooting predictions or delayed responses to change.
We design an optimistic policy gradient algorithm, adaptive via meta-gradient learning, and empirically highlight several design choices pertaining to acceleration, in an illustrative task.
arXiv Detail & Related papers (2023-06-18T15:50:57Z)
- Truncating Trajectories in Monte Carlo Reinforcement Learning [48.97155920826079]
In Reinforcement Learning (RL), an agent acts in an unknown environment to maximize the expected cumulative discounted sum of an external reward signal.
We propose an a-priori budget allocation strategy that leads to the collection of trajectories of different lengths.
We show that an appropriate truncation of the trajectories can succeed in improving performance.
arXiv Detail & Related papers (2023-05-07T19:41:57Z)
- APS: Active Pretraining with Successor Features [96.24533716878055]
We show that by reinterpreting and combining successor features (Hansen et al.) with nonparametric entropy maximization, the intractable mutual information can be efficiently optimized.
The proposed method, Active Pretraining with Successor Features (APS), explores the environment via nonparametric entropy maximization, and the explored data can be efficiently leveraged to learn behavior.
arXiv Detail & Related papers (2021-08-31T16:30:35Z)
- Momentum Accelerates the Convergence of Stochastic AUPRC Maximization [80.8226518642952]
We study optimization of areas under precision-recall curves (AUPRC), which is widely used for imbalanced tasks.
We develop novel momentum methods with a better iteration complexity of $O(1/\epsilon^4)$ for finding an $\epsilon$-stationary solution.
We also design a novel family of adaptive methods with the same complexity of $O(1/\epsilon^4)$, which enjoy faster convergence in practice.
arXiv Detail & Related papers (2021-07-02T16:21:52Z)
- Proximal Deterministic Policy Gradient [20.951797549505986]
We introduce two techniques to improve off-policy Reinforcement Learning (RL) algorithms.
We exploit the two value functions commonly employed in state-of-the-art off-policy algorithms to provide an improved action value estimate.
We demonstrate significant performance improvement over state-of-the-art algorithms on standard continuous-control RL benchmarks.
arXiv Detail & Related papers (2020-08-03T10:19:59Z)
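The Proximal Deterministic Policy Gradient entry above mentions exploiting the two value functions that state-of-the-art off-policy methods already maintain to obtain an improved action-value estimate. The exact estimator is not described here, so the sketch below only shows the standard clipped double-Q target (the common two-critic starting point in TD3/SAC-style methods); the paper's improved estimate may combine the critics differently.

```python
import numpy as np

def clipped_double_q_target(reward, q1_next, q2_next, gamma=0.99, done=False):
    # Baseline two-critic target: take the element-wise minimum of the two
    # next-state action-value estimates to curb overestimation bias.
    # The paper's improved estimate built on the two critics is not
    # reproduced here; this is only the standard baseline it starts from.
    q_next = np.minimum(q1_next, q2_next)
    return reward + gamma * (1.0 - np.asarray(done, dtype=float)) * q_next
```

For example, clipped_double_q_target(0.5, 1.2, 1.0) returns 0.5 + 0.99 * 1.0 = 1.49, using the smaller of the two next-state estimates.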