Combing Policy Evaluation and Policy Improvement in a Unified
f-Divergence Framework
- URL: http://arxiv.org/abs/2109.11867v1
- Date: Fri, 24 Sep 2021 10:20:46 GMT
- Title: Combing Policy Evaluation and Policy Improvement in a Unified
f-Divergence Framework
- Authors: Chen Gong, Qiang He, Yunpeng Bai, Xiaoyu Chen, Xinwen Hou, Yu Liu,
Guoliang Fan
- Abstract summary: We study the f-divergence between the learning policy and the sampling policy and derive a novel DRL framework, termed f-Divergence Reinforcement Learning (FRL).
The FRL framework offers two advantages: (1) the policy evaluation and policy improvement processes are derived simultaneously from the f-divergence; (2) the overestimation issue of the value function is alleviated.
- Score: 33.90259939664709
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The framework of deep reinforcement learning (DRL) provides a powerful
and widely applicable mathematical formalization for sequential decision-making.
In this paper, we start by studying the f-divergence between the learning policy
and the sampling policy and derive a novel DRL framework, termed f-Divergence
Reinforcement Learning (FRL). We highlight that the policy evaluation and policy
improvement phases are induced by minimizing the f-divergence between the
learning policy and the sampling policy, which is distinct from the conventional
DRL objective of maximizing the expected cumulative reward. In addition, we
convert this framework into a saddle-point optimization problem with a specific
f function via the Fenchel conjugate; the resulting problem consists of policy
evaluation and policy improvement. We then derive new policy evaluation and
policy improvement methods within FRL. Our framework may provide new insights
for analyzing DRL algorithms. The FRL framework offers two advantages: (1) the
policy evaluation and policy improvement processes are derived simultaneously
from the f-divergence; (2) the overestimation issue of the value function is
alleviated. To evaluate the effectiveness of the FRL framework, we conduct
experiments on Atari 2600 video games, which show that our framework matches or
surpasses the DRL algorithms we tested.
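The saddle-point construction described above rests on the Fenchel conjugate of the f function. As a concrete illustration, the sketch below writes the standard variational (Fenchel-conjugate) form of an f-divergence, D_f(pi || mu) = sup_T E_{a~pi}[T] - E_{a~mu}[f*(T)], as a pair of opposing losses. This is a generic sketch under that representation, not the paper's exact objective; the critic network, the KL choice f(u) = u log u, and all names are assumptions introduced here.

```python
# Generic variational f-divergence saddle point: a critic T ascends the bound
# (loosely, "policy evaluation") while the learning policy pi descends it
# (loosely, "policy improvement"). Illustrative only; not the FRL algorithm itself.
import torch


def f_conjugate_kl(t: torch.Tensor) -> torch.Tensor:
    """Fenchel conjugate of f(u) = u * log(u), i.e. f*(t) = exp(t - 1)."""
    return torch.exp(t - 1.0)


def saddle_point_losses(critic, states, actions_pi, actions_mu):
    """critic(s, a) -> per-sample scalar; actions_pi ~ pi(.|s), actions_mu ~ mu(.|s)."""
    t_pi = critic(states, actions_pi)                  # critic on learning-policy actions
    t_mu = critic(states, actions_mu)                  # critic on sampling-policy actions
    bound = t_pi.mean() - f_conjugate_kl(t_mu).mean()  # variational lower bound on D_f
    critic_loss = -bound  # maximize the bound w.r.t. the critic parameters
    actor_loss = bound    # minimize the bound w.r.t. pi (gradients reach pi through
                          # reparameterized samples in actions_pi)
    return critic_loss, actor_loss
```

In a training loop, the two losses would be stepped alternately with separate optimizers for the critic and the policy.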
Related papers
- Reflective Policy Optimization [20.228281670899204]
Reflective Policy Optimization (RPO) amalgamates past and future state-action information for policy optimization.
RPO empowers the agent with introspection, allowing it to modify its actions within the current state.
Empirical results demonstrate RPO's feasibility and efficacy in two reinforcement learning benchmarks.
arXiv Detail & Related papers (2024-06-06T01:46:49Z)
- Statistically Efficient Variance Reduction with Double Policy Estimation for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning [53.97273491846883]
We propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation.
We validate our method on multiple OpenAI Gym tasks with D4RL benchmarks.
arXiv Detail & Related papers (2023-08-28T20:46:07Z)
- Counterfactual Explanation Policies in RL [3.674863913115432]
COUNTERPOL is the first framework to analyze Reinforcement Learning policies using counterfactual explanations.
We establish a theoretical connection between Counterpol and widely used trust region-based policy optimization methods in RL.
arXiv Detail & Related papers (2023-07-25T01:14:56Z)
- Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average-reward setting with a variance risk criterion.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
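For context, one standard identity (not necessarily this paper's exact derivation) shows how a Fenchel dual variable arises when handling the variance: the Fenchel conjugate of the square function gives (E[X])^2 = max_y (2y E[X] - y^2), so

```latex
\mathrm{Var}[X] \;=\; \mathbb{E}[X^2] - \big(\mathbb{E}[X]\big)^2
             \;=\; \min_{y \in \mathbb{R}} \, \mathbb{E}\big[(X - y)^2\big],
\qquad y^{*} = \mathbb{E}[X],
```

which is why a dual variable y can be updated in alternation with the policy and the Lagrange multiplier.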
arXiv Detail & Related papers (2020-12-28T05:02:26Z)
- CRPO: A New Approach for Safe Reinforcement Learning with Convergence Guarantee [61.176159046544946]
In safe reinforcement learning (SRL) problems, an agent explores the environment to maximize an expected total reward while avoiding violations of certain constraints.
This is the first analysis of SRL algorithms with globally optimal policies.
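As a reference point, the SRL setting described here is usually formalized as a constrained Markov decision process; the discounted template below is generic, with cost functions c_i and thresholds d_i as placeholders rather than CRPO's exact statement:

```latex
\max_{\pi} \; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right]
\quad \text{s.t.} \quad
\mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, c_i(s_t, a_t)\right] \le d_i,
\qquad i = 1, \dots, m.
```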
arXiv Detail & Related papers (2020-11-11T16:05:14Z)
- Expert-Supervised Reinforcement Learning for Offline Policy Learning and Evaluation [21.703965401500913]
We propose an Expert-Supervised RL (ESRL) framework which uses uncertainty quantification for offline policy learning.
In particular, we make three contributions: 1) the method can learn safe and optimal policies through hypothesis testing, 2) ESRL allows for different levels of risk-averse implementations tailored to the application context, and 3) we propose a way to interpret ESRL's policy at every state through posterior distributions.
arXiv Detail & Related papers (2020-06-23T17:43:44Z)
- Reinforcement Learning [36.664136621546575]
Reinforcement learning (RL) is a general framework for adaptive control, which has proven to be efficient in many domains.
In this chapter, we present the basic framework of RL and recall the two main families of approaches that have been developed to learn a good policy.
arXiv Detail & Related papers (2020-05-29T06:53:29Z)
- Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO [90.90009491366273]
We study the roots of algorithmic progress in deep policy gradient algorithms through a case study on two popular algorithms.
Specifically, we investigate the consequences of "code-level optimizations."
Our results show that they (a) are responsible for most of PPO's gain in cumulative reward over TRPO, and (b) fundamentally change how RL methods function.
arXiv Detail & Related papers (2020-05-25T16:24:59Z)
- Stable Policy Optimization via Off-Policy Divergence Regularization [50.98542111236381]
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are among the most successful policy gradient approaches in deep reinforcement learning (RL).
We propose a new algorithm which stabilizes the policy improvement through a proximity term that constrains the discounted state-action visitation distribution induced by consecutive policies to be close to one another.
Our proposed method can have a beneficial effect on stability and improve final performance in benchmark high-dimensional control tasks.
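The discounted state-action visitation distribution referenced here has a standard definition, and the proximity term can be read as keeping consecutive policies' distributions close under some divergence D with radius delta (both generic placeholders, not the paper's exact choices):

```latex
d^{\pi}(s, a) \;=\; (1-\gamma) \sum_{t=0}^{\infty} \gamma^{t}\, \Pr\big(s_t = s,\, a_t = a \mid \pi\big),
\qquad
D\!\left(d^{\pi_{k+1}} \,\big\|\, d^{\pi_k}\right) \le \delta.
```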
arXiv Detail & Related papers (2020-03-09T13:05:47Z)
- Population-Guided Parallel Policy Search for Reinforcement Learning [17.360163137926]
A new population-guided parallel learning scheme is proposed to enhance the performance of off-policy reinforcement learning (RL).
In the proposed scheme, multiple identical learners with their own value functions and policies share a common experience replay buffer and collaboratively search for a good policy under the guidance of the best policy information.
arXiv Detail & Related papers (2020-01-09T10:13:57Z)
- Reinforcement Learning via Fenchel-Rockafellar Duality [97.86417365464068]
We review basic concepts of convex duality, focusing on the very general and supremely useful Fenchel-Rockafellar duality.
We summarize how this duality may be applied to a variety of reinforcement learning settings, including policy evaluation or optimization, online or offline learning, and discounted or undiscounted rewards.
arXiv Detail & Related papers (2020-01-07T02:59:59Z)
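For readers connecting this entry to the Fenchel-conjugate construction in the abstract above, the two textbook definitions at the heart of that review are the Fenchel conjugate and the Fenchel-Rockafellar duality (stated for closed convex f, g, a linear operator A, and a suitable constraint qualification):

```latex
f^{*}(y) \;=\; \sup_{x} \big\{ \langle x, y\rangle - f(x) \big\},
\qquad
\inf_{x} \big\{ f(x) + g(Ax) \big\} \;=\; \sup_{y} \big\{ -f^{*}(-A^{*}y) - g^{*}(y) \big\}.
```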