Reinforcement Learning via Fenchel-Rockafellar Duality
- URL: http://arxiv.org/abs/2001.01866v2
- Date: Thu, 9 Jan 2020 19:08:09 GMT
- Title: Reinforcement Learning via Fenchel-Rockafellar Duality
- Authors: Ofir Nachum, Bo Dai
- Abstract summary: We review basic concepts of convex duality, focusing on the very general and supremely useful Fenchel-Rockafellar duality.
We summarize how this duality may be applied to a variety of reinforcement learning settings, including policy evaluation or optimization, online or offline learning, and discounted or undiscounted rewards.
- Score: 97.86417365464068
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We review basic concepts of convex duality, focusing on the very general and
supremely useful Fenchel-Rockafellar duality. We summarize how this duality may
be applied to a variety of reinforcement learning (RL) settings, including
policy evaluation or optimization, online or offline learning, and discounted
or undiscounted rewards. The derivations yield a number of intriguing results,
including the ability to perform policy evaluation and on-policy policy
gradient with behavior-agnostic offline data and methods to learn a policy via
max-likelihood optimization. Although many of these results have appeared
previously in various forms, we provide a unified treatment and perspective on
these results, which we hope will enable researchers to better use and apply
the tools of convex duality to make further progress in RL.
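For reference, the central tool admits a compact statement. With the Fenchel conjugate f*(y) = sup_x ⟨x, y⟩ − f(x), one standard form of the duality is sketched below (sign conventions vary slightly across texts, including this paper); the policy-evaluation linear program illustrates how RL problems fit the template.

```latex
% Fenchel conjugate:
f^*(y) \;=\; \sup_{x}\; \langle x, y \rangle - f(x).

% Fenchel-Rockafellar duality: for convex, lower semi-continuous f, g
% and a linear operator A with adjoint A^* (under a constraint
% qualification guaranteeing strong duality),
\min_{x}\; f(x) + g(Ax) \;=\; \max_{y}\; -f^*(-A^* y) - g^*(y).

% One RL instance: policy evaluation as a linear program over
% state-action visitations d, subject to the Bellman flow constraint
\rho(\pi) = \mathbb{E}_{(s,a) \sim d}[r(s,a)]
\quad \text{s.t.} \quad
d(s,a) = (1-\gamma)\,\mu_0(s)\,\pi(a \mid s)
       + \gamma\,\pi(a \mid s) \sum_{s',a'} P(s \mid s', a')\, d(s', a').
```

Dualizing the flow constraint yields an optimization over Q-functions whose expectations can be estimated from an arbitrary offline distribution, which is how the behavior-agnostic offline policy evaluation results referenced in the abstract arise.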
Related papers
- Robust off-policy Reinforcement Learning via Soft Constrained Adversary [0.7583052519127079]
We introduce an f-divergence constrained problem that incorporates a prior knowledge distribution.
We derive two typical attacks and their corresponding robust learning frameworks.
Results demonstrate that our proposed methods achieve excellent performance in sample-efficient off-policy RL.
arXiv Detail & Related papers (2024-08-31T11:13:33Z)
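As a schematic illustration of the entry above (this is an assumption about the general shape of such objectives, not the paper's exact formulation), a soft f-divergence-constrained adversary can be written as a penalized inner maximization over perturbed distributions q, anchored to a prior knowledge distribution p:

```latex
% Schematic soft-constrained adversary (illustrative only): the
% adversary perturbs the distribution q away from a prior p, paying
% an f-divergence penalty with weight lambda.
\min_{\pi}\; \max_{q}\;
\mathbb{E}_{q}\big[\ell(\pi)\big] \;-\; \lambda\, D_f(q \,\|\, p),
\qquad
D_f(q \,\|\, p) = \mathbb{E}_{x \sim p}\!\left[ f\!\left(\frac{q(x)}{p(x)}\right) \right].
```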
- Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data [102.16105233826917]
Learning from preference labels plays a crucial role in fine-tuning large language models.
There are several distinct approaches for preference fine-tuning, including supervised learning, on-policy reinforcement learning (RL), and contrastive learning.
arXiv Detail & Related papers (2024-04-22T17:20:18Z)
- Statistically Efficient Variance Reduction with Double Policy Estimation for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning [53.97273491846883]
We propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation.
We validate our method on multiple OpenAI Gym tasks from the D4RL benchmark.
arXiv Detail & Related papers (2023-08-28T20:46:07Z)
- Provable Offline Preference-Based Reinforcement Learning [95.00042541409901]
We investigate the problem of offline Preference-based Reinforcement Learning (PbRL) with human feedback.
We consider the general reward setting where the reward can be defined over the whole trajectory.
We introduce a new single-policy concentrability coefficient, which can be upper bounded by the per-trajectory concentrability.
arXiv Detail & Related papers (2023-05-24T07:11:26Z)
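As a rough illustration of the coefficient named above (notation is ours, not the paper's), concentrability measures how far the target policy π strays from the offline data distribution μ; since a marginal density ratio can never exceed the supremum of the trajectory density ratio, the single-policy coefficient is upper bounded by the per-trajectory one:

```latex
% Illustrative definitions (not verbatim from the paper):
% single-policy (state-action) concentrability w.r.t. data dist. mu,
C_{\pi} \;=\; \sup_{s,a}\; \frac{d^{\pi}(s,a)}{\mu(s,a)};
% per-trajectory concentrability over whole trajectories tau,
C_{\mathrm{traj}} \;=\; \sup_{\tau}\; \frac{P^{\pi}(\tau)}{P^{\mu}(\tau)};
% marginals are averages of the joint, hence C_pi <= C_traj.
```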
- Offline Reinforcement Learning with Instrumental Variables in Confounded Markov Decision Processes [93.61202366677526]
We study offline reinforcement learning (RL) in the face of unmeasured confounders.
We propose various policy learning methods with finite-sample suboptimality guarantees for finding the optimal in-class policy.
arXiv Detail & Related papers (2022-09-18T22:03:55Z)
- Deterministic and Discriminative Imitation (D2-Imitation): Revisiting Adversarial Imitation for Sample Efficiency [61.03922379081648]
We propose a sample-efficient off-policy approach that requires no adversarial training or min-max optimization.
Our empirical results show that D2-Imitation is effective in achieving good sample efficiency, outperforming several off-policy extensions of adversarial imitation.
arXiv Detail & Related papers (2021-12-11T19:36:19Z)
- Combing Policy Evaluation and Policy Improvement in a Unified f-Divergence Framework [33.90259939664709]
We study the f-divergence between the learning policy and the sampling policy and derive a novel DRL framework, termed f-Divergence Reinforcement Learning (FRL).
The FRL framework achieves two advantages: (1) policy evaluation and policy improvement are derived simultaneously from the f-divergence; (2) the overestimation issue of the value function is alleviated.
arXiv Detail & Related papers (2021-09-24T10:20:46Z)
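The entry above connects directly to the survey's theme: the f-divergence between a learning policy π and a sampling policy μ admits a variational representation through the Fenchel conjugate f*, which is exactly the kind of duality the main paper builds on (notation illustrative, a standard fact rather than the paper's exact derivation):

```latex
% f-divergence and its variational (Fenchel) representation:
D_f(\pi \,\|\, \mu)
\;=\; \mathbb{E}_{x \sim \mu}\!\left[ f\!\left(\frac{\pi(x)}{\mu(x)}\right) \right]
\;=\; \sup_{g}\; \mathbb{E}_{x \sim \pi}[\,g(x)\,]
      \;-\; \mathbb{E}_{x \sim \mu}[\,f^*(g(x))\,].
```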
- Probabilistic Mixture-of-Experts for Efficient Deep Reinforcement Learning [7.020079427649125]
We show that grasping distinguishable skills for tasks with non-unique optima can be essential for further improving an agent's learning efficiency and performance.
We propose a probabilistic mixture-of-experts (PMOE) for multimodal policies, together with a novel gradient estimator for the non-differentiability problem.
arXiv Detail & Related papers (2021-04-19T08:21:56Z)
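To make the mixture-of-experts policy idea concrete, below is a minimal NumPy sketch of a Gaussian mixture policy. The fixed toy weights, means, and standard deviations stand in for learned, state-conditioned networks, and all function names are ours, not the paper's.

```python
import numpy as np

def sample_mixture_policy(weights, means, stds, rng):
    """Sample an action from a Gaussian mixture policy.

    weights: (K,) mixing probabilities over K experts (sum to 1)
    means, stds: (K, action_dim) per-expert Gaussian parameters
    """
    k = rng.choice(len(weights), p=weights)   # pick an expert
    return rng.normal(means[k], stds[k])      # sample from that expert

def mixture_log_prob(action, weights, means, stds):
    """Log-density of the mixture at `action` (log-sum-exp over experts)."""
    # per-expert diagonal-Gaussian log-densities, summed over action dims
    comp = -0.5 * (((action - means) / stds) ** 2
                   + np.log(2 * np.pi * stds ** 2)).sum(axis=1)
    comp += np.log(weights)
    m = comp.max()
    return m + np.log(np.exp(comp - m).sum())  # numerically stable

rng = np.random.default_rng(0)
weights = np.array([0.5, 0.5])       # two "experts"
means = np.array([[-1.0], [1.0]])    # a bimodal (multimodal) action space
stds = np.full((2, 1), 0.1)
a = sample_mixture_policy(weights, means, stds, rng)
print(a, mixture_log_prob(a, weights, means, stds))
```

The "novel gradient estimator" mentioned in the summary addresses differentiating through the discrete expert choice; this sketch sidesteps that issue by simply sampling.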
This list is automatically generated from the titles and abstracts of the papers on this site.