Forward and inverse reinforcement learning sharing network weights and hyperparameters
- URL: http://arxiv.org/abs/2008.07284v2
- Date: Tue, 31 May 2022 11:07:58 GMT
- Title: Forward and inverse reinforcement learning sharing network weights and hyperparameters
- Authors: Eiji Uchibe and Kenji Doya
- Abstract summary: ERIL combines forward and inverse reinforcement learning (RL) under the framework of an entropy-regularized Markov decision process.
A forward RL step minimizes the reverse KL estimated by the inverse RL step.
We show that minimizing the reverse KL divergence is equivalent to finding an optimal policy.
- Score: 3.705785916791345
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes model-free imitation learning named Entropy-Regularized
Imitation Learning (ERIL) that minimizes the reverse Kullback-Leibler (KL)
divergence. ERIL combines forward and inverse reinforcement learning (RL) under
the framework of an entropy-regularized Markov decision process. An inverse RL
step computes the log-ratio between two distributions by evaluating two binary
discriminators. The first discriminator distinguishes the state generated by
the forward RL step from the expert's state. The second discriminator, which is
structured by the theory of entropy regularization, distinguishes the
state-action-next-state tuples generated by the learner from the expert ones.
One notable feature is that the second discriminator shares hyperparameters
with the forward RL, which can be used to control the discriminator's ability.
A forward RL step minimizes the reverse KL estimated by the inverse RL step. We
show that minimizing the reverse KL divergence is equivalent to finding an
optimal policy. Our experimental results on MuJoCo-simulated environments and
vision-based reaching tasks with a robotic arm show that ERIL is more
sample-efficient than the baseline methods. We apply the method to human subjects
performing a pole-balancing task and describe how the estimated reward functions
reveal how each subject achieves her goal.
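As a rough sketch of the objective (the notation here is assumed rather than quoted from the paper): writing $\rho_\pi$ and $\rho_E$ for the learner's and the expert's distributions over state-action-next-state tuples, reverse-KL imitation minimizes
$$\mathrm{KL}(\rho_\pi \,\|\, \rho_E) = \mathbb{E}_{(s,a,s') \sim \rho_\pi}\left[\log \frac{\rho_\pi(s,a,s')}{\rho_E(s,a,s')}\right].$$
The log-ratio inside the expectation is what the inverse RL step estimates with binary discriminators: a discriminator $D$ trained to output 1 on expert samples and 0 on learner samples has optimum $D^*(x) = \rho_E(x)/(\rho_E(x)+\rho_\pi(x))$, so $\log\big(\rho_\pi(x)/\rho_E(x)\big) \approx \log\big((1-D(x))/D(x)\big)$. This is the generic density-ratio trick; in ERIL the second discriminator is additionally structured by the entropy-regularization theory and shares hyperparameters with the forward RL step, as described above.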
Related papers
- Reinforcement Learning from Human Feedback without Reward Inference: Model-Free Algorithm and Instance-Dependent Analysis [16.288866201806382]
We develop a model-free RLHF best policy identification algorithm, called $\mathsf{BSAD}$, without explicit reward model inference.
The algorithm identifies the optimal policy directly from human preference information in a backward manner.
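For context, "reward inference" usually refers to the intermediate step that this entry avoids (standard RLHF notation, assumed here rather than taken from the paper): fit a reward model $r_\theta$ to pairwise preferences under a Bradley-Terry model,
$$P(\tau_1 \succ \tau_2) = \frac{\exp\big(r_\theta(\tau_1)\big)}{\exp\big(r_\theta(\tau_1)\big) + \exp\big(r_\theta(\tau_2)\big)},$$
and then optimize a policy against $r_\theta$. The algorithm above instead identifies the best policy from the preference data directly.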
arXiv Detail & Related papers (2024-06-11T17:01:41Z)
- More Benefits of Being Distributional: Second-Order Bounds for Reinforcement Learning [58.626683114119906]
We show that Distributional Reinforcement Learning (DistRL) can obtain second-order bounds in both online and offline RL.
Our results are the first second-order bounds for low-rank MDPs and for offline RL.
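As a rough illustration of what "second-order" means here (a generic shape with assumed notation, not the paper's theorem): first-order bounds scale with the magnitude of the optimal value, whereas second-order bounds scale with the variance of the returns, roughly
$$\mathrm{Regret}(K) \;\lesssim\; \sqrt{\sum_{k=1}^{K} \mathrm{Var}\big(\text{return of the policy played in episode } k\big)} \;+\; \text{lower-order terms},$$
which can be far smaller than worst-case bounds when the environment is nearly deterministic.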
arXiv Detail & Related papers (2024-02-11T13:25:53Z)
- One-Step Distributional Reinforcement Learning [10.64435582017292]
We present the simpler one-step distributional reinforcement learning (OS-DistrRL) framework.
We show that our approach comes with a unified theory for both policy evaluation and control.
We propose two OS-DistrRL algorithms for which we provide an almost sure convergence analysis.
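For reference, the one-step distributional Bellman evaluation operator that such frameworks build on (standard distributional RL notation, assumed here) maps a return-distribution function $\mu$ to
$$(\mathcal{T}^{\pi}\mu)(s,a) = \mathrm{Law}\big(r(s,a) + \gamma\, G'\big), \qquad s' \sim P(\cdot \mid s,a),\ a' \sim \pi(\cdot \mid s'),\ G' \sim \mu(s',a'),$$
i.e., the distribution of the immediate reward plus the discounted random return drawn at the next state-action pair.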
arXiv Detail & Related papers (2023-04-27T06:57:00Z)
- Policy Evaluation in Distributional LQR [70.63903506291383]
We provide a closed-form expression of the distribution of the random return.
We show that this distribution can be approximated by a finite number of random variables.
Using the approximate return distribution, we propose a zeroth-order policy gradient algorithm for risk-averse LQR.
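To make "the random return" concrete in this setting (standard discounted LQR notation, assumed here rather than quoted from the paper): with dynamics $x_{t+1} = A x_t + B u_t + w_t$, policy $u_t = \pi(x_t)$, and quadratic stage cost, the random return is
$$G^{\pi}(x_0) = \sum_{t=0}^{\infty} \gamma^{t}\big(x_t^{\top} Q\, x_t + u_t^{\top} R\, u_t\big),$$
whose randomness comes from the noise sequence $\{w_t\}$; the entry above characterizes and approximates the distribution of this quantity rather than only its expectation.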
arXiv Detail & Related papers (2023-03-23T20:27:40Z)
- GEC: A Unified Framework for Interactive Decision Making in MDP, POMDP, and Beyond [101.5329678997916]
We study sample efficient reinforcement learning (RL) under the general framework of interactive decision making.
We propose a novel complexity measure, generalized eluder coefficient (GEC), which characterizes the fundamental tradeoff between exploration and exploitation.
We show that RL problems with low GEC form a remarkably rich class, which subsumes low Bellman eluder dimension problems, bilinear class, low witness rank problems, PO-bilinear class, and generalized regular PSR.
arXiv Detail & Related papers (2022-11-03T16:42:40Z)
- Contrastive UCB: Provably Efficient Contrastive Self-Supervised Learning in Online Reinforcement Learning [92.18524491615548]
Contrastive self-supervised learning has been successfully integrated into the practice of (deep) reinforcement learning (RL).
We study how RL can be empowered by contrastive learning in a class of Markov decision processes (MDPs) and Markov games (MGs) with low-rank transitions.
Under the online setting, we propose novel upper confidence bound (UCB)-type algorithms that incorporate such a contrastive loss with online RL algorithms for MDPs or MGs.
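A minimal sketch of the kind of contrastive loss involved, assuming an InfoNCE-style objective over transition tuples with in-batch negatives (the function name, shapes, and use of PyTorch are illustrative assumptions, not the paper's implementation; the UCB-type exploration bonus would be built on top of the learned features):
```python
# Generic InfoNCE-style contrastive loss over transitions (s, a, s'):
# the embedding of (state, action) should score highly against the embedding
# of the observed next state and poorly against other next states in the batch.
import torch
import torch.nn.functional as F

def info_nce_transition_loss(phi_sa: torch.Tensor, psi_next: torch.Tensor,
                             temperature: float = 0.1) -> torch.Tensor:
    """phi_sa: (B, d) embeddings of (state, action) pairs; psi_next: (B, d)
    embeddings of the observed next states. Row i of psi_next is the positive
    for row i of phi_sa; the other rows act as in-batch negatives."""
    logits = phi_sa @ psi_next.T / temperature     # (B, B) similarity matrix
    targets = torch.arange(phi_sa.shape[0])        # positives on the diagonal
    return F.cross_entropy(logits, targets)

# Toy usage with random embeddings standing in for encoder outputs.
if __name__ == "__main__":
    batch, dim = 64, 32
    print(float(info_nce_transition_loss(torch.randn(batch, dim),
                                         torch.randn(batch, dim))))
```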
arXiv Detail & Related papers (2022-07-29T17:29:08Z)
- The Nature of Temporal Difference Errors in Multi-step Distributional Reinforcement Learning [46.85801978792022]
We study the multi-step off-policy learning approach to distributional RL.
We identify a novel notion of path-dependent distributional TD error.
We derive a novel algorithm, Quantile Regression-Retrace, which leads to a deep RL agent QR-DQN-Retrace.
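For reference, the quantile-regression Huber loss at the heart of QR-DQN, to which a Retrace-corrected multi-step target would be supplied (a textbook one-step sketch with assumed tensor shapes, not the paper's full QR-DQN-Retrace algorithm):
```python
# Quantile-regression Huber loss as used in QR-DQN: each predicted quantile is
# penalized asymmetrically, according to its quantile level, for over- or
# under-estimating the target return samples.
import torch

def quantile_huber_loss(pred_quantiles: torch.Tensor,
                        target_samples: torch.Tensor,
                        kappa: float = 1.0) -> torch.Tensor:
    """pred_quantiles: (B, N) predicted quantiles for the chosen actions;
    target_samples: (B, M) target return samples (or target quantiles)."""
    _, num_quantiles = pred_quantiles.shape
    taus = (torch.arange(num_quantiles, dtype=torch.float32) + 0.5) / num_quantiles
    # Pairwise TD errors u[b, i, j] = target_samples[b, j] - pred_quantiles[b, i].
    u = target_samples.unsqueeze(1) - pred_quantiles.unsqueeze(2)
    huber = torch.where(u.abs() <= kappa,
                        0.5 * u.pow(2),
                        kappa * (u.abs() - 0.5 * kappa))
    weight = (taus.view(1, -1, 1) - (u < 0).float()).abs()  # asymmetric quantile weights
    return (weight * huber).mean()
```
In the multi-step off-policy setting described above, the target samples would be formed with Retrace-style importance corrections before being passed to this loss.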
arXiv Detail & Related papers (2022-07-15T16:19:23Z)
- Branching Reinforcement Learning [16.437993672422955]
We propose a novel Branching Reinforcement Learning (Branching RL) model.
We investigate Regret Minimization (RM) and Reward-Free Exploration (RFE) metrics for this model.
This model finds important applications in hierarchical recommendation systems and online advertising.
arXiv Detail & Related papers (2022-02-16T11:19:03Z)
- Supervised Advantage Actor-Critic for Recommender Systems [76.7066594130961]
We propose a negative sampling strategy for training the RL component and combine it with supervised sequential learning.
Based on sampled (negative) actions (items), we can calculate the "advantage" of a positive action over the average case.
We instantiate SNQN and SA2C with four state-of-the-art sequential recommendation models and conduct experiments on two real-world datasets.
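A minimal sketch of the "advantage over the average case" idea, assuming Q-values over an item catalogue and index tensors for the positive and sampled negative items (names and shapes are illustrative, not the SA2C/SNQN reference implementation):
```python
# Advantage of the observed (positive) item over the average Q-value of
# sampled negative items, used to weight the supervised/actor loss.
import torch

def advantage_over_negatives(q_values: torch.Tensor,
                             positive_items: torch.Tensor,
                             negative_items: torch.Tensor) -> torch.Tensor:
    """q_values: (B, num_items) Q-values for the current user states;
    positive_items: (B,) indices of the observed items;
    negative_items: (B, K) indices of sampled negative items."""
    q_pos = q_values.gather(1, positive_items.unsqueeze(1)).squeeze(1)  # (B,)
    q_neg = q_values.gather(1, negative_items).mean(dim=1)              # (B,)
    return q_pos - q_neg
```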
arXiv Detail & Related papers (2021-11-05T12:51:15Z)
- Cross-Trajectory Representation Learning for Zero-Shot Generalization in RL [21.550201956884532]
The goal is to generalize policies learned on a few tasks over a high-dimensional observation space to similar tasks not seen during training.
Many promising approaches to this challenge consider RL as a process of training two functions simultaneously.
We propose Cross-Trajectory Representation Learning (CTRL), a method that runs within an RL agent and conditions its encoder to recognize behavioral similarity in observations.
arXiv Detail & Related papers (2021-06-04T00:43:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.