Distance-rank Aware Sequential Reward Learning for Inverse Reinforcement
Learning with Sub-optimal Demonstrations
- URL: http://arxiv.org/abs/2310.08823v1
- Date: Fri, 13 Oct 2023 02:38:35 GMT
- Title: Distance-rank Aware Sequential Reward Learning for Inverse Reinforcement
Learning with Sub-optimal Demonstrations
- Authors: Lu Li, Yuxin Pan, Ruobing Chen, Jie Liu, Zilin Wang, Yu Liu, Zhiheng
Li
- Abstract summary: Inverse reinforcement learning (IRL) aims to explicitly infer an underlying reward function based on collected expert demonstrations.
We introduce the Distance-rank Aware Sequential Reward Learning (DRASRL) framework.
Our framework demonstrates significant performance improvements over previous SOTA methods.
- Score: 25.536792010283566
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Inverse reinforcement learning (IRL) aims to explicitly infer an underlying
reward function based on collected expert demonstrations. Considering that
obtaining expert demonstrations can be costly, the focus of current IRL
techniques is on learning a better-than-demonstrator policy using a reward
function derived from sub-optimal demonstrations. However, existing IRL
algorithms primarily tackle the challenge of trajectory ranking ambiguity when
learning the reward function. They overlook the crucial role of considering the
degree of difference between trajectories in terms of their returns, which is
essential for further removing reward ambiguity. Additionally, it is important
to note that the reward of a single transition is heavily influenced by the
context information within the trajectory. To address these issues, we
introduce the Distance-rank Aware Sequential Reward Learning (DRASRL)
framework. Unlike existing approaches, DRASRL takes into account both the
ranking of trajectories and the degrees of dissimilarity between them to
collaboratively eliminate reward ambiguity when learning a sequence of
contextually informed reward signals. Specifically, we leverage the distance
between policies, from which the trajectories are generated, as a measure to
quantify the degree of differences between traces. This distance-aware
information is then used to infer embeddings in the representation space for
reward learning, employing the contrastive learning technique. Meanwhile, we
integrate the pairwise ranking loss function to incorporate ranking information
into the latent features. Moreover, we resort to the Transformer architecture
to capture the contextual dependencies within the trajectories in the latent
space, leading to more accurate reward estimation. Through extensive
experimentation, our DRASRL framework demonstrates significant performance
improvements over previous SOTA methods.
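The abstract names a pairwise ranking loss but does not give its form on this page. As a minimal sketch, assuming the common Bradley-Terry formulation used in preference-based reward learning (an assumption, not necessarily the exact loss in DRASRL), the loss for one ranked pair of trajectories could look like the following. The function name `pairwise_ranking_loss` and its arguments are hypothetical and chosen for illustration only.

```python
import numpy as np


def pairwise_ranking_loss(rewards_worse, rewards_better):
    """Bradley-Terry style loss for one ranked trajectory pair (illustrative).

    Each argument is a sequence of predicted per-step rewards. The second
    trajectory is assumed to be ranked above the first.
    """
    # Predicted return of each trajectory is the sum of its per-step rewards.
    return_worse = float(np.sum(rewards_worse))
    return_better = float(np.sum(rewards_better))
    # -log( exp(R_better) / (exp(R_worse) + exp(R_better)) ),
    # written with logaddexp for numerical stability.
    return float(np.logaddexp(return_worse, return_better) - return_better)
```

Under this assumed form, the loss equals log 2 when both predicted returns are equal and shrinks toward zero as the preferred trajectory's predicted return grows relative to the other. It uses only the ordering of the pair, which is why the abstract argues for adding distance information on top of it.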
Related papers
- Inverse-RLignment: Inverse Reinforcement Learning from Demonstrations for LLM Alignment [62.05713042908654]
We introduce Alignment from Demonstrations (AfD), a novel approach leveraging high-quality demonstration data to overcome these challenges.
We formalize AfD within a sequential decision-making framework, highlighting its unique challenge of missing reward signals.
Practically, we propose a computationally efficient algorithm that extrapolates over a tailored reward model for AfD.
arXiv Detail & Related papers (2024-05-24T15:13:53Z)
- Align Your Intents: Offline Imitation Learning via Optimal Transport [3.1728695158666396]
We show that an imitating agent can still learn the desired behavior merely from observing the expert.
In our method, AILOT, we use a special representation of states in the form of intents that incorporate pairwise spatial distances within the data.
We report that AILOT outperforms state-of-the-art offline imitation learning algorithms on D4RL benchmarks.
arXiv Detail & Related papers (2024-02-20T14:24:00Z)
- Transductive Reward Inference on Graph [53.003245457089406]
We develop a reward inference method based on the contextual properties of information propagation on graphs.
We leverage both the available data and limited reward annotations to construct a reward propagation graph.
We employ the constructed graph for transductive reward inference, thereby estimating rewards for unlabelled data.
arXiv Detail & Related papers (2024-02-06T03:31:28Z)
- A Study of Forward-Forward Algorithm for Self-Supervised Learning [65.268245109828]
We study the performance of forward-forward vs. backpropagation for self-supervised representation learning.
Our main finding is that while the forward-forward algorithm performs comparably to backpropagation during (self-supervised) training, the transfer performance is significantly lagging behind in all the studied settings.
arXiv Detail & Related papers (2023-09-21T10:14:53Z)
- Curricular Subgoals for Inverse Reinforcement Learning [21.038691420095525]
Inverse Reinforcement Learning (IRL) aims to reconstruct the reward function from expert demonstrations to facilitate policy learning.
Existing IRL methods mainly focus on learning global reward functions to minimize the trajectory difference between the imitator and the expert.
We propose a novel Curricular Subgoal-based Inverse Reinforcement Learning framework that explicitly disentangles one task with several local subgoals to guide agent imitation.
arXiv Detail & Related papers (2023-06-14T04:06:41Z)
- CLARE: Conservative Model-Based Reward Learning for Offline Inverse Reinforcement Learning [26.05184273238923]
This work aims to tackle a major challenge in offline Inverse Reinforcement Learning (IRL)
We devise a principled algorithm (namely CLARE) that solves offline IRL efficiently via integrating "conservatism" into a learned reward function.
Our theoretical analysis provides an upper bound on the return gap between the learned policy and the expert policy.
arXiv Detail & Related papers (2023-02-09T17:16:29Z)
- Basis for Intentions: Efficient Inverse Reinforcement Learning using Past Experience [89.30876995059168]
This paper addresses the problem of IRL -- inferring the reward function of an agent from observing its behavior.
arXiv Detail & Related papers (2022-08-09T17:29:49Z)
- Off-Dynamics Inverse Reinforcement Learning from Hetero-Domain [11.075036222901417]
We propose an approach for inverse reinforcement learning from hetero-domain which learns a reward function in the simulator, drawing on the demonstrations from the real world.
The intuition behind the method is that the reward function should not only be oriented to imitate the experts, but should encourage actions adjusted for the dynamics difference between the simulator and the real world.
arXiv Detail & Related papers (2021-10-21T19:23:15Z)
- PsiPhi-Learning: Reinforcement Learning with Demonstrations using Successor Features and Inverse Temporal Difference Learning [102.36450942613091]
We propose an inverse reinforcement learning algorithm, called inverse temporal difference learning (ITD).
We show how to seamlessly integrate ITD with learning from online environment interactions, arriving at a novel algorithm for reinforcement learning with demonstrations, called $\Psi\Phi$-learning.
arXiv Detail & Related papers (2021-02-24T21:12:09Z)
- f-IRL: Inverse Reinforcement Learning via State Marginal Matching [13.100127636586317]
We propose a method for learning the reward function (and the corresponding policy) to match the expert state density.
We present an algorithm, f-IRL, that recovers a stationary reward function from the expert density by gradient descent.
Our method outperforms adversarial imitation learning methods in terms of sample efficiency and the required number of expert trajectories.
arXiv Detail & Related papers (2020-11-09T19:37:48Z)
- Learning Invariant Representations for Reinforcement Learning without Reconstruction [98.33235415273562]
We study how representation learning can accelerate reinforcement learning from rich observations, such as images, without relying either on domain knowledge or pixel-reconstruction.
Bisimulation metrics quantify behavioral similarity between states in continuous MDPs.
We demonstrate the effectiveness of our method at disregarding task-irrelevant information using modified visual MuJoCo tasks.
arXiv Detail & Related papers (2020-06-18T17:59:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.