Align Your Intents: Offline Imitation Learning via Optimal Transport
- URL: http://arxiv.org/abs/2402.13037v1
- Date: Tue, 20 Feb 2024 14:24:00 GMT
- Title: Align Your Intents: Offline Imitation Learning via Optimal Transport
- Authors: Maksim Bobrin, Nazar Buzun, Dmitrii Krylov, Dmitry V. Dylov
- Abstract summary: We show that an imitating agent can still learn the desired behavior merely from observing the expert.
In our method, AILOT, we introduce a special representation of states in the form of intents that incorporates pairwise spatial distances within the data.
We report that AILOT outperforms state-of-the-art offline imitation learning algorithms on the D4RL benchmarks and improves the performance of other offline RL algorithms on sparse-reward tasks.
- Score: 3.466132008692413
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Offline reinforcement learning (RL) addresses the problem of sequential
decision-making by learning optimal policy through pre-collected data, without
interacting with the environment. As yet, it has remained somewhat impractical,
because one rarely knows the reward explicitly and it is hard to distill it
retrospectively. Here, we show that an imitating agent can still learn the
desired behavior merely from observing the expert, despite the absence of
explicit rewards or action labels. In our method, AILOT (Aligned Imitation
Learning via Optimal Transport), we introduce a special representation of
states in the form of intents that incorporates pairwise spatial distances
within the data. Given such representations, we define an intrinsic reward
function via the optimal transport distance between the expert's and the
agent's trajectories. We report that AILOT outperforms state-of-the-art
offline imitation learning algorithms on the D4RL benchmarks and improves the
performance of other offline RL algorithms on sparse-reward tasks.
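The core mechanism above, matching the agent's trajectory to the expert's via optimal transport and using the (negated) transport cost as an intrinsic reward, can be sketched with a minimal entropy-regularized Sinkhorn solver. This is an illustrative sketch only: it operates on raw 1-D states for simplicity, whereas AILOT computes the distance over learned intent representations, and the function names and hyperparameters here are assumptions, not the authors' implementation.

```python
import math

def sinkhorn_ot(expert, agent, eps=0.1, iters=200):
    """Entropy-regularized OT cost between two state sequences.

    Illustrative sketch: uniform marginals, squared-distance cost on
    scalar states. Returns <P, C>, whose negation can serve as an
    intrinsic reward signal.
    """
    n, m = len(expert), len(agent)
    # Pairwise squared-distance cost matrix between the two trajectories.
    C = [[(e - a) ** 2 for a in agent] for e in expert]
    # Gibbs kernel for entropic regularization.
    K = [[math.exp(-C[i][j] / eps) for j in range(m)] for i in range(n)]
    a, b = 1.0 / n, 1.0 / m  # uniform marginal weights
    u, v = [a] * n, [b] * m
    for _ in range(iters):  # Sinkhorn fixed-point iterations
        u = [a / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport plan P_ij = u_i K_ij v_j; OT cost = sum_ij P_ij C_ij.
    return sum(u[i] * K[i][j] * v[j] * C[i][j]
               for i in range(n) for j in range(m))

expert = [0.0, 1.0, 2.0, 3.0]
near = [0.1, 1.1, 2.1, 3.1]
far = [5.0, 6.0, 7.0, 8.0]
# An agent trajectory closer to the expert incurs a lower transport
# cost, hence a higher intrinsic reward r = -cost.
assert sinkhorn_ot(expert, near) < sinkhorn_ot(expert, far)
```

The key design point is that OT compares whole trajectories as distributions over states, so no per-step action labels or environment rewards are needed.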
Related papers
- SEABO: A Simple Search-Based Method for Offline Imitation Learning [57.2723889718596]
Offline reinforcement learning (RL) has attracted much attention due to its ability to learn from static offline datasets.
We propose a simple yet effective search-based offline IL method, tagged SEABO.
We show that SEABO can achieve competitive performance to offline RL algorithms with ground-truth rewards, given only a single expert trajectory.
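The search-based idea behind such reward labeling can be illustrated with a nearest-neighbor lookup against the expert trajectory: transitions close to the expert get a reward near 1, distant ones near 0. This is a hypothetical sketch of the general approach; SEABO's actual implementation (its search structure and squashing function) differs, and the names and `beta` parameter here are illustrative assumptions.

```python
import math

def nn_reward(query, expert_states, beta=1.0):
    """Reward from distance to the nearest expert state (sketch).

    query / expert_states are feature tuples; the reward decays
    exponentially with Euclidean distance to the closest expert point.
    """
    d = min(math.dist(query, e) for e in expert_states)
    return math.exp(-beta * d)

expert = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
assert nn_reward((1.0, 1.0), expert) == 1.0   # on the expert path
assert nn_reward((5.0, 5.0), expert) < 0.1    # far from the expert
```

Labeling every transition in a static dataset this way turns an unrewarded dataset into one a standard offline RL algorithm can consume.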
arXiv Detail & Related papers (2024-02-06T08:48:01Z)
- A Simple Solution for Offline Imitation from Observations and Examples with Possibly Incomplete Trajectories [122.11358440078581]
Offline imitation is useful in real-world scenarios where arbitrary interactions are costly and expert actions are unavailable.
We propose Trajectory-Aware Learning from Observations (TAILO) to solve MDPs where only task-specific expert states and task-agnostic non-expert state-action pairs are available.
arXiv Detail & Related papers (2023-11-02T15:41:09Z)
- Action-Quantized Offline Reinforcement Learning for Robotic Skill Learning [68.16998247593209]
The offline reinforcement learning (RL) paradigm provides a recipe for converting static behavior datasets into policies that can perform better than the policy that collected the data.
In this paper, we propose an adaptive scheme for action quantization.
We show that several state-of-the-art offline RL methods such as IQL, CQL, and BRAC improve in performance on benchmarks when combined with our proposed discretization scheme.
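One simple way to make action quantization adaptive, rather than uniform, is to place bin edges at quantiles of the actions observed in the dataset, so resolution concentrates where the data actually lies. This is only a minimal illustration of the adaptive-binning idea; the paper's scheme is learned, and the function names below are hypothetical.

```python
def quantile_bins(values, n_bins):
    """Adaptive bin edges: each bin holds roughly equally many
    observed actions, so dense regions get finer resolution."""
    s = sorted(values)
    return [s[(len(s) * k) // n_bins] for k in range(1, n_bins)]

def quantize(x, edges):
    """Map a continuous action to a discrete bin index."""
    return sum(x >= e for e in edges)

# Actions clustered near 0 get fine-grained bins; the sparse tail
# shares one coarse bin.
actions = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.5, 0.9]
edges = quantile_bins(actions, 4)   # [0.03, 0.05, 0.5]
```

A uniform grid over [0, 1] would waste most of its bins on the empty region between 0.1 and 0.5, whereas the quantile edges keep three of four bins in the dense cluster.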
arXiv Detail & Related papers (2023-10-18T06:07:10Z)
- Optimal Transport for Offline Imitation Learning [31.218468923400373]
Offline reinforcement learning (RL) is a promising framework for learning good decision-making policies without needing to interact with the real environment.
We introduce Optimal Transport Reward labeling (OTR), an algorithm that assigns rewards to offline trajectories.
We show that OTR with a single demonstration can consistently match the performance of offline RL with ground-truth rewards.
arXiv Detail & Related papers (2023-03-24T12:45:42Z)
- Benchmarks and Algorithms for Offline Preference-Based Reward Learning [41.676208473752425]
We propose an approach that uses an offline dataset to craft preference queries via pool-based active learning.
Our proposed approach does not require actual physical rollouts or an accurate simulator for either the reward learning or policy optimization steps.
arXiv Detail & Related papers (2023-01-03T23:52:16Z)
- Boosting Offline Reinforcement Learning via Data Rebalancing [104.3767045977716]
Offline reinforcement learning (RL) is challenged by the distributional shift between learning policies and datasets.
We propose a simple yet effective method to boost offline RL algorithms based on the observation that resampling a dataset keeps the distribution support unchanged.
We dub our method ReD (Return-based Data Rebalance), which can be implemented with less than 10 lines of code change and adds negligible running time.
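Return-based rebalancing of this kind can indeed be sketched in a few lines: weight each trajectory by its return when resampling, while keeping every trajectory's sampling probability strictly positive so the dataset's support is unchanged. This is a hypothetical sketch of the idea, not ReD's exact weighting; the function name, `temperature`, and the floor constant are assumptions.

```python
import random

def return_weighted_resample(trajectories, k, temperature=1.0, seed=0):
    """Resample trajectories with probability increasing in return.

    Each trajectory is a list of (state, action, reward) tuples. A
    small positive floor keeps every trajectory sampleable, so the
    support of the data distribution is preserved.
    """
    returns = [sum(r for _, _, r in traj) for traj in trajectories]
    lo = min(returns)
    weights = [(g - lo) / temperature + 1e-3 for g in returns]
    rng = random.Random(seed)  # seeded for reproducibility
    return rng.choices(trajectories, weights=weights, k=k)

good = [(0, 0, 1.0)] * 5   # return 5.0
bad = [(0, 0, 0.0)] * 5    # return 0.0
sample = return_weighted_resample([good, bad], k=1000)
# The high-return trajectory dominates the rebalanced dataset,
# but the low-return one remains in the support.
assert sample.count(good) > sample.count(bad)
```

Because the change touches only the data loader, it composes with any offline RL algorithm and adds negligible running time.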
arXiv Detail & Related papers (2022-10-17T16:34:01Z)
- Reinforcement Learning with Sparse Rewards using Guidance from Offline Demonstration [9.017416068706579]
A major challenge in real-world reinforcement learning (RL) is the sparsity of reward feedback.
We develop an algorithm that exploits the offline demonstration data generated by a sub-optimal behavior policy.
We demonstrate the superior performance of our algorithm over state-of-the-art approaches.
arXiv Detail & Related papers (2022-02-09T18:45:40Z)
- RvS: What is Essential for Offline RL via Supervised Learning? [77.91045677562802]
Recent work has shown that supervised learning alone, without temporal difference (TD) learning, can be remarkably effective for offline RL.
In every environment suite we consider, simply maximizing likelihood with a two-layer feedforward network is competitive.
The results also probe the limits of existing RvS methods, which are comparatively weak on random data.
arXiv Detail & Related papers (2021-12-20T18:55:16Z)
- Representation Matters: Offline Pretraining for Sequential Decision Making [27.74988221252854]
In this paper, we consider a slightly different approach to incorporating offline data into sequential decision-making.
We find that the use of pretraining with unsupervised learning objectives can dramatically improve the performance of policy learning algorithms.
arXiv Detail & Related papers (2021-02-11T02:38:12Z)
- OPAL: Offline Primitive Discovery for Accelerating Offline Reinforcement Learning [107.6943868812716]
In many practical applications, the situation is reversed: an agent may have access to large amounts of undirected offline experience data, while access to the online environment is severely limited.
Our main insight is that, when presented with offline data composed of a variety of behaviors, an effective way to leverage this data is to extract a continuous space of recurring and temporally extended primitive behaviors.
In addition to benefiting offline policy optimization, we show that performing offline primitive learning in this way can also be leveraged for improving few-shot imitation learning.
arXiv Detail & Related papers (2020-10-26T14:31:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.