Related papers: Rethinking Adversarial Inverse Reinforcement Learning: From the Angles of Policy Imitation and Transferable Reward Recovery

Rethinking Adversarial Inverse Reinforcement Learning: From the Angles of Policy Imitation and Transferable Reward Recovery

URL: http://arxiv.org/abs/2410.07643v1
Date: Thu, 10 Oct 2024 06:21:32 GMT
Title: Rethinking Adversarial Inverse Reinforcement Learning: From the Angles of Policy Imitation and Transferable Reward Recovery
Authors: Yangchun Zhang, Wang Zhou, Yirui Zhou,
Abstract summary: adversarial inverse reinforcement learning (AIRL) serves as a foundational approach to providing comprehensive and transferable task descriptions. This paper reexamines AIRL in light of the unobservable transition matrix or limited informative priors. We show that AIRL can disentangle rewards for effective transfer with high probability, irrespective of specific conditions.
Score: 1.1394969272703013
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In scenarios of inverse reinforcement learning (IRL) with a single expert, adversarial inverse reinforcement learning (AIRL) serves as a foundational approach to providing comprehensive and transferable task descriptions by restricting the reward class, e.g., to state-only rewards. However, AIRL faces practical challenges, primarily stemming from the difficulty of verifying the unobservable transition matrix - often encountered in practice - under the specific conditions necessary for effective transfer. This paper reexamines AIRL in light of the unobservable transition matrix or limited informative priors. By applying random matrix theory (RMT), we demonstrate that AIRL can disentangle rewards for effective transfer with high probability, irrespective of specific conditions. This perspective reframes inadequate transfer in certain contexts. Specifically, it is attributed to the selection problem of the reinforcement learning algorithm employed by AIRL, which is characterized by training variance. Based on this insight, we propose a hybrid framework that integrates on-policy proximal policy optimization (PPO) in the source environment with off-policy soft actor-critic (SAC) in the target environment, leading to significant improvements in reward transfer effectiveness.

Related papers

Principled Multimodal Representation Learning [70.60542106731813]
Multimodal representation learning seeks to create a unified representation space by integrating diverse data modalities.<n>Recent advances have investigated the simultaneous alignment of multiple modalities, yet several challenges remain.<n>We propose Principled Multimodal Representation Learning (PMRL), a novel framework that achieves simultaneous alignment of multiple modalities.
arXiv Detail & Related papers (2025-07-23T09:12:25Z)
Global Convergence of Iteratively Reweighted Least Squares for Robust Subspace Recovery [19.37238379592233]
Iteratively Reweighted Least Squares (IRLS) is an elegant and empirically effective approach to subspace estimation.<n>This paper establishes that, under deterministic conditions, a variant IRLS with dynamic regularization converges linearly to the underlying subspace.<n>We extend these guarantees to subspace estimation, a setting that lacks prior recovery theory.
arXiv Detail & Related papers (2025-06-25T15:23:32Z)
Representation-based Reward Modeling for Efficient Safety Alignment of Large Language Model [84.00480999255628]
Reinforcement Learning algorithms for safety alignment of Large Language Models (LLMs) encounter the challenge of distribution shift. Current approaches typically address this issue through online sampling from the target policy. We propose a new framework that leverages the model's intrinsic safety judgment capability to extract reward signals.
arXiv Detail & Related papers (2025-03-13T06:40:34Z)
Gradients can train reward models: An Empirical Risk Minimization Approach for Offline Inverse RL and Dynamic Discrete Choice Model [9.531082746970286]
We study the problem of estimating Dynamic Choice (DDC) models, also known as offline Maximum Entropy-Regularized Inverse Reinforcement Learning ( offline MaxEnt-IRL) in machine learning. The objective is to recover reward or $Q*$ functions that govern agent behavior from offline behavior data. We propose a globally convergent gradient-based method for solving these problems without the restrictive assumption of linearly parameterized rewards.
arXiv Detail & Related papers (2025-02-19T22:22:20Z)
Robust Offline Reinforcement Learning with Linearly Structured $f$-Divergence Regularization [10.465789490644031]
We propose a novel framework for robust regularized Markov decision process ($d$-RRMDP) For the offline RL setting, we develop a family of algorithms, Robust Regularized Pessimistic Value Iteration (R2PVI)
arXiv Detail & Related papers (2024-11-27T18:57:03Z)
Rethinking Inverse Reinforcement Learning: from Data Alignment to Task Alignment [7.477559660351106]
imitation learning (IL) algorithms use inverse reinforcement learning (IRL) to infer a reward function that aligns with a demonstration. We propose a novel framework for IRL-based IL that prioritizes task alignment over conventional data alignment.
arXiv Detail & Related papers (2024-10-31T07:08:14Z)
Preference-Based Multi-Agent Reinforcement Learning: Data Coverage and Algorithmic Techniques [65.55451717632317]
We study Preference-Based Multi-Agent Reinforcement Learning (PbMARL) We identify the Nash equilibrium from a preference-only offline dataset in general-sum games. Our findings underscore the multifaceted approach required for PbMARL.
arXiv Detail & Related papers (2024-09-01T13:14:41Z)
Coherent Soft Imitation Learning [17.345411907902932]
Imitation learning methods seek to learn from an expert either through behavioral cloning (BC) of the policy or inverse reinforcement learning (IRL) of the reward. This work derives an imitation method that captures the strengths of both BC and IRL.
arXiv Detail & Related papers (2023-05-25T21:54:22Z)
DIRECT: Learning from Sparse and Shifting Rewards using Discriminative Reward Co-Training [13.866486498822228]
We propose discriminative reward co-training as an extension to deep reinforcement learning algorithms. A discriminator network is trained concurrently to the policy to distinguish between trajectories generated by the current policy and beneficial trajectories generated by previous policies. Our results show that DIRECT outperforms state-of-the-art algorithms in sparse- and shifting-reward environments.
arXiv Detail & Related papers (2023-01-18T10:42:00Z)
Learning Transferable Reward for Query Object Localization with Policy Adaptation [49.994989590997655]
We learn a transferable reward signal formulated using the exemplary set by ordinal metric learning. Our proposed method enables test-time policy adaptation to new environments where the reward signals are not readily available.
arXiv Detail & Related papers (2022-02-24T22:52:14Z)
A Regularized Implicit Policy for Offline Reinforcement Learning [54.7427227775581]
offline reinforcement learning enables learning from a fixed dataset, without further interactions with the environment. We propose a framework that supports learning a flexible yet well-regularized fully-implicit policy. Experiments and ablation study on the D4RL dataset validate our framework and the effectiveness of our algorithmic designs.
arXiv Detail & Related papers (2022-02-19T20:22:04Z)
Off-policy Reinforcement Learning with Optimistic Exploration and Distribution Correction [73.77593805292194]
We train a separate exploration policy to maximize an approximate upper confidence bound of the critics in an off-policy actor-critic framework. To mitigate the off-policy-ness, we adapt the recently introduced DICE framework to learn a distribution correction ratio for off-policy actor-critic training.
arXiv Detail & Related papers (2021-10-22T22:07:51Z)
OPIRL: Sample Efficient Off-Policy Inverse Reinforcement Learning via Distribution Matching [12.335788185691916]
Inverse Reinforcement Learning (IRL) is attractive in scenarios where reward engineering can be tedious. Prior IRL algorithms use on-policy transitions, which require intensive sampling from the current policy for stable and optimal performance. We present Off-Policy Inverse Reinforcement Learning (OPIRL), which adopts off-policy data distribution instead of on-policy.
arXiv Detail & Related papers (2021-09-09T14:32:26Z)
Policy Gradient Bayesian Robust Optimization for Imitation Learning [49.881386773269746]
We derive a novel policy gradient-style robust optimization approach, PG-BROIL, to balance expected performance and risk. Results suggest PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse.
arXiv Detail & Related papers (2021-06-11T16:49:15Z)
Policy Mirror Descent for Regularized Reinforcement Learning: A Generalized Framework with Linear Convergence [60.20076757208645]
This paper proposes a general policy mirror descent (GPMD) algorithm for solving regularized RL. We demonstrate that our algorithm converges linearly over an entire range learning rates, in a dimension-free fashion, to the global solution.
arXiv Detail & Related papers (2021-05-24T02:21:34Z)
Off-Policy Imitation Learning from Observations [78.30794935265425]
Learning from Observations (LfO) is a practical reinforcement learning scenario from which many applications can benefit. We propose a sample-efficient LfO approach that enables off-policy optimization in a principled manner. Our approach is comparable with state-of-the-art locomotion in terms of both sample-efficiency and performance.
arXiv Detail & Related papers (2021-02-25T21:33:47Z)
Supercharging Imbalanced Data Learning With Energy-based Contrastive Representation Transfer [72.5190560787569]
In computer vision, learning from long tailed datasets is a recurring theme, especially for natural image datasets. Our proposal posits a meta-distributional scenario, where the data generating mechanism is invariant across the label-conditional feature distributions. This allows us to leverage a causal data inflation procedure to enlarge the representation of minority classes.
arXiv Detail & Related papers (2020-11-25T00:13:11Z)
Bayesian Robust Optimization for Imitation Learning [34.40385583372232]
Inverse reinforcement learning can enable generalization to new states by learning a parameterized reward function. Existing safe imitation learning approaches based on IRL deal with this uncertainty using a maxmin framework. BROIL provides a natural way to interpolate between return-maximizing and risk-minimizing behaviors.
arXiv Detail & Related papers (2020-07-24T01:52:11Z)
Efficient Deep Reinforcement Learning via Adaptive Policy Transfer [50.51637231309424]
Policy Transfer Framework (PTF) is proposed to accelerate Reinforcement Learning (RL) Our framework learns when and which source policy is the best to reuse for the target policy and when to terminate it. Experimental results show it significantly accelerates the learning process and surpasses state-of-the-art policy transfer methods.
arXiv Detail & Related papers (2020-02-19T07:30:57Z)
Target-Embedding Autoencoders for Supervised Representation Learning [111.07204912245841]
This paper analyzes a framework for improving generalization in a purely supervised setting, where the target space is high-dimensional. We motivate and formalize the general framework of target-embedding autoencoders (TEA) for supervised prediction, learning intermediate latent representations jointly optimized to be both predictable from features as well as predictive of targets.
arXiv Detail & Related papers (2020-01-23T02:37:10Z)

This list is automatically generated from the titles and abstracts of the papers in this site.