Inverse Delayed Reinforcement Learning
- URL: http://arxiv.org/abs/2412.02931v1
- Date: Wed, 04 Dec 2024 00:53:55 GMT
- Title: Inverse Delayed Reinforcement Learning
- Authors: Simon Sinong Zhan, Qingyuan Wu, Zhian Ruan, Frank Yang, Philip Wang, Yixuan Wang, Ruochen Jiao, Chao Huang, Qi Zhu,
- Abstract summary: Inverse Reinforcement Learning (IRL) has demonstrated effectiveness in a variety of imitation tasks.
We introduce an IRL framework designed to extract rewarding features from expert trajectories affected by delayed disturbances.
- Score: 10.317802812959808
- License:
- Abstract: Inverse Reinforcement Learning (IRL) has demonstrated effectiveness in a variety of imitation tasks. In this paper, we introduce an IRL framework designed to extract rewarding features from expert trajectories affected by delayed disturbances. Instead of relying on direct observations, our approach employs an efficient off-policy adversarial training framework to derive expert features and recover optimal policies from augmented delayed observations. Empirical evaluations in the MuJoCo environment under diverse delay settings validate the effectiveness of our method. Furthermore, we provide a theoretical analysis showing that recovering expert policies from augmented delayed observations outperforms using direct delayed observations.
Related papers
- Reinforcement Learning from Delayed Observations via World Models [10.298219828693489]
In reinforcement learning settings, agents assume immediate feedback about the effects of their actions after taking them.
In practice, this assumption may not hold true due to physical constraints and can significantly impact the performance of learning algorithms.
We propose leveraging world models, which have shown success in integrating past observations and learning dynamics, to handle observation delays.
arXiv Detail & Related papers (2024-03-18T23:18:27Z) - A Model-Based Approach for Improving Reinforcement Learning Efficiency
Leveraging Expert Observations [9.240917262195046]
We propose an algorithm that automatically adjusts the weights of each component in the augmented loss function.
Experiments on a variety of continuous control tasks demonstrate that the proposed algorithm outperforms various benchmarks.
arXiv Detail & Related papers (2024-02-29T03:53:02Z) - ACE : Off-Policy Actor-Critic with Causality-Aware Entropy Regularization [52.5587113539404]
We introduce a causality-aware entropy term that effectively identifies and prioritizes actions with high potential impacts for efficient exploration.
Our proposed algorithm, ACE: Off-policy Actor-critic with Causality-aware Entropy regularization, demonstrates a substantial performance advantage across 29 diverse continuous control tasks.
arXiv Detail & Related papers (2024-02-22T13:22:06Z) - Hindsight-DICE: Stable Credit Assignment for Deep Reinforcement Learning [11.084321518414226]
We adapt existing importance-sampling ratio estimation techniques for off-policy evaluation to drastically improve the stability and efficiency of so-called hindsight policy methods.
Our hindsight distribution correction facilitates stable, efficient learning across a broad range of environments where credit assignment plagues baseline methods.
arXiv Detail & Related papers (2023-07-21T20:54:52Z) - IL-flOw: Imitation Learning from Observation using Normalizing Flows [28.998176144874193]
We present an algorithm for Inverse Reinforcement Learning (IRL) from expert state observations only.
Our approach decouples reward modelling from policy learning, unlike state-of-the-art adversarial methods.
arXiv Detail & Related papers (2022-05-19T00:05:03Z) - Imitating, Fast and Slow: Robust learning from demonstrations via
decision-time planning [96.72185761508668]
Planning at Test-time (IMPLANT) is a new meta-algorithm for imitation learning.
We demonstrate that IMPLANT significantly outperforms benchmark imitation learning approaches on standard control environments.
arXiv Detail & Related papers (2022-04-07T17:16:52Z) - Off-policy Reinforcement Learning with Optimistic Exploration and
Distribution Correction [73.77593805292194]
We train a separate exploration policy to maximize an approximate upper confidence bound of the critics in an off-policy actor-critic framework.
To mitigate the off-policy-ness, we adapt the recently introduced DICE framework to learn a distribution correction ratio for off-policy actor-critic training.
arXiv Detail & Related papers (2021-10-22T22:07:51Z) - Off-Policy Imitation Learning from Observations [78.30794935265425]
Learning from Observations (LfO) is a practical reinforcement learning scenario from which many applications can benefit.
We propose a sample-efficient LfO approach that enables off-policy optimization in a principled manner.
Our approach is comparable with state-of-the-art locomotion in terms of both sample-efficiency and performance.
arXiv Detail & Related papers (2021-02-25T21:33:47Z) - Learning "What-if" Explanations for Sequential Decision-Making [92.8311073739295]
Building interpretable parameterizations of real-world decision-making on the basis of demonstrated behavior is essential.
We propose learning explanations of expert decisions by modeling their reward function in terms of preferences with respect to "what if" outcomes.
We highlight the effectiveness of our batch, counterfactual inverse reinforcement learning approach in recovering accurate and interpretable descriptions of behavior.
arXiv Detail & Related papers (2020-07-02T14:24:17Z) - On the Loss Landscape of Adversarial Training: Identifying Challenges
and How to Overcome Them [57.957466608543676]
We analyze the influence of adversarial training on the loss landscape of machine learning models.
We show that the adversarial loss landscape is less favorable to optimization, due to increased curvature and more scattered gradients.
arXiv Detail & Related papers (2020-06-15T13:50:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.