On Pathologies in KL-Regularized Reinforcement Learning from Expert
Demonstrations
- URL: http://arxiv.org/abs/2212.13936v1
- Date: Wed, 28 Dec 2022 16:29:09 GMT
- Title: On Pathologies in KL-Regularized Reinforcement Learning from Expert
Demonstrations
- Authors: Tim G. J. Rudner and Cong Lu and Michael A. Osborne and Yarin Gal and
Yee Whye Teh
- Abstract summary: We show that KL-regularized reinforcement learning with behavioral reference policies can suffer from pathological training dynamics.
We show that the pathology can be remedied by non-parametric behavioral reference policies.
- Score: 79.49929463310588
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: KL-regularized reinforcement learning from expert demonstrations has proved
successful in improving the sample efficiency of deep reinforcement learning
algorithms, allowing them to be applied to challenging physical real-world
tasks. However, we show that KL-regularized reinforcement learning with
behavioral reference policies derived from expert demonstrations can suffer
from pathological training dynamics that can lead to slow, unstable, and
suboptimal online learning. We show empirically that the pathology occurs for
commonly chosen behavioral policy classes and demonstrate its impact on sample
efficiency and online policy performance. Finally, we show that the pathology
can be remedied by non-parametric behavioral reference policies and that this
allows KL-regularized reinforcement learning to significantly outperform
state-of-the-art approaches on a variety of challenging locomotion and
dexterous hand manipulation tasks.
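To make the objective concrete: in KL-regularized RL from demonstrations, the agent maximizes reward minus a penalty of the form alpha * KL(pi(.|s) || pi_0(.|s)), where pi_0 is a behavioral reference policy fit to the expert data. The sketch below (plain NumPy; the Gaussian policies, the state, and the temperature alpha are illustrative assumptions, not the paper's setup or code) evaluates this penalty in closed form for one-dimensional Gaussian policies and shows how it can blow up when the reference policy reports a near-zero predictive standard deviation at a state, one way in which the choice of behavioral policy class can produce the unstable training dynamics the abstract describes.

```python
# Hedged sketch (not the paper's code): the per-state KL penalty in
# KL-regularized RL, J(pi) = E[ r(s, a) - alpha * KL(pi(.|s) || pi_0(.|s)) ],
# computed in closed form for 1-D Gaussian policies.
import numpy as np

def gaussian_kl(mu_p, sigma_p, mu_q, sigma_q):
    """KL( N(mu_p, sigma_p^2) || N(mu_q, sigma_q^2) ), closed form."""
    return (np.log(sigma_q / sigma_p)
            + (sigma_p**2 + (mu_p - mu_q)**2) / (2.0 * sigma_q**2)
            - 0.5)

alpha = 0.1                  # illustrative KL temperature (assumed)
mu_pi, sigma_pi = 0.3, 0.5   # current online policy at some state (assumed)

# Behavioral reference policy pi_0 fit to demonstrations: as its predictive
# standard deviation at a state off the demonstration data shrinks,
# the regularization penalty explodes.
for sigma_0 in [1.0, 0.1, 0.01, 1e-4]:
    penalty = alpha * gaussian_kl(mu_pi, sigma_pi, mu_q=0.0, sigma_q=sigma_0)
    print(f"sigma_0 = {sigma_0:8.4f}  ->  alpha * KL = {penalty:12.2f}")
```

A reference policy whose predictive uncertainty stays well calibrated away from the demonstrations keeps this penalty bounded, which is consistent with the abstract's finding that non-parametric behavioral reference policies remedy the pathology.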
Related papers
- Adaptive Paradigm Synergy: Can a Cross-Paradigm Objective Enhance Long-Tailed Learning? [16.110763554788445]
Self-supervised learning (SSL) has achieved impressive results across several computer vision tasks, even rivaling supervised methods.
However, its performance degrades on real-world datasets with long-tailed distributions because it struggles to account for the inherent class imbalance.
We introduce Adaptive Paradigm Synergy (APS), a cross-paradigm objective that seeks to unify the strengths of both paradigms.
arXiv Detail & Related papers (2024-10-30T10:25:22Z) - REACT: Revealing Evolutionary Action Consequence Trajectories for Interpretable Reinforcement Learning [7.889696505137217]
We propose Revealing Evolutionary Action Consequence Trajectories (REACT) to enhance the interpretability of Reinforcement Learning (RL).
In contrast to the prevalent practice of evaluating RL models based on the optimal behavior learned during training, we posit that considering a range of edge-case trajectories provides a more comprehensive understanding of their inherent behavior.
Our results highlight its effectiveness in revealing nuanced aspects of RL models' behavior beyond optimal performance, thereby contributing to improved interpretability.
arXiv Detail & Related papers (2024-04-04T10:56:30Z) - Automatic Curriculum Learning with Gradient Reward Signals [0.0]
We introduce a framework where the teacher model, utilizing the gradient norm information of a student model, dynamically adapts the learning curriculum.
We analyze how gradient norm rewards influence the teacher's ability to craft challenging yet achievable learning sequences, ultimately enhancing the student's performance.
arXiv Detail & Related papers (2023-12-21T04:19:43Z) - RLIF: Interactive Imitation Learning as Reinforcement Learning [56.997263135104504]
We show how off-policy reinforcement learning can enable improved performance under assumptions that are similar to, but potentially even more practical than, those of interactive imitation learning.
Our proposed method uses reinforcement learning with user intervention signals themselves as rewards.
This relaxes the assumption that intervening experts in interactive imitation learning should be near-optimal and enables the algorithm to learn behaviors that improve over a potentially suboptimal human expert.
arXiv Detail & Related papers (2023-11-21T21:05:21Z) - Exploiting Symmetry and Heuristic Demonstrations in Off-policy
Reinforcement Learning for Robotic Manipulation [1.7901837062462316]
This paper aims to define and incorporate the natural symmetry present in physical robotic environments.
The proposed method is validated via two point-to-point reaching tasks of an industrial arm, with and without an obstacle.
A comparison study between the proposed method and a traditional off-policy reinforcement learning algorithm indicates its advantage in learning performance and its potential value for applications.
arXiv Detail & Related papers (2023-04-12T11:38:01Z) - Accelerating Self-Imitation Learning from Demonstrations via Policy
Constraints and Q-Ensemble [6.861783783234304]
We propose a learning from demonstrations method named A-SILfD.
A-SILfD treats expert demonstrations as the agent's successful experiences and uses these experiences to constrain policy improvement.
In four Mujoco continuous control tasks, A-SILfD can significantly outperform baseline methods after 150,000 steps of online training.
arXiv Detail & Related papers (2022-12-07T10:29:13Z) - Automated Fidelity Assessment for Strategy Training in Inpatient
Rehabilitation using Natural Language Processing [53.096237570992294]
Strategy training is a rehabilitation approach that teaches skills to reduce disability among those with cognitive impairments following a stroke.
Standardized fidelity assessment is used to measure adherence to treatment principles.
We developed a rule-based NLP algorithm, a long short-term memory (LSTM) model, and a bidirectional encoder representations from transformers (BERT) model for this task.
arXiv Detail & Related papers (2022-09-14T15:33:30Z) - Imitating, Fast and Slow: Robust learning from demonstrations via
decision-time planning [96.72185761508668]
IMPLANT is a new meta-algorithm for imitation learning that incorporates planning at test time.
We demonstrate that IMPLANT significantly outperforms benchmark imitation learning approaches on standard control environments.
arXiv Detail & Related papers (2022-04-07T17:16:52Z) - Off-Policy Imitation Learning from Observations [78.30794935265425]
Learning from Observations (LfO) is a practical reinforcement learning scenario from which many applications can benefit.
We propose a sample-efficient LfO approach that enables off-policy optimization in a principled manner.
Our approach is comparable with the state of the art on locomotion tasks in terms of both sample efficiency and performance.
arXiv Detail & Related papers (2021-02-25T21:33:47Z) - Towards Understanding the Adversarial Vulnerability of Skeleton-based
Action Recognition [133.35968094967626]
Skeleton-based action recognition has attracted increasing attention due to its strong adaptability to dynamic circumstances.
With the help of deep learning techniques, it has also witnessed substantial progress and currently achieves around 90% accuracy in benign environments.
Research on the vulnerability of skeleton-based action recognition under different adversarial settings remains scant.
arXiv Detail & Related papers (2020-05-14T17:12:52Z)