LiMIIRL: Lightweight Multiple-Intent Inverse Reinforcement Learning
- URL: http://arxiv.org/abs/2106.01777v1
- Date: Thu, 3 Jun 2021 12:00:38 GMT
- Title: LiMIIRL: Lightweight Multiple-Intent Inverse Reinforcement Learning
- Authors: Aaron J. Snoswell, Surya P. N. Singh, Nan Ye
- Abstract summary: Multiple-Intent Inverse Reinforcement Learning seeks to find a reward function ensemble to rationalize demonstrations of different but unlabelled intents.
We present a warm-start strategy based on up-front clustering of the demonstrations in feature space.
We also propose an MI-IRL performance metric that generalizes the popular Expected Value Difference measure.
- Score: 5.1779694507922835
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multiple-Intent Inverse Reinforcement Learning (MI-IRL) seeks to find a
reward function ensemble to rationalize demonstrations of different but
unlabelled intents. Within the popular expectation maximization (EM) framework
for learning probabilistic MI-IRL models, we present a warm-start strategy
based on up-front clustering of the demonstrations in feature space. Our
theoretical analysis shows that this warm-start solution produces a
near-optimal reward ensemble, provided the behavior modes satisfy mild
separation conditions. We also propose an MI-IRL performance metric that
generalizes the popular Expected Value Difference measure to directly assess
learned rewards against the ground-truth reward ensemble. Our metric elegantly
addresses the difficulty of pairing up learned and ground-truth rewards via a
min-cost flow formulation, and is efficiently computable. We also develop an
MI-IRL benchmark problem that allows for more comprehensive algorithmic
evaluations. On this problem, we find our MI-IRL warm-start strategy helps
avoid poor quality local minima reward ensembles, resulting in a significant
improvement in behavior clustering. Our extensive sensitivity analysis
demonstrates that the quality of the learned reward ensembles is improved under
various settings, including cases where our theoretical assumptions do not
necessarily hold. Finally, we demonstrate the effectiveness of our methods by
discovering distinct driving styles in a large real-world dataset of driver GPS
trajectories.
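The abstract describes two concrete mechanisms: a clustering-based warm start for EM, and a min-cost-flow pairing step inside the evaluation metric. The sketch below illustrates only the warm-start idea at a high level; `feature_fn`, `num_intents`, and the choice of k-means are assumptions for illustration, not the authors' reference implementation.

```python
# Illustrative sketch (not the paper's code): cluster demonstrations by their
# empirical feature expectations, then turn the cluster labels into initial
# EM responsibilities for the MI-IRL reward ensemble.
import numpy as np
from sklearn.cluster import KMeans

def warm_start_responsibilities(demonstrations, feature_fn, num_intents, seed=0):
    """demonstrations: list of trajectories, each a list of (state, action) pairs.
    feature_fn: maps (state, action) to a feature vector phi(s, a)  [assumed interface].
    num_intents: assumed number of behaviour modes K in the ensemble."""
    # Per-demonstration feature expectation: mean of phi over the trajectory.
    phi_bar = np.stack([
        np.mean([feature_fn(s, a) for s, a in traj], axis=0)
        for traj in demonstrations
    ])
    # Up-front clustering in feature space (k-means is one simple choice).
    labels = KMeans(n_clusters=num_intents, random_state=seed, n_init=10).fit_predict(phi_bar)
    # One-hot cluster assignments become the initial EM responsibilities,
    # replacing a random initialisation of the reward ensemble.
    return np.eye(num_intents)[labels]
```

For the proposed metric, the difficulty is matching each learned reward to a ground-truth reward before aggregating a divergence such as Expected Value Difference. When the learned and ground-truth ensembles have the same size, that matching reduces to a linear assignment problem, a special case of the min-cost flow formulation mentioned above; the following is a hedged sketch under that assumption, with `evd_matrix` as a hypothetical precomputed divergence table.

```python
# Illustrative sketch (not the paper's code): pair learned and ground-truth
# rewards by minimum-cost matching over a precomputed divergence matrix.
import numpy as np
from scipy.optimize import linear_sum_assignment

def ensemble_divergence(evd_matrix: np.ndarray) -> float:
    """evd_matrix[i, j]: divergence (e.g. EVD) between learned reward i and true reward j."""
    rows, cols = linear_sum_assignment(evd_matrix)  # min-cost perfect matching
    return float(evd_matrix[rows, cols].mean())
```

In the general case described in the abstract (for example, weighted mixtures or unequal ensemble sizes), the pairing would instead be posed as a full min-cost flow problem; the assignment version above is only the simplest instance.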
Related papers
- R3HF: Reward Redistribution for Enhancing Reinforcement Learning from Human Feedback [25.27230140274847]
Reinforcement learning from human feedback (RLHF) provides a paradigm for aligning large language models (LLMs) with human preferences.
This paper proposes a novel reward redistribution method called R3HF, which facilitates a more fine-grained, token-level reward allocation.
arXiv Detail & Related papers (2024-11-13T02:45:21Z)
- Reward-Augmented Data Enhances Direct Preference Alignment of LLMs [56.24431208419858]
We introduce reward-conditioned Large Language Models (LLMs) that learn from the entire spectrum of response quality within the dataset.
We propose an effective yet simple data relabeling method that conditions the preference pairs on quality scores to construct a reward-augmented dataset.
arXiv Detail & Related papers (2024-10-10T16:01:51Z)
- Exploring Missing Modality in Multimodal Egocentric Datasets [89.76463983679058]
We introduce a novel concept, the Missing Modality Token (MMT), to maintain performance even when modalities are absent.
Our method mitigates the performance loss, reducing it from its original $\sim 30\%$ drop to only $\sim 10\%$ when half of the test set is modal-incomplete.
arXiv Detail & Related papers (2024-01-21T11:55:42Z)
- Routing to the Expert: Efficient Reward-guided Ensemble of Large Language Models [69.51130760097818]
We propose Zooter, a reward-guided routing method distilling rewards on training queries to train a routing function.
We evaluate Zooter on a comprehensive benchmark collection with 26 subsets on different domains and tasks.
arXiv Detail & Related papers (2023-11-15T04:40:43Z)
- Principled Reinforcement Learning with Human Feedback from Pairwise or $K$-wise Comparisons [79.98542868281473]
We provide a theoretical framework for Reinforcement Learning with Human Feedback (RLHF).
We show that when training a policy based on the learned reward model, MLE fails while a pessimistic MLE provides policies with improved performance under certain coverage assumptions.
arXiv Detail & Related papers (2023-01-26T18:07:21Z)
- Basis for Intentions: Efficient Inverse Reinforcement Learning using Past Experience [89.30876995059168]
This paper addresses the problem of inverse reinforcement learning (IRL): inferring the reward function of an agent from observations of its behavior.
arXiv Detail & Related papers (2022-08-09T17:29:49Z)
- Learning Dense Reward with Temporal Variant Self-Supervision [5.131840233837565]
Complex real-world robotic applications lack explicit and informative descriptions that can directly be used as rewards.
Previous work has shown that it is possible to algorithmically extract dense rewards directly from multimodal observations.
This paper proposes a more efficient and robust way of sampling and learning.
arXiv Detail & Related papers (2022-05-20T20:30:57Z)
- Sample Efficient Imitation Learning via Reward Function Trained in Advance [2.66512000865131]
Imitation learning (IL) is a framework that learns to imitate expert behavior from demonstrations.
In this article, we make an effort to improve sample efficiency by introducing a novel scheme of inverse reinforcement learning.
arXiv Detail & Related papers (2021-11-23T08:06:09Z)
- Softmax with Regularization: Better Value Estimation in Multi-Agent Reinforcement Learning [72.28520951105207]
Overestimation in $Q$-learning is an important problem that has been extensively studied in single-agent reinforcement learning.
We propose a novel regularization-based update scheme that penalizes large joint action-values deviating from a baseline.
We show that our method provides a consistent performance improvement on a set of challenging StarCraft II micromanagement tasks.
arXiv Detail & Related papers (2021-03-22T14:18:39Z)
- Inverse Reinforcement Learning via Matching of Optimality Profiles [2.561053769852449]
We propose an algorithm that learns a reward function from demonstrations of suboptimal or heterogeneous performance.
We show that our method is capable of learning reward functions such that policies trained to optimize them outperform the demonstrations used for fitting the reward functions.
arXiv Detail & Related papers (2020-11-18T13:23:43Z)
- Efficient Exploration of Reward Functions in Inverse Reinforcement Learning via Bayesian Optimization [43.51553742077343]
Inverse reinforcement learning (IRL) is relevant to a variety of tasks including value alignment and robot learning from demonstration.
This paper presents an IRL framework called Bayesian optimization-IRL (BO-IRL) which identifies multiple solutions consistent with the expert demonstrations.
arXiv Detail & Related papers (2020-11-17T10:17:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.