Learning from Suboptimal Demonstration via Self-Supervised Reward
Regression
- URL: http://arxiv.org/abs/2010.11723v3
- Date: Mon, 23 Nov 2020 16:07:38 GMT
- Title: Learning from Suboptimal Demonstration via Self-Supervised Reward
Regression
- Authors: Letian Chen, Rohan Paleja, Matthew Gombolay
- Abstract summary: Learning from Demonstration (LfD) seeks to democratize robotics by enabling non-roboticist end-users to teach robots to perform a task by providing a human demonstration.
Modern LfD techniques, e.g. inverse reinforcement learning (IRL), assume users provide at least stochastically optimal demonstrations.
We show these approaches make incorrect assumptions and thus suffer from brittle, degraded performance.
We present a physical demonstration of teaching a robot a topspin strike in table tennis that achieves 32% faster returns and 40% more topspin than the user's demonstration.
- Score: 1.2891210250935146
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning from Demonstration (LfD) seeks to democratize robotics by enabling
non-roboticist end-users to teach robots to perform a task by providing a human
demonstration. However, modern LfD techniques, e.g. inverse reinforcement
learning (IRL), assume users provide at least stochastically optimal
demonstrations. This assumption fails to hold in most real-world scenarios.
Recent attempts to learn from sub-optimal demonstration leverage pairwise
rankings and the Luce-Shepard rule. However, we show these approaches
make incorrect assumptions and thus suffer from brittle, degraded performance.
We overcome these limitations in developing a novel approach that bootstraps
off suboptimal demonstrations to synthesize optimality-parameterized data to
train an idealized reward function. We empirically validate that we learn an
idealized reward function with ~0.95 correlation with ground-truth reward
versus ~0.75 for prior work. We can then train policies achieving ~200%
improvement over the suboptimal demonstration and ~90% improvement over prior
work. We present a physical demonstration of teaching a robot a topspin strike
in table tennis that achieves 32% faster returns and 40% more topspin than the
user's demonstration.
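The approach described above, bootstrapping off a suboptimal demonstration to synthesize optimality-parameterized data and then regressing an idealized reward, can be illustrated with a small, self-contained sketch. Everything below (the toy point-mass environment, the fixed sigmoid standing in for a fitted noise-vs-performance curve, the network size, and the hyperparameters) is an illustrative assumption, not the authors' implementation.

```python
# Hedged sketch of self-supervised reward regression:
# 1) roll out a (suboptimal) base controller with increasing action noise,
# 2) map each noise level to an idealized return via a sigmoid,
# 3) regress a reward network so trajectory returns match that idealized curve.
import numpy as np
import torch
import torch.nn as nn

rng = np.random.default_rng(0)

def rollout(noise, horizon=50):
    """Toy 1-D point-mass: a suboptimal controller pushes toward the goal at 0;
    action noise of magnitude `noise` degrades it. Returns (states, return)."""
    x, states, ret = 5.0, [], 0.0
    for _ in range(horizon):
        x = x + (-0.2 * x + noise * rng.normal())
        states.append([x])
        ret += -abs(x)  # toy score; stands in for a reward estimate bootstrapped from demonstrations
    return np.array(states, dtype=np.float32), ret

# 1) Synthesize optimality-parameterized data: trajectories at many noise levels.
noise_levels = np.linspace(0.0, 3.0, 30)
data = [(rollout(eps), eps) for eps in noise_levels for _ in range(5)]

# 2) Idealized performance-vs-noise curve. A fixed logistic is used here as a
#    stand-in for a curve fitted to the synthesized data.
returns = np.array([ret for ((_, ret), _) in data])
lo, hi = returns.min(), returns.max()
def idealized_return(eps, k=3.0, mid=1.5):
    return hi - (hi - lo) / (1.0 + np.exp(-k * (eps - mid)))

# 3) Regress a state-based reward network so that the summed predicted reward of
#    each trajectory matches the idealized return for its noise level.
reward_net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.Adam(reward_net.parameters(), lr=1e-3)
for epoch in range(200):
    for (states, _), eps in data:
        pred = reward_net(torch.from_numpy(states)).sum()
        loss = (pred - torch.tensor(float(idealized_return(eps)))) ** 2
        opt.zero_grad()
        loss.backward()
        opt.step()
```

The learned reward network could then be handed to any off-the-shelf RL algorithm to optimize a policy, which is where improvements over the original demonstration would come from.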
Related papers
- FLaRe: Achieving Masterful and Adaptive Robot Policies with Large-Scale Reinforcement Learning Fine-Tuning [74.25049012472502]
FLaRe is a large-scale Reinforcement Learning framework that integrates robust pre-trained representations, large-scale training, and gradient stabilization techniques.
Our method aligns pre-trained policies towards task completion, achieving state-of-the-art (SoTA) performance on previously demonstrated and on entirely novel tasks and embodiments.
arXiv Detail & Related papers (2024-09-25T03:15:17Z) - Getting More Juice Out of the SFT Data: Reward Learning from Human Demonstration Improves SFT for LLM Alignment [65.15914284008973]
We propose to leverage an Inverse Reinforcement Learning (IRL) technique to simultaneously build a reward model and a policy model.
We show that the proposed algorithms converge to the stationary solutions of the IRL problem.
Our results indicate that it is beneficial to leverage reward learning throughout the entire alignment process.
arXiv Detail & Related papers (2024-05-28T07:11:05Z) - Reward Learning from Suboptimal Demonstrations with Applications in Surgical Electrocautery [10.38505087117544]
This paper introduces a sample-efficient method that learns a robust reward function from a limited amount of ranked suboptimal demonstrations.
We show that using a learned reward function to obtain a policy is more robust than pure imitation learning.
We apply our approach on a physical surgical electrocautery task and demonstrate that our method can perform well even when the provided demonstrations are suboptimal.
arXiv Detail & Related papers (2024-04-10T17:40:27Z) - Self-Improving Robots: End-to-End Autonomous Visuomotor Reinforcement
Learning [54.636562516974884]
In imitation and reinforcement learning, the cost of human supervision limits the amount of data that robots can be trained on.
In this work, we propose MEDAL++, a novel design for self-improving robotic systems.
The robot autonomously practices the task by learning to both do and undo the task, simultaneously inferring the reward function from the demonstrations.
arXiv Detail & Related papers (2023-03-02T18:51:38Z) - NeRF in the Palm of Your Hand: Corrective Augmentation for Robotics via
Novel-View Synthesis [50.93065653283523]
SPARTN (Synthetic Perturbations for Augmenting Robot Trajectories via NeRF) is a fully-offline data augmentation scheme for improving robot policies.
Our approach leverages neural radiance fields (NeRFs) to synthetically inject corrective noise into visual demonstrations.
In a simulated 6-DoF visual grasping benchmark, SPARTN improves success rates by 2.8x over imitation learning without the corrective augmentations.
arXiv Detail & Related papers (2023-01-18T23:25:27Z) - Learning Preferences for Interactive Autonomy [1.90365714903665]
This thesis is an attempt towards learning reward functions from human users by using other, more reliable data modalities.
We first propose various forms of comparative feedback, e.g., pairwise comparisons, best-of-many choices, rankings, scaled comparisons; and describe how a robot can use these various forms of human feedback to infer a reward function.
arXiv Detail & Related papers (2022-10-19T21:34:51Z) - Fast Lifelong Adaptive Inverse Reinforcement Learning from
Demonstrations [1.6050172226234585]
We propose a novel LfD framework, Fast Lifelong Adaptive Inverse Reinforcement Learning (FLAIR).
We empirically validate that FLAIR achieves adaptability (i.e., the robot adapts to heterogeneous, user-specific task preferences), efficiency (i.e., the robot achieves sample-efficient adaptation), and scalability.
FLAIR surpasses benchmarks across three control tasks with an average 57% improvement in policy returns and an average 78% fewer episodes required for demonstration modeling.
arXiv Detail & Related papers (2022-09-24T02:48:02Z) - Towards Sample-efficient Apprenticeship Learning from Suboptimal
Demonstration [1.6114012813668934]
We present Systematic Self-Supervised Reward Regression, S3RR, to investigate systematic alternatives for trajectory degradation.
We find S3RR learns reward correlation with ground truth that is comparable to or better than a state-of-the-art learning-from-suboptimal-demonstration framework.
arXiv Detail & Related papers (2021-10-08T19:15:32Z) - A Framework for Efficient Robotic Manipulation [79.10407063260473]
We show that, given only 10 demonstrations, a single robotic arm can learn sparse-reward manipulation policies from pixels.
arXiv Detail & Related papers (2020-12-14T22:18:39Z) - Semi-supervised reward learning for offline reinforcement learning [71.6909757718301]
Training agents usually requires reward functions, but rewards are seldom available in practice and their engineering is challenging and laborious.
We propose semi-supervised learning algorithms that learn from limited annotations and incorporate unlabelled data.
In our experiments with a simulated robotic arm, we greatly improve upon behavioural cloning and closely approach the performance achieved with ground truth rewards.
arXiv Detail & Related papers (2020-12-12T20:06:15Z) - Active Preference-Based Gaussian Process Regression for Reward Learning [42.697198807877925]
One common approach is to learn reward functions from collected expert demonstrations.
We present a preference-based learning approach in which, as an alternative, the human feedback takes the form only of comparisons between trajectories (a minimal sketch of this pairwise-comparison reward loss appears after this list).
Our approach enables us to tackle both inflexibility and data-inefficiency problems within a preference-based learning framework.
arXiv Detail & Related papers (2020-05-06T03:29:27Z)
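Several of the entries above (the ranked-suboptimal-demonstration, comparative-feedback, and preference-based papers), as well as the prior work criticized in the main abstract, build on the same pairwise-ranking idea: a Luce-Shepard / Bradley-Terry likelihood over trajectory pairs, trained so the preferred trajectory receives higher cumulative predicted reward. A minimal sketch, with random toy trajectories and illustrative network sizes in place of real demonstration data, might look like this:

```python
# Hedged sketch of ranking-based reward learning over trajectory pairs.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
STATE_DIM = 4

reward_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(reward_net.parameters(), lr=1e-4)

def trajectory_return(states):
    """Sum of per-state predicted rewards along one trajectory (T x STATE_DIM)."""
    return reward_net(states).sum()

# Toy preference data: each pair is (worse, better). In practice the pairs come from
# human rankings or systematically degraded demonstrations, not random tensors.
pairs = [(torch.randn(30, STATE_DIM), torch.randn(30, STATE_DIM)) for _ in range(64)]

for epoch in range(100):
    for worse, better in pairs:
        # Luce-Shepard / Bradley-Terry: P(better > worse) is a softmax over the two
        # trajectory returns, trained with cross-entropy toward the preferred index.
        logits = torch.stack([trajectory_return(worse), trajectory_return(better)])
        loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([1]))
        opt.zero_grad()
        loss.backward()
        opt.step()
```

The main paper above argues that the assumptions behind this ranking likelihood are what make such approaches brittle, which is what its regression-based alternative is designed to avoid.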
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.