Towards Sample-efficient Apprenticeship Learning from Suboptimal
Demonstration
- URL: http://arxiv.org/abs/2110.04347v1
- Date: Fri, 8 Oct 2021 19:15:32 GMT
- Title: Towards Sample-efficient Apprenticeship Learning from Suboptimal
Demonstration
- Authors: Letian Chen, Rohan Paleja, Matthew Gombolay
- Abstract summary: We present Systematic Self-Supervised Reward Regression, S3RR, to investigate systematic alternatives for trajectory degradation.
We find S3RR learns reward functions whose correlation with the ground-truth reward is comparable to or better than that of a state-of-the-art learning-from-suboptimal-demonstration framework.
- Score: 1.6114012813668934
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning from Demonstration (LfD) seeks to democratize robotics by enabling
non-roboticist end-users to teach robots to perform novel tasks by providing
demonstrations. However, as demonstrators are typically non-experts, modern LfD
techniques are unable to produce policies much better than the suboptimal
demonstration. A previously-proposed framework, SSRR, has shown success in
learning from suboptimal demonstration but relies on noise-injected
trajectories to infer an idealized reward function. A random approach such as
noise injection to generate trajectories has two key drawbacks: 1) performance
degradation could be random, depending on whether the noise is applied to vital
states, and 2) noise-injection-generated trajectories may have limited
suboptimality and therefore will not accurately represent the whole scope of
suboptimality. We present Systematic Self-Supervised Reward Regression, S3RR,
to investigate systematic alternatives for trajectory degradation. We carry out
empirical evaluations and find that S3RR learns reward functions whose correlation
with the ground-truth reward is comparable to or better than that of a
state-of-the-art learning-from-suboptimal-demonstration framework.
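To make the contrast between the two degradation schemes concrete, here is a minimal Python sketch (not from the paper; the gym-style environment API, the `policy_checkpoints` list, and all function names are assumptions for illustration). Noise injection perturbs actions at random time steps, as in SSRR, while a systematic alternative in the spirit of S3RR controls suboptimality directly, for example by rolling out progressively earlier policy checkpoints.

```python
import numpy as np

def noise_injected_rollout(policy, env, noise_level, horizon=200):
    """SSRR-style degradation: replace the policy's action with a random one
    with probability `noise_level`. How much return actually drops depends on
    which states happen to be perturbed."""
    obs = env.reset()
    trajectory, traj_return = [], 0.0
    for _ in range(horizon):
        if np.random.rand() < noise_level:
            action = env.action_space.sample()   # random perturbation
        else:
            action = policy(obs)
        obs, reward, done, _ = env.step(action)
        trajectory.append((obs, action))
        traj_return += reward
        if done:
            break
    return trajectory, traj_return

def systematically_degraded_rollouts(policy_checkpoints, env):
    """Systematic degradation (hypothetical S3RR-style variant): roll out
    progressively weaker policies, e.g. earlier training checkpoints, so
    suboptimality varies in a controlled way rather than at random."""
    return [noise_injected_rollout(pi, env, noise_level=0.0)
            for pi in policy_checkpoints]
```

Either set of synthetic trajectories would then serve as regression targets for a reward model, ordered by degradation level; the abstract's point is that the systematic variant controls where on the suboptimality spectrum those targets fall, rather than leaving it to chance.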
Related papers
- Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs [73.74375912785689]
This paper proposes unified training strategies for speech recognition systems.
We demonstrate that training a single model for all three tasks enhances visual (VSR) and audiovisual (AVSR) speech recognition performance.
We also introduce a greedy pseudo-labelling approach to more effectively leverage unlabelled samples.
arXiv Detail & Related papers (2024-11-04T16:46:53Z) - Getting More Juice Out of the SFT Data: Reward Learning from Human Demonstration Improves SFT for LLM Alignment [65.15914284008973]
We propose to leverage an Inverse Reinforcement Learning (IRL) technique to simultaneously build a reward model and a policy model.
We show that the proposed algorithms converge to the stationary solutions of the IRL problem.
Our results indicate that it is beneficial to leverage reward learning throughout the entire alignment process.
arXiv Detail & Related papers (2024-05-28T07:11:05Z) - Robustness of Demonstration-based Learning Under Limited Data Scenario [54.912936555876826]
Demonstration-based learning has shown great potential in stimulating pretrained language models' ability under limited-data scenarios.
Why such demonstrations are beneficial for the learning process remains unclear since there is no explicit alignment between the demonstrations and the predictions.
In this paper, we design pathological demonstrations by gradually removing intuitively useful information from the standard ones to take a deep dive into the robustness of demonstration-based sequence labeling.
arXiv Detail & Related papers (2022-10-19T16:15:04Z) - Self-Imitation Learning from Demonstrations [4.907551775445731]
Self-Imitation Learning exploits the agent's past good experiences to learn from suboptimal demonstrations.
We show that SILfD can learn from demonstrations that are noisy or far from optimal.
We also find SILfD superior to the existing state-of-the-art LfD algorithms in sparse environments.
arXiv Detail & Related papers (2022-03-21T11:56:56Z) - Improving Learning from Demonstrations by Learning from Experience [4.605233477425785]
We propose a new algorithm named TD3fG that can smoothly transition from learning from experts to learning from experience.
Our algorithm achieves good performance in the MUJOCO environment with limited and sub-optimal demonstrations.
arXiv Detail & Related papers (2021-11-16T00:40:31Z) - Learning from Demonstration without Demonstrations [5.027571997864707]
We propose Probabilistic Planning for Demonstration Discovery (P2D2), a technique for automatically discovering demonstrations without access to an expert.
We formulate discovering demonstrations as a search problem and leverage widely used planning algorithms such as Rapidly-exploring Random Trees (RRT) to find demonstration trajectories.
We experimentally demonstrate the method outperforms classic and intrinsic exploration RL techniques in a range of classic control and robotics tasks.
arXiv Detail & Related papers (2021-06-17T01:57:08Z) - DEALIO: Data-Efficient Adversarial Learning for Imitation from
Observation [57.358212277226315]
In imitation learning from observation (IfO), a learning agent seeks to imitate a demonstrating agent using only observations of the demonstrated behavior, without access to the control signals generated by the demonstrator.
Recent methods based on adversarial imitation learning have led to state-of-the-art performance on IfO problems, but they typically suffer from high sample complexity due to a reliance on data-inefficient, model-free reinforcement learning algorithms.
This issue makes them impractical to deploy in real-world settings, where gathering samples can incur high costs in terms of time, energy, and risk.
We propose a more data-efficient IfO algorithm.
arXiv Detail & Related papers (2021-03-31T23:46:32Z) - Learning to Shift Attention for Motion Generation [55.61994201686024]
One challenge of motion generation using robot learning from demonstration techniques is that human demonstrations follow a distribution with multiple modes for one task query.
Previous approaches fail to capture all modes or tend to average modes of the demonstrations and thus generate invalid trajectories.
We propose a motion generation model with extrapolation ability to overcome this problem.
arXiv Detail & Related papers (2021-02-24T09:07:52Z) - Learning from Suboptimal Demonstration via Self-Supervised Reward
Regression [1.2891210250935146]
Learning from Demonstration (LfD) seeks to democratize robotics by enabling non-roboticist end-users to teach robots to perform a task by providing a human demonstration.
Modern LfD techniques, e.g. inverse reinforcement learning (IRL), assume users provide at least stochastically optimal demonstrations.
We show these approaches make incorrect assumptions and thus suffer from brittle, degraded performance.
We present a physical demonstration of teaching a robot a topspin strike in table tennis, achieving 32% faster returns and 40% more topspin than the user demonstration.
arXiv Detail & Related papers (2020-10-17T04:18:04Z) - Guided Uncertainty-Aware Policy Optimization: Combining Learning and
Model-Based Strategies for Sample-Efficient Policy Learning [75.56839075060819]
Traditional robotic approaches rely on an accurate model of the environment, a detailed description of how to perform the task, and a robust perception system to keep track of the current state.
Reinforcement learning approaches can operate directly from raw sensory inputs with only a reward signal to describe the task, but are extremely sample-inefficient and brittle.
In this work, we combine the strengths of model-based methods with the flexibility of learning-based methods to obtain a general method that is able to overcome inaccuracies in the robotics perception/actuation pipeline.
arXiv Detail & Related papers (2020-05-21T19:47:05Z) - Learning rewards for robotic ultrasound scanning using probabilistic
temporal ranking [17.494224125794187]
This work considers the inverse problem, where the goal of the task is unknown, and a reward function needs to be inferred from example demonstrations.
Many existing reward inference strategies are unsuited to this class of problems, due to the exploratory nature of the demonstrations.
We formalise this probabilistic temporal ranking approach and show that it improves upon existing approaches.
arXiv Detail & Related papers (2020-02-04T11:58:38Z)