Towards Sample-efficient Apprenticeship Learning from Suboptimal
Demonstration
- URL: http://arxiv.org/abs/2110.04347v1
- Date: Fri, 8 Oct 2021 19:15:32 GMT
- Title: Towards Sample-efficient Apprenticeship Learning from Suboptimal
Demonstration
- Authors: Letian Chen, Rohan Paleja, Matthew Gombolay
- Abstract summary: We present Systematic Self-Supervised Reward Regression, S3RR, to investigate systematic alternatives for trajectory degradation.
We find S3RR learns reward functions whose correlation with the ground-truth reward is comparable to or better than that of a state-of-the-art learning-from-suboptimal-demonstration framework.
- Score: 1.6114012813668934
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning from Demonstration (LfD) seeks to democratize robotics by enabling
non-roboticist end-users to teach robots to perform novel tasks by providing
demonstrations. However, as demonstrators are typically non-experts, modern LfD
techniques are unable to produce policies much better than the suboptimal
demonstration. A previously-proposed framework, SSRR, has shown success in
learning from suboptimal demonstration but relies on noise-injected
trajectories to infer an idealized reward function. A random approach such as
noise injection to generate trajectories has two key drawbacks: 1) performance
degradation could be random, depending on whether the noise is applied to vital
states, and 2) noise-injection-generated trajectories may have limited
suboptimality and therefore will not accurately represent the whole scope of
suboptimality. We present Systematic Self-Supervised Reward Regression, S3RR,
to investigate systematic alternatives for trajectory degradation. We carry out
empirical evaluations and find that S3RR learns reward functions whose correlation
with the ground-truth reward is comparable to or better than that of a
state-of-the-art learning-from-suboptimal-demonstration framework.
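To make the contrast between the two degradation schemes concrete, here is a minimal Python sketch (not from the paper; the gym-style environment API, the `policy_checkpoints` list, and all function names are assumptions for illustration). Noise injection perturbs actions at random time steps, as in SSRR, while a systematic alternative in the spirit of S3RR controls suboptimality directly, for example by rolling out progressively earlier policy checkpoints.

```python
import numpy as np

def noise_injected_rollout(policy, env, noise_level, horizon=200):
    """SSRR-style degradation: replace the policy's action with a random one
    with probability `noise_level`. How much return actually drops depends on
    which states happen to be perturbed."""
    obs = env.reset()
    trajectory, traj_return = [], 0.0
    for _ in range(horizon):
        if np.random.rand() < noise_level:
            action = env.action_space.sample()   # random perturbation
        else:
            action = policy(obs)
        obs, reward, done, _ = env.step(action)
        trajectory.append((obs, action))
        traj_return += reward
        if done:
            break
    return trajectory, traj_return

def systematically_degraded_rollouts(policy_checkpoints, env):
    """Systematic degradation (hypothetical S3RR-style variant): roll out
    progressively weaker policies, e.g. earlier training checkpoints, so
    suboptimality varies in a controlled way rather than at random."""
    return [noise_injected_rollout(pi, env, noise_level=0.0)
            for pi in policy_checkpoints]
```

Either set of synthetic trajectories would then serve as regression targets for a reward model, ordered by degradation level; the abstract's point is that the systematic variant controls where on the suboptimality spectrum those targets fall, rather than leaving it to chance.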
Related papers
- Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs [73.74375912785689]
This paper proposes unified training strategies for speech recognition systems.
We demonstrate that training a single model for all three tasks enhances visual (VSR) and audiovisual (AVSR) speech recognition performance.
We also introduce a greedy pseudo-labelling approach to more effectively leverage unlabelled samples.
arXiv Detail & Related papers (2024-11-04T16:46:53Z) - Getting More Juice Out of the SFT Data: Reward Learning from Human Demonstration Improves SFT for LLM Alignment [65.15914284008973]
We propose to leverage an Inverse Reinforcement Learning (IRL) technique to simultaneously build a reward model and a policy model.
We show that the proposed algorithms converge to the stationary solutions of the IRL problem.
Our results indicate that it is beneficial to leverage reward learning throughout the entire alignment process.
arXiv Detail & Related papers (2024-05-28T07:11:05Z) - Robustness of Demonstration-based Learning Under Limited Data Scenario [54.912936555876826]
Demonstration-based learning has shown great potential in stimulating pretrained language models' ability under limited-data scenarios.
Why such demonstrations are beneficial for the learning process remains unclear since there is no explicit alignment between the demonstrations and the predictions.
In this paper, we design pathological demonstrations by gradually removing intuitively useful information from the standard ones to take a deep dive into the robustness of demonstration-based sequence labeling.
arXiv Detail & Related papers (2022-10-19T16:15:04Z) - Self-Imitation Learning from Demonstrations [4.907551775445731]
Self-Imitation Learning exploits the agent's past good experiences to learn from suboptimal demonstrations.
We show that SILfD can learn from demonstrations that are noisy or far from optimal.
We also find SILfD superior to the existing state-of-the-art LfD algorithms in sparse environments.
arXiv Detail & Related papers (2022-03-21T11:56:56Z) - Improving Learning from Demonstrations by Learning from Experience [4.605233477425785]
We propose a new algorithm named TD3fG that can smoothly transition from learning from experts to learning from experience.
Our algorithm achieves good performance in the MUJOCO environment with limited and sub-optimal demonstrations.
arXiv Detail & Related papers (2021-11-16T00:40:31Z) - Learning from Demonstration without Demonstrations [5.027571997864707]
We propose Probabilistic Planning for Demonstration Discovery (P2D2), a technique for automatically discovering demonstrations without access to an expert.
We formulate discovering demonstrations as a search problem and leverage widely used planning algorithms such as Rapidly-exploring Random Trees (RRT) to find demonstration trajectories.
We experimentally demonstrate the method outperforms classic and intrinsic exploration RL techniques in a range of classic control and robotics tasks.
arXiv Detail & Related papers (2021-06-17T01:57:08Z) - DEALIO: Data-Efficient Adversarial Learning for Imitation from
Observation [57.358212277226315]
In imitation learning from observation (IfO), a learning agent seeks to imitate a demonstrating agent using only observations of the demonstrated behavior, without access to the control signals generated by the demonstrator.
Recent methods based on adversarial imitation learning have led to state-of-the-art performance on IfO problems, but they typically suffer from high sample complexity due to a reliance on data-inefficient, model-free reinforcement learning algorithms.
This issue makes them impractical to deploy in real-world settings, where gathering samples can incur high costs in terms of time, energy, and risk.
We propose a more data-efficient IfO algorithm.
arXiv Detail & Related papers (2021-03-31T23:46:32Z) - Learning to Shift Attention for Motion Generation [55.61994201686024]
One challenge of motion generation using robot learning from demonstration techniques is that human demonstrations follow a distribution with multiple modes for one task query.
Previous approaches fail to capture all modes or tend to average modes of the demonstrations and thus generate invalid trajectories.
We propose a motion generation model with extrapolation ability to overcome this problem.
arXiv Detail & Related papers (2021-02-24T09:07:52Z) - Learning from Suboptimal Demonstration via Self-Supervised Reward
Regression [1.2891210250935146]
Learning from Demonstration (LfD) seeks to democratize robotics by enabling non-roboticist end-users to teach robots to perform a task by providing a human demonstration.
Modern LfD techniques, e.g. inverse reinforcement learning (IRL), assume users provide at least stochastically optimal demonstrations.
We show these approaches make incorrect assumptions and thus suffer from brittle, degraded performance.
We present a physical demonstration of teaching a robot a topspin strike in table tennis, achieving 32% faster returns and 40% more topspin than the user demonstration.
arXiv Detail & Related papers (2020-10-17T04:18:04Z) - Guided Uncertainty-Aware Policy Optimization: Combining Learning and
Model-Based Strategies for Sample-Efficient Policy Learning [75.56839075060819]
Traditional robotic approaches rely on an accurate model of the environment, a detailed description of how to perform the task, and a robust perception system to keep track of the current state.
Reinforcement learning approaches can operate directly from raw sensory inputs with only a reward signal to describe the task, but are extremely sample-inefficient and brittle.
In this work, we combine the strengths of model-based methods with the flexibility of learning-based methods to obtain a general method that is able to overcome inaccuracies in the robotics perception/actuation pipeline.
arXiv Detail & Related papers (2020-05-21T19:47:05Z) - Learning rewards for robotic ultrasound scanning using probabilistic
temporal ranking [17.494224125794187]
This work considers the inverse problem, where the goal of the task is unknown, and a reward function needs to be inferred from example demonstrations.
Many existing reward inference strategies are unsuited to this class of problems, due to the exploratory nature of the demonstrations.
We formalise this probabilistic temporal ranking approach and show that it improves upon existing approaches.
arXiv Detail & Related papers (2020-02-04T11:58:38Z)