Bayesian Q-learning With Imperfect Expert Demonstrations
- URL: http://arxiv.org/abs/2210.01800v1
- Date: Sat, 1 Oct 2022 17:38:19 GMT
- Title: Bayesian Q-learning With Imperfect Expert Demonstrations
- Authors: Fengdi Che, Xiru Zhu, Doina Precup, David Meger, and Gregory Dudek
- Abstract summary: We propose a novel algorithm to speed up Q-learning with the help of a limited amount of imperfect expert demonstrations.
We evaluate our approach on a sparse-reward chain environment and six more complicated Atari games with delayed rewards.
- Score: 56.55609745121237
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Guided exploration with expert demonstrations improves data efficiency for
reinforcement learning, but current algorithms often overuse expert
information. We propose a novel algorithm to speed up Q-learning with the help
of a limited amount of imperfect expert demonstrations. The algorithm avoids
excessive reliance on expert data by relaxing the optimal expert assumption and
gradually reducing the usage of uninformative expert data. Experimentally, we
evaluate our approach on a sparse-reward chain environment and six more
complicated Atari games with delayed rewards. With the proposed methods, we can
achieve better results than Deep Q-learning from Demonstrations (Hester et al.,
2017) in most environments.
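The abstract describes the high-level mechanism rather than the full Bayesian derivation. As a rough, hypothetical illustration of one ingredient (gradually reducing reliance on expert data), the tabular sketch below mixes a standard Q-learning update with a DQfD-style demonstration nudge whose weight decays over time; the environment, decay schedule, and hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Hypothetical tabular sketch: ordinary Q-learning plus a demonstration term
# whose weight lam decays over time, mimicking the idea of gradually reducing
# reliance on (possibly imperfect) expert data.  All values are illustrative.

n_states, n_actions = 10, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99
margin = 1.0                      # DQfD-style demonstration margin

def demo_nudge(Q, s, a_expert, lam):
    """Push Q(s, a_expert) above the other actions by a margin, scaled by lam."""
    best_other = np.max(np.delete(Q[s], a_expert))
    violation = max(0.0, best_other + margin - Q[s, a_expert])
    Q[s, a_expert] += lam * alpha * violation

def q_update(Q, s, a, r, s_next, done):
    """One-step Q-learning (TD) update."""
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

demos = [(0, 1), (1, 1), (2, 0)]              # (state, expert_action) pairs
for t in range(1000):
    lam = 1.0 / (1.0 + 0.01 * t)              # illustrative decay schedule
    s = np.random.randint(n_states)
    a = np.random.randint(n_actions)          # stand-in for an eps-greedy policy
    s_next, r, done = (s + 1) % n_states, float(s == n_states - 1), False
    q_update(Q, s, a, r, s_next, done)
    for s_d, a_d in demos:
        demo_nudge(Q, s_d, a_d, lam)
```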
Related papers
- Extrinsicaly Rewarded Soft Q Imitation Learning with Discriminator [0.0]
Supervised learning methods such as Behavioral Cloning do not require environment interaction to collect data, but they usually suffer from distribution shift.
Soft Q imitation learning (SQIL) addressed this problem, and it was shown to learn efficiently by combining Behavioral Cloning with soft Q-learning using constant rewards.
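As a rough illustration of SQIL's constant-reward idea (a minimal sketch under simplified assumptions, not the authors' code): expert transitions are stored with reward 1, the agent's own transitions with reward 0, and both buffers are sampled in equal proportion for an ordinary soft Q-learning update.

```python
import random

def make_buffer(transitions, constant_reward):
    """Relabel every (s, a, s_next, done) transition with a constant reward."""
    return [(s, a, constant_reward, s2, d) for (s, a, s2, d) in transitions]

# Toy transitions: (state, action, next_state, done); contents are illustrative.
expert_transitions = [(0, 1, 1, False), (1, 1, 2, True)]
online_transitions = [(0, 0, 0, False), (0, 1, 1, False)]

expert_buffer = make_buffer(expert_transitions, constant_reward=1.0)  # r = +1
online_buffer = make_buffer(online_transitions, constant_reward=0.0)  # r = 0

def sample_batch(batch_size=4):
    """SQIL-style sampling: expert and online data in equal proportion."""
    half = batch_size // 2
    return (random.choices(expert_buffer, k=half)
            + random.choices(online_buffer, k=half))

batch = sample_batch()   # fed to any soft Q-learning / SAC-style update
```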
arXiv Detail & Related papers (2024-01-30T06:22:19Z)
- Accelerating Exploration with Unlabeled Prior Data [66.43995032226466]
We study how prior data without reward labels may be used to guide and accelerate exploration for an agent solving a new sparse reward task.
We propose a simple approach that learns a reward model from online experience, labels the unlabeled prior data with optimistic rewards, and then uses the relabeled data alongside the online data for downstream policy and critic optimization.
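A minimal sketch of that optimistic-relabelling idea, assuming a toy linear reward model and an ensemble-based bonus (all names and data below are illustrative, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
online_obs = rng.normal(size=(100, 4))     # online observations with rewards
online_rew = online_obs[:, 0]              # toy ground-truth reward signal
prior_obs = rng.normal(size=(500, 4))      # reward-free prior dataset

def fit_linear_reward(obs, rew):
    """Least-squares reward model; stands in for a learned network."""
    w, *_ = np.linalg.lstsq(obs, rew, rcond=None)
    return w

# A small bootstrap ensemble gives a crude uncertainty estimate for the bonus.
ensemble = []
for _ in range(5):
    idx = rng.integers(0, len(online_obs), size=len(online_obs))
    ensemble.append(fit_linear_reward(online_obs[idx], online_rew[idx]))

preds = np.stack([prior_obs @ w for w in ensemble])             # (ensemble, N)
optimistic_rew = preds.mean(axis=0) + 1.0 * preds.std(axis=0)   # mean + bonus

# (prior_obs, optimistic_rew) can now be mixed with online data for
# downstream policy and critic optimization.
```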
arXiv Detail & Related papers (2023-11-09T00:05:17Z)
- Leveraging Demonstrations to Improve Online Learning: Quality Matters [54.98983862640944]
We show that the degree of improvement must depend on the quality of the demonstration data.
We propose an informed TS algorithm that utilizes the demonstration data in a coherent way through Bayes' rule.
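A minimal sketch of how demonstration data can enter Thompson Sampling through Bayes' rule, assuming a Bernoulli bandit with conjugate Beta priors (an illustrative setup, not necessarily the paper's setting):

```python
import numpy as np

rng = np.random.default_rng(1)
true_means = np.array([0.3, 0.6, 0.5])     # hidden arm reward probabilities
n_arms = len(true_means)

alpha = np.ones(n_arms)                    # Beta(1, 1) uninformative prior
beta = np.ones(n_arms)

# Demonstration data: (arm, observed_reward) pairs from an expert of unknown
# quality; conditioning on them is a conjugate Beta update (Bayes' rule).
demos = [(1, 1), (1, 1), (1, 0), (2, 1)]
for arm, r in demos:
    alpha[arm] += r
    beta[arm] += 1 - r

# Standard online Thompson Sampling continues from the informed prior.
for t in range(1000):
    theta = rng.beta(alpha, beta)          # sample a mean for each arm
    arm = int(np.argmax(theta))
    r = rng.random() < true_means[arm]
    alpha[arm] += r
    beta[arm] += 1 - r
```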
arXiv Detail & Related papers (2023-02-07T08:49:12Z)
- On Covariate Shift of Latent Confounders in Imitation and Reinforcement Learning [69.48387059607387]
We consider the problem of using expert data with unobserved confounders for imitation and reinforcement learning.
We analyze the limitations of learning from confounded expert data with and without external reward.
We validate our claims empirically on challenging assistive healthcare and recommender system simulation tasks.
arXiv Detail & Related papers (2021-10-13T07:31:31Z)
- Low-Regret Active learning [64.36270166907788]
We develop an online learning algorithm for identifying unlabeled data points that are most informative for training.
At the core of our work is an efficient algorithm for sleeping experts that is tailored to achieve low regret on predictable (easy) instances.
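For intuition, a bare-bones sleeping-experts style multiplicative-weights update (an illustrative sketch, not the paper's tailored algorithm): only the experts that are awake in a round vote, and only their weights are updated.

```python
import numpy as np

rng = np.random.default_rng(2)
n_experts, eta = 5, 0.5
weights = np.ones(n_experts)

for t in range(100):
    awake = rng.random(n_experts) < 0.7        # which experts are available
    if not awake.any():
        continue
    advice = rng.integers(0, 2, n_experts)     # each expert's 0/1 prediction
    w = weights * awake
    prediction = int(np.round(np.average(advice, weights=w)))  # weighted vote
    truth = rng.integers(0, 2)
    losses = (advice != truth).astype(float)
    # Multiplicative update restricted to the experts that were awake.
    weights[awake] *= np.exp(-eta * losses[awake])
```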
arXiv Detail & Related papers (2021-04-06T22:53:45Z)
- Learning Dexterous Manipulation from Suboptimal Experts [69.8017067648129]
Relative Entropy Q-Learning (REQ) is a simple policy iteration algorithm that combines ideas from successful offline and conventional RL algorithms.
We show how REQ is also effective for general off-policy RL, offline RL, and RL from demonstrations.
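A minimal sketch of the KL-regularized (relative-entropy) policy improvement step that this family of methods builds on, with toy numbers and an assumed temperature rather than values from the paper:

```python
import numpy as np

def kl_regularized_policy(q_values, prior, temperature=1.0):
    """New policy = prior reweighted by exp(Q / temperature), renormalized.

    Keeping the prior inside the logits keeps the result close (in KL) to the
    prior/behaviour policy while preferring high-value actions.
    """
    logits = np.log(prior + 1e-8) + q_values / temperature
    logits -= logits.max()                      # numerical stability
    p = np.exp(logits)
    return p / p.sum()

q_values = np.array([0.1, 0.9, 0.4])            # critic estimates for 3 actions
prior = np.array([0.5, 0.25, 0.25])             # e.g. a behaviour/expert policy

new_policy = kl_regularized_policy(q_values, prior, temperature=0.5)
```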
arXiv Detail & Related papers (2020-10-16T18:48:49Z)
- A Review of Meta-level Learning in the Context of Multi-component, Multi-level Evolving Prediction Systems [6.810856082577402]
The exponential growth in the volume, variety and velocity of data is increasing the need for automated or semi-automated ways to extract useful patterns from data.
Finding the most appropriate learning method for a given problem requires deep expert knowledge and extensive computational resources.
There is a need for an intelligent recommendation engine that can advise which learning algorithm is best suited to a given dataset.
arXiv Detail & Related papers (2020-07-17T14:14:37Z)
- Discriminator Soft Actor Critic without Extrinsic Rewards [0.30586855806896046]
It is difficult to imitate well in unseen states when only a small amount of expert and sampled data is available.
We propose Discriminator Soft Actor Critic (DSAC) to make soft Q imitation learning (SQIL) more robust to distribution shift.
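A rough sketch of the discriminator-as-reward idea (illustrative only, not the paper's implementation): a classifier scores how expert-like a state-action pair is, and that score replaces the constant reward, e.g. via an AIRL-style log-ratio.

```python
import numpy as np

def discriminator(state_action):
    """Stand-in for a trained classifier returning P(expert | s, a)."""
    return 1.0 / (1.0 + np.exp(-state_action.sum()))   # toy logistic score

def discriminator_reward(state_action, eps=1e-8):
    """AIRL-style reward derived from the discriminator output."""
    d = discriminator(state_action)
    return np.log(d + eps) - np.log(1.0 - d + eps)

sample = np.array([0.2, -0.1, 0.4])     # toy (state, action) features
r = discriminator_reward(sample)        # used in place of a constant reward
```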
arXiv Detail & Related papers (2020-01-19T10:45:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.