Self-Imitation Learning via Generalized Lower Bound Q-learning
- URL: http://arxiv.org/abs/2006.07442v3
- Date: Sun, 14 Feb 2021 00:06:01 GMT
- Title: Self-Imitation Learning via Generalized Lower Bound Q-learning
- Authors: Yunhao Tang
- Abstract summary: Self-imitation learning motivated by lower-bound Q-learning is a novel and effective approach for off-policy learning.
We propose an n-step lower bound that generalizes the original return-based lower-bound Q-learning.
We show that n-step lower bound Q-learning is a more robust alternative to return-based self-imitation learning and uncorrected n-step Q-learning.
- Score: 23.65188248947536
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-imitation learning motivated by lower-bound Q-learning is a novel and
effective approach for off-policy learning. In this work, we propose an n-step
lower bound which generalizes the original return-based lower-bound Q-learning,
and introduce a new family of self-imitation learning algorithms. To provide a
formal motivation for the potential performance gains provided by
self-imitation learning, we show that n-step lower bound Q-learning achieves a
trade-off between fixed point bias and contraction rate, drawing close
connections to the popular uncorrected n-step Q-learning. We finally show that
n-step lower bound Q-learning is a more robust alternative to return-based
self-imitation learning and uncorrected n-step Q-learning, over a wide range of continuous
control benchmark tasks.
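To make the lower-bound idea concrete, here is a minimal sketch (not the paper's exact formulation) of an n-step lower-bound self-imitation loss: the target is an n-step return plus a bootstrap from the learned Q-function, and only underestimation relative to that target is penalized, i.e. max(target - Q, 0) squared. The trajectory format and all names below are illustrative assumptions.
```python
import numpy as np

def n_step_lower_bound_loss(q_fn, trajectory, gamma=0.99, n=5):
    """Sketch of an n-step lower-bound (self-imitation) Q-learning loss.

    `trajectory` is a list of (state, action, reward) tuples from one stored
    episode; `q_fn(s)` returns a vector of Q-value estimates over actions.
    Terminal/truncation handling is simplified; signatures are illustrative.
    """
    losses = []
    T = len(trajectory)
    for t in range(T):
        horizon = min(n, T - t)
        # n-step return: discounted rewards for up to n steps ...
        ret = sum(gamma ** k * trajectory[t + k][2] for k in range(horizon))
        # ... plus a bootstrap from the learned Q-function when the trajectory
        # continues beyond the n-step horizon.
        if t + horizon < T:
            boot_state = trajectory[t + horizon][0]
            ret += gamma ** horizon * np.max(q_fn(boot_state))
        s, a = trajectory[t][0], trajectory[t][1]
        # Lower-bound regression: penalize only underestimation of the target.
        gap = max(ret - q_fn(s)[a], 0.0)
        losses.append(gap ** 2)
    return float(np.mean(losses))
```
With n equal to the episode length and no bootstrap, the target reduces to the full return used by the original return-based self-imitation learning; the main difference from uncorrected n-step Q-learning is the clipping at zero, which keeps the target a lower bound.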
Related papers
- Is Q-learning an Ill-posed Problem? [2.4424095531386234]
This paper investigates the instability of Q-learning in continuous environments.
We show that even in relatively simple benchmarks, the fundamental task of Q-learning can be inherently ill-posed and prone to failure.
arXiv Detail & Related papers (2025-02-20T08:42:30Z)
- Online inductive learning from answer sets for efficient reinforcement learning exploration [52.03682298194168]
We exploit inductive learning of answer set programs to learn a set of logical rules representing an explainable approximation of the agent policy.
We then perform answer set reasoning on the learned rules to guide the exploration of the learning agent at the next batch.
Our methodology produces a significant boost in the discounted return achieved by the agent, even in the first batches of training.
arXiv Detail & Related papers (2025-01-13T16:13:22Z)
- Temporal-Difference Variational Continual Learning [89.32940051152782]
A crucial capability of Machine Learning models in real-world applications is the ability to continuously learn new tasks.
In Continual Learning settings, models often struggle to balance learning new tasks with retaining previous knowledge.
We propose new learning objectives that integrate the regularization effects of multiple previous posterior estimations.
arXiv Detail & Related papers (2024-10-10T10:58:41Z)
- Unconditional Truthfulness: Learning Conditional Dependency for Uncertainty Quantification of Large Language Models [96.43562963756975]
We train a regression model whose target variable is the gap between the conditional and the unconditional generation confidence.
We use this learned conditional dependency model to modulate the uncertainty of the current generation step based on the uncertainty of the previous step.
arXiv Detail & Related papers (2024-08-20T09:42:26Z)
- Self-Paced Absolute Learning Progress as a Regularized Approach to Curriculum Learning [4.054285623919103]
Curricula based on Absolute Learning Progress (ALP) have proven successful in different environments, but waste computation on repeating already learned behaviour in new tasks.
We solve this problem by introducing a new regularization method based on Self-Paced (Deep) Learning, called Self-Paced Absolute Learning Progress (SPALP).
Our method achieves performance comparable to the original ALP in all cases, and reaches it more quickly than ALP in two of them.
arXiv Detail & Related papers (2023-06-09T09:17:51Z)
- VA-learning as a more efficient alternative to Q-learning [49.526579981437315]
We introduce VA-learning, which directly learns the advantage function and the value function using bootstrapping.
VA-learning learns off-policy and enjoys similar theoretical guarantees to Q-learning.
Thanks to directly learning the advantage and value functions, VA-learning improves sample efficiency over Q-learning.
arXiv Detail & Related papers (2023-05-29T15:44:47Z)
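As a rough tabular illustration of the VA-learning entry above (an assumption about the mechanism, not the paper's exact update rule), the Q-function can be represented implicitly as Q(s, a) = V(s) + A(s, a) and both tables moved toward a bootstrapped target:
```python
import numpy as np

def va_learning_step(V, A, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Hedged tabular sketch: Q is implicit as V[s] + A[s, a], and both the
    value and advantage tables are corrected toward a bootstrapped target.
    The paper's precise allocation of the update between V and A may differ."""
    q_next = V[s_next] + A[s_next]               # implicit Q-values at the next state
    target = r + gamma * np.max(q_next)          # one-step bootstrapped target
    td_error = target - (V[s] + A[s, a])
    V[s] += alpha * td_error                     # move the value estimate toward the target
    A[s, a] += alpha * td_error                  # move the advantage of the taken action as well
    return td_error
```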
- Online Target Q-learning with Reverse Experience Replay: Efficiently finding the Optimal Policy for Linear MDPs [50.75812033462294]
We bridge the gap between the practical success of Q-learning and pessimistic theoretical results.
We present the novel methods Q-Rex and Q-RexDaRe.
We show that Q-Rex efficiently finds the optimal policy for linear MDPs.
arXiv Detail & Related papers (2021-10-16T01:47:41Z)
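A minimal tabular sketch of the two ingredients named in the title above, target-based Q-learning combined with replaying stored transitions in reverse order; the linear function approximation, synchronisation schedule, and Q-RexDaRe's data reuse from the paper are not reproduced here.
```python
import numpy as np

def reverse_replay_pass(Q_online, Q_target, buffer, alpha=0.5, gamma=0.99):
    """Replay stored transitions newest-to-oldest against a frozen target table,
    then sync the target, loosely mirroring 'online target Q-learning with
    reverse experience replay'. `buffer` holds (s, a, r, s_next) tuples in the
    order they were collected; tabular arrays are used for clarity."""
    for (s, a, r, s_next) in reversed(buffer):
        target = r + gamma * np.max(Q_target[s_next])
        Q_online[s, a] += alpha * (target - Q_online[s, a])
    Q_target[:] = Q_online          # refresh the frozen target after the reverse pass
    return Q_online
```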
- Cross Learning in Deep Q-Networks [82.20059754270302]
We propose a novel cross Q-learning algorithm aimed at alleviating the well-known overestimation problem in value-based reinforcement learning methods.
Our algorithm builds on double Q-learning, maintaining a set of parallel models and estimating the Q-value based on a randomly selected network.
arXiv Detail & Related papers (2020-09-29T04:58:17Z)
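One plausible reading of the ensemble mechanism described in the entry above, sketched in tabular form: each update bootstraps from a different, randomly selected member, so no table uses its own maximum as its target. How the paper actually selects and combines networks may differ.
```python
import numpy as np

def cross_q_update(Q_ensemble, s, a, r, s_next, alpha=0.1, gamma=0.99, rng=np.random):
    """Hedged sketch of a cross Q-learning step over a list of at least two
    tabular Q arrays of shape [n_states, n_actions]."""
    i = rng.randint(len(Q_ensemble))                      # member to update this step
    k = rng.choice([j for j in range(len(Q_ensemble)) if j != i])
    # Bootstrap from a randomly selected *other* member, extending the
    # decoupling idea of double Q-learning to an ensemble.
    target = r + gamma * np.max(Q_ensemble[k][s_next])
    Q_ensemble[i][s, a] += alpha * (target - Q_ensemble[i][s, a])
    return Q_ensemble
```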
- Periodic Q-Learning [24.099046883918046]
We study the so-called periodic Q-learning algorithm (PQ-learning for short).
PQ-learning maintains two separate Q-value estimates: the online estimate and the target estimate.
In contrast to standard Q-learning, PQ-learning enjoys a simple finite-time analysis and achieves better sample complexity for finding an epsilon-optimal policy.
arXiv Detail & Related papers (2020-02-23T00:33:13Z)
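The two-estimate structure described in the entry above can be sketched as follows; `env_step` is an assumed data-collection helper that returns one transition, and the paper's exact update and synchronisation rule may differ.
```python
import numpy as np

def periodic_q_learning(env_step, n_states, n_actions,
                        steps=10_000, period=100, alpha=0.1, gamma=0.99):
    """Hedged sketch: the online table is updated against a frozen target table,
    and the target is refreshed only every `period` steps."""
    Q_online = np.zeros((n_states, n_actions))
    Q_target = np.zeros((n_states, n_actions))
    for t in range(steps):
        s, a, r, s_next = env_step(Q_online)            # assumed helper, e.g. epsilon-greedy rollout
        target = r + gamma * np.max(Q_target[s_next])   # bootstrap from the frozen estimate
        Q_online[s, a] += alpha * (target - Q_online[s, a])
        if (t + 1) % period == 0:
            Q_target = Q_online.copy()                  # periodic synchronisation
    return Q_online
```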
This list is automatically generated from the titles and abstracts of the papers on this site.