Optimizing the Long-Term Behaviour of Deep Reinforcement Learning for
Pushing and Grasping
- URL: http://arxiv.org/abs/2204.03487v1
- Date: Thu, 7 Apr 2022 15:02:44 GMT
- Title: Optimizing the Long-Term Behaviour of Deep Reinforcement Learning for
Pushing and Grasping
- Authors: Rodrigo Chau
- Abstract summary: We investigate the capabilities of two systems to learn long-term rewards and policies.
Ewerton et al. attain their best performance using an agent which only takes the most immediate action under consideration.
We show that combining the Hourglass models with Double Q-Learning enables the models to accurately predict long-term action sequences when trained with large discount factors.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We investigate the "Visual Pushing for Grasping" (VPG) system by Zeng et al.
and the "Hourglass" system by Ewerton et al., an evolution of the former. The
focus of our work is the investigation of the capabilities of both systems to
learn long-term rewards and policies. Zeng et al.'s original task only needs a
limited amount of foresight. Ewerton et al. attain their best performance using
an agent which only takes the most immediate action under consideration. We are
interested in the ability of their models and training algorithms to accurately
predict long-term Q-Values. To evaluate this ability, we design a new bin
sorting task and reward function. Our task requires agents to accurately
estimate future rewards and therefore use high discount factors in their
Q-Value calculation. We investigate the behaviour of an adaptation of the VPG
training algorithm on our task. We show that this adaptation cannot accurately
predict the required long-term action sequences. In addition to the limitations
identified by Ewerton et al., it suffers from the known Deep Q-Learning problem
of overestimated Q-Values. In an effort to solve our task, we turn to the
Hourglass models and combine them with the Double Q-Learning approach. We show
that this approach enables the models to accurately predict long-term action
sequences when trained with large discount factors. Our results show that the
Double Q-Learning technique is essential for training with very high discount
factors, as the models' Q-Value predictions diverge otherwise. We also
experiment with different approaches for discount factor scheduling, loss
calculation and exploration procedures. Our results show that the latter
factors do not visibly influence the model's performance for our task.
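The abstract's contrast between the standard Deep Q-Learning target and the Double Q-Learning target can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the flat `(batch, actions)` Q-value layout, the example numbers, and the discount factor of 0.9 are all assumptions made for the example (the VPG and Hourglass models predict pixel-wise Q-value maps, which are not reproduced here).

```python
import numpy as np

def dqn_target(reward, next_q_target, gamma, done):
    """Standard target: the target network both selects and evaluates the next action."""
    return reward + gamma * (1.0 - done) * next_q_target.max(axis=1)

def double_q_target(reward, next_q_online, next_q_target, gamma, done):
    """Double Q-Learning target: the online network selects, the target network evaluates."""
    best_action = next_q_online.argmax(axis=1)
    evaluated = next_q_target[np.arange(len(best_action)), best_action]
    return reward + gamma * (1.0 - done) * evaluated

# Hypothetical batch of two transitions with three actions each.
reward = np.array([0.0, 1.0])
done = np.array([0.0, 1.0])          # second transition is terminal
next_q_online = np.array([[1.0, 2.0, 0.5],
                          [0.2, 0.1, 0.3]])
next_q_target = np.array([[3.0, 1.5, 0.0],
                          [0.4, 0.6, 0.1]])
gamma = 0.9                           # assumed discount factor

y_dqn = dqn_target(reward, next_q_target, gamma, done)
y_double = double_q_target(reward, next_q_online, next_q_target, gamma, done)
print(y_dqn, y_double)
```

Because the evaluated value can never exceed the maximum over the target network's estimates, the Double Q-Learning target is never larger than the standard one for the same transition. With a discount factor close to 1, any upward bias in the target is propagated across many steps, which is consistent with the abstract's observation that Q-Value predictions diverge without Double Q-Learning.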
Related papers
- Towards Adapting Reinforcement Learning Agents to New Tasks: Insights from Q-Values [8.694989771294013]
Policy gradient methods can still be useful in many domains as long as we can wrangle with how to exploit them in a sample efficient way.
We explore the chaotic nature of DQNs in reinforcement learning, while understanding how the information that they retain when trained can be repurposed for adapting a model to different tasks.
arXiv Detail & Related papers (2024-07-14T21:28:27Z)
- Modeling of learning curves with applications to pos tagging [0.27624021966289597]
We introduce an algorithm to estimate the evolution of learning curves on the whole of a training data base.
We approximate iteratively the sought value at the desired time, independently of the learning technique used.
The proposal proves to be formally correct with respect to our working hypotheses and includes a reliable proximity condition.
arXiv Detail & Related papers (2024-02-04T15:00:52Z)
- VQC-Based Reinforcement Learning with Data Re-uploading: Performance and Trainability [0.8192907805418583]
Reinforcement Learning (RL) consists of designing agents that make intelligent decisions without human supervision.
Deep Q-Learning, an RL algorithm that uses deep neural networks, achieved super-human performance in some specific tasks.
It is also possible to use Variational Quantum Circuits (VQCs) as function approximators in RL algorithms.
arXiv Detail & Related papers (2024-01-21T18:00:15Z)
- Understanding, Predicting and Better Resolving Q-Value Divergence in Offline-RL [86.0987896274354]
We first identify a fundamental pattern, self-excitation, as the primary cause of Q-value estimation divergence in offline RL.
We then propose a novel Self-Excite Eigenvalue Measure (SEEM) metric to measure the evolving property of Q-network at training.
For the first time, our theory can reliably decide whether the training will diverge at an early stage.
arXiv Detail & Related papers (2023-10-06T17:57:44Z)
- Value-Distributional Model-Based Reinforcement Learning [59.758009422067]
Quantifying uncertainty about a policy's long-term performance is important to solve sequential decision-making tasks.
We study the problem from a model-based Bayesian reinforcement learning perspective.
We propose Epistemic Quantile-Regression (EQR), a model-based algorithm that learns a value distribution function.
arXiv Detail & Related papers (2023-08-12T14:59:19Z)
- Goal-Conditioned Q-Learning as Knowledge Distillation [136.79415677706612]
We explore a connection between off-policy reinforcement learning in goal-conditioned settings and knowledge distillation.
We empirically show that this can improve the performance of goal-conditioned off-policy reinforcement learning when the space of goals is high-dimensional.
We also show that this technique can be adapted to allow for efficient learning in the case of multiple simultaneous sparse goals.
arXiv Detail & Related papers (2022-08-28T22:01:10Z)
- Simultaneous Double Q-learning with Conservative Advantage Learning for Actor-Critic Methods [133.85604983925282]
We propose Simultaneous Double Q-learning with Conservative Advantage Learning (SDQ-CAL).
Our algorithm realizes less biased value estimation and achieves state-of-the-art performance in a range of continuous control benchmark tasks.
arXiv Detail & Related papers (2022-05-08T09:17:16Z)
- Hindsight Experience Replay with Kronecker Product Approximate Curvature [5.441932327359051]
Hindsight Experience Replay (HER) is one of the efficient algorithms for solving Reinforcement Learning tasks.
However, due to its reduced sample efficiency and slower convergence, HER fails to perform effectively.
Natural gradients address these challenges by converging the model parameters better.
Our proposed method addresses the above-mentioned challenges with better sample efficiency, faster convergence and an increased success rate.
arXiv Detail & Related papers (2020-10-09T20:25:14Z)
- Harvesting and Refining Question-Answer Pairs for Unsupervised QA [95.9105154311491]
We introduce two approaches to improve unsupervised Question Answering (QA).
First, we harvest lexically and syntactically divergent questions from Wikipedia to automatically construct a corpus of question-answer pairs (named RefQA).
Second, we take advantage of the QA model to extract more appropriate answers, which iteratively refines data over RefQA.
arXiv Detail & Related papers (2020-05-06T15:56:06Z)
- Hierarchical Reinforcement Learning as a Model of Human Task Interleaving [60.95424607008241]
We develop a hierarchical model of supervisory control driven by reinforcement learning.
The model reproduces known empirical effects of task interleaving.
The results support hierarchical RL as a plausible model of task interleaving.
arXiv Detail & Related papers (2020-01-04T17:53:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.