Off-Policy Reinforcement Learning with Loss Function Weighted by
Temporal Difference Error
- URL: http://arxiv.org/abs/2212.13175v1
- Date: Mon, 26 Dec 2022 14:32:16 GMT
- Title: Off-Policy Reinforcement Learning with Loss Function Weighted by
Temporal Difference Error
- Authors: Bumgeun Park, Taeyoung Kim, Woohyeon Moon, Luiz Felipe Vecchietti and
Dongsoo Har
- Abstract summary: Training agents via off-policy deep reinforcement learning (RL) requires a large memory, called the replay memory, that stores past experiences used for learning.
When calculating the loss function, off-policy algorithms assume that all samples are of the same importance.
We propose a novel method that introduces a weighting factor for each experience when calculating the loss function at the learning stage.
- Score: 2.255666468574186
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training agents via off-policy deep reinforcement learning (RL) requires a
large memory, called the replay memory, that stores past experiences used for
learning. These experiences are sampled, uniformly or non-uniformly, to create
the batches used for training. When calculating the loss function, off-policy
algorithms assume that all samples are of the same importance. In this paper,
we hypothesize that training can be enhanced by assigning a different importance
to each experience, based on its temporal-difference (TD) error, directly in
the training objective. We propose a novel method that introduces a weighting
factor for each experience when calculating the loss function at the learning
stage. In addition to improving convergence speed when used with uniform
sampling, the method can be combined with prioritization methods for
non-uniform sampling. This combination improves sampling efficiency while also
increasing the performance of TD-based
off-policy RL algorithms. The effectiveness of the proposed method is
demonstrated by experiments in six environments of the OpenAI Gym suite. The
experimental results show that the proposed method reduces the time to
convergence by 33%~76% in three environments and yields an 11% increase in
returns and a 3%~10% increase in success rate in the other three environments.
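The abstract does not spell out the weighting function, but the core idea, scaling each sample's contribution to the loss by its own TD error, can be illustrated with a short hypothetical PyTorch-style sketch. The power-law weight with exponent `eta`, the batch-mean normalization, and all names below are assumptions for illustration, not the authors' exact formulation; the optional `per_weights` argument marks where the importance-sampling weights of a prioritized replay buffer would combine with the TD-error weights, as suggested in the abstract.

```python
import torch

def weighted_td_loss(q_net, target_net, batch, gamma=0.99, eta=0.5, per_weights=None):
    """DQN-style update in which each sample's squared TD error is scaled by a
    weight derived from its own |TD error| (hypothetical formulation)."""
    obs, actions, rewards, next_obs, dones = batch

    q_values = q_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = target_net(next_obs).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q

    td_errors = targets - q_values

    # Larger |TD error| -> larger contribution to the loss. Weights are detached
    # and normalized so they rescale per-sample gradients without changing targets.
    with torch.no_grad():
        w = td_errors.abs().pow(eta)
        w = w / (w.mean() + 1e-8)
        if per_weights is not None:   # combine with prioritized-replay IS weights
            w = w * per_weights

    return (w * td_errors.pow(2)).mean()
```

In this sketch, uniform sampling corresponds to `per_weights=None`; passing a prioritized buffer's importance-sampling weights corresponds to the combined setting described in the abstract.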
Related papers
- Adaptive teachers for amortized samplers [76.88721198565861]
Amortized inference is the task of training a parametric model, such as a neural network, to approximate a distribution with a given unnormalized density where exact sampling is intractable.
Off-policy RL training facilitates the discovery of diverse, high-reward candidates, but existing methods still face challenges in efficient exploration.
We propose an adaptive training distribution (the Teacher) to guide the training of the primary amortized sampler (the Student) by prioritizing high-loss regions.
arXiv Detail & Related papers (2024-10-02T11:33:13Z)
- Efficient Continual Pre-training by Mitigating the Stability Gap [68.49269649759005]
We study the behavior of Large Language Models (LLMs) during continual pre-training.
We propose three effective strategies to enhance LLM performance within a fixed compute budget.
Our strategies improve the average medical task performance of the OpenLlama-3B model from 36.2% to 40.7% with only 40% of the original training budget.
arXiv Detail & Related papers (2024-06-21T02:28:37Z)
- Grad-Instructor: Universal Backpropagation with Explainable Evaluation Neural Networks for Meta-learning and AutoML [0.0]
An Evaluation Neural Network (ENN) is trained via deep reinforcement learning to predict the performance of the target network.
The ENN then works as an additional evaluation function during backpropagation.
arXiv Detail & Related papers (2024-06-15T08:37:51Z)
- Contrastive-Adversarial and Diffusion: Exploring pre-training and fine-tuning strategies for sulcal identification [3.0398616939692777]
Techniques like adversarial learning, contrastive learning, diffusion denoising learning, and ordinary reconstruction learning have become standard.
The study aims to elucidate the advantages of pre-training techniques and fine-tuning strategies to enhance the learning process of neural networks.
arXiv Detail & Related papers (2024-05-29T15:44:51Z)
- CKD: Contrastive Knowledge Distillation from A Sample-wise Perspective [48.99488315273868]
We present a contrastive knowledge distillation approach, which can be formulated as a sample-wise alignment problem with intra- and inter-sample constraints.
Our method minimizes logit differences within the same sample by considering their numerical values.
We conduct comprehensive experiments on three datasets including CIFAR-100, ImageNet-1K, and MS COCO.
arXiv Detail & Related papers (2024-04-22T11:52:40Z)
- Learning and reusing primitive behaviours to improve Hindsight Experience Replay sample efficiency [7.806014635635933]
We propose a method that reuses primitive behaviours previously learned to solve simple tasks.
This guidance is not provided by a manually designed curriculum; rather, a critic network decides at each timestep whether or not to use the proposed actions (a minimal sketch follows this entry).
We demonstrate that agents can learn a successful policy faster when using our proposed method, both in terms of sample efficiency and computation time.
arXiv Detail & Related papers (2023-10-03T06:49:57Z)
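For the entry above, a minimal sketch of critic-gated reuse of primitive behaviours, assuming a critic `critic(obs, action)` that returns a value estimate; the arg-max rule and all names are illustrative simplifications, not the paper's exact mechanism.

```python
import torch

def gated_action(critic, obs, agent_action, primitive_policies):
    """At each timestep, choose between the agent's proposed action and the
    actions proposed by previously learned primitive policies, using the
    critic's value estimate as the gate (simplified, hypothetical rule)."""
    candidates = [agent_action] + [p(obs) for p in primitive_policies]
    with torch.no_grad():
        values = torch.stack([critic(obs, a) for a in candidates]).flatten()
    return candidates[int(values.argmax().item())]
```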
- Rethinking Population-assisted Off-policy Reinforcement Learning [7.837628433605179]
Off-policy reinforcement learning algorithms struggle with convergence to local optima due to limited exploration.
Population-based algorithms offer a natural exploration strategy, but their black-box operators are inefficient.
Recent algorithms have integrated these two methods, connecting them through a shared replay buffer.
arXiv Detail & Related papers (2023-05-04T15:53:00Z)
- A Data-Centric Approach for Improving Adversarial Training Through the Lens of Out-of-Distribution Detection [0.4893345190925178]
We propose detecting and removing hard samples directly from the training procedure rather than applying complicated algorithms to mitigate their effects.
Our results on the SVHN and CIFAR-10 datasets show the effectiveness of this method in improving adversarial training without adding much computational cost.
arXiv Detail & Related papers (2023-01-25T08:13:50Z)
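A rough sketch of the data-centric recipe in the entry above: score every training sample with an out-of-distribution detector and drop the hardest fraction before adversarial training. The scoring function and the drop ratio are placeholders; the paper's concrete detector is not reproduced here.

```python
import numpy as np

def filter_hard_samples(features, labels, ood_score_fn, drop_ratio=0.05):
    """Drop the fraction of samples an OOD detector flags as hardest and return
    the cleaned arrays (illustrative only). `ood_score_fn` maps one sample to a
    scalar score, with higher meaning more out-of-distribution."""
    scores = np.array([ood_score_fn(x) for x in features])
    cutoff = np.quantile(scores, 1.0 - drop_ratio)
    keep = scores < cutoff
    # Standard (e.g., PGD-based) adversarial training then runs on the kept subset.
    return features[keep], labels[keep]
```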
- Learning Sampling Policy for Faster Derivative Free Optimization [100.27518340593284]
We propose a new reinforcement learning based ZO algorithm (ZO-RL) that learns the sampling policy for generating the perturbations in ZO optimization instead of using random sampling.
Our results show that our ZO-RL algorithm can effectively reduce the variance of the ZO gradient estimate by learning a sampling policy and converge faster than existing ZO algorithms in different scenarios.
arXiv Detail & Related papers (2021-04-09T14:50:59Z)
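The ZO-RL entry above replaces the random perturbation directions of a zeroth-order (ZO) gradient estimator with directions produced by a learned sampling policy. Below is the standard two-point ZO estimator with a `sample_direction` hook marking where such a learned policy would plug in; the hook and its default Gaussian sampling are a generic sketch, not the paper's algorithm.

```python
import numpy as np

def zo_gradient(f, x, mu=1e-2, num_dirs=10, sample_direction=None):
    """Two-point zeroth-order estimate of the gradient of f at x.
    `sample_direction(x)` may be a learned sampling policy; by default the
    directions are drawn from a standard Gaussian (random sampling)."""
    if sample_direction is None:
        sample_direction = lambda x: np.random.randn(*x.shape)
    fx = f(x)
    grad = np.zeros_like(x, dtype=float)
    for _ in range(num_dirs):
        u = sample_direction(x)
        grad += (f(x + mu * u) - fx) / mu * u
    return grad / num_dirs
```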
- Attentional-Biased Stochastic Gradient Descent [74.49926199036481]
We present a provable method (named ABSGD) for addressing the data imbalance or label noise problem in deep learning.
Our method is a simple modification to momentum SGD where we assign an individual importance weight to each sample in the mini-batch.
ABSGD is flexible enough to combine with other robust losses without any additional cost.
arXiv Detail & Related papers (2020-12-13T03:41:52Z)
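A compact sketch of the per-sample weighting described in the ABSGD entry above: each example in the mini-batch receives an individual weight derived from its own loss (here a softmax over scaled losses, which is one attentional choice and only an assumption), and the weighted loss is optimized with ordinary momentum SGD.

```python
import torch
import torch.nn.functional as F

def weighted_sgd_step(model, optimizer, inputs, targets, lam=1.0):
    """One update with individually weighted per-sample losses (illustrative).
    `optimizer` is expected to be torch.optim.SGD(..., momentum=0.9)."""
    logits = model(inputs)
    per_sample = F.cross_entropy(logits, targets, reduction="none")

    # Attention-style weights: larger-loss samples get larger weights. The
    # temperature `lam` and the softmax form are assumptions, not ABSGD's exact rule.
    with torch.no_grad():
        weights = torch.softmax(per_sample / lam, dim=0) * per_sample.numel()

    loss = (weights * per_sample).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```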
- Variance-Reduced Off-Policy Memory-Efficient Policy Search [61.23789485979057]
Off-policy policy optimization is a challenging problem in reinforcement learning.
Off-policy algorithms are memory-efficient and capable of learning from off-policy samples.
arXiv Detail & Related papers (2020-09-14T16:22:46Z)
- Experience Replay with Likelihood-free Importance Weights [123.52005591531194]
We propose to reweight experiences based on their likelihood under the stationary distribution of the current policy.
We apply the proposed approach empirically to two competitive methods, Soft Actor-Critic (SAC) and Twin Delayed Deep Deterministic policy gradient (TD3).
arXiv Detail & Related papers (2020-06-23T17:17:44Z)
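The last entry reweights stored experiences by how likely they are under the stationary distribution of the current policy. A minimal sketch, assuming a density-ratio estimate is already available (the likelihood-free estimator itself is not reproduced), of how such weights would rescale a SAC/TD3-style critic loss:

```python
import torch

def reweighted_critic_loss(critic, target_q, batch, density_ratio):
    """Critic regression loss in which each replayed transition is rescaled by an
    estimate of d_pi(s, a) / d_buffer(s, a) (sketch; the estimator is assumed)."""
    obs, actions = batch["obs"], batch["actions"]
    q = critic(obs, actions).squeeze(-1)
    with torch.no_grad():
        w = density_ratio(obs, actions)     # likelihood-free importance weights
        w = w / (w.mean() + 1e-8)           # normalize within the mini-batch
    return (w * (q - target_q).pow(2)).mean()
```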
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.