ABPT: Amended Backpropagation through Time with Partially Differentiable Rewards
- URL: http://arxiv.org/abs/2501.14513v1
- Date: Fri, 24 Jan 2025 14:18:22 GMT
- Title: ABPT: Amended Backpropagation through Time with Partially Differentiable Rewards
- Authors: Fanxing Li, Fangyu Sun, Tianbao Zhang, Danping Zou
- Abstract summary: Partially differentiable rewards will result in biased gradient propagation that degrades training performance.
We propose Amended Backpropagation-through-Time (ABPT), a novel approach that mitigates gradient bias while preserving the training efficiency of BPTT.
ABPT combines 0-step and N-step returns, effectively reducing the bias by leveraging value gradients from the learned Q-value function.
- Score: 3.1986315488647588
- Abstract: Using the exact gradients of the rewards to directly optimize policy parameters via backpropagation-through-time (BPTT) enables high training performance for quadrotor tasks. However, designing a fully differentiable reward architecture is often challenging. Partially differentiable rewards will result in biased gradient propagation that degrades training performance. To overcome this limitation, we propose Amended Backpropagation-through-Time (ABPT), a novel approach that mitigates gradient bias while preserving the training efficiency of BPTT. ABPT combines 0-step and N-step returns, effectively reducing the bias by leveraging value gradients from the learned Q-value function. Additionally, it adopts entropy regularization and state initialization mechanisms to encourage exploration during training. We evaluate ABPT on four representative quadrotor flight tasks. Experimental results demonstrate that ABPT converges significantly faster and achieves higher ultimate rewards than existing learning algorithms, particularly in tasks involving partially differentiable rewards.
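A minimal sketch of the amended objective as the abstract describes it, assuming a convex mixing weight `alpha` between the 0-step return Q(s0, a0) and the N-step return obtained by BPTT through the rollout. The toy dynamics, reward, and network sizes are illustrative stand-ins, not the paper's implementation; the entropy regularization and state-initialization mechanisms are omitted.

```python
import torch
import torch.nn as nn

state_dim, act_dim, N, gamma, alpha = 4, 2, 8, 0.99, 0.5
W = torch.randn(act_dim, state_dim) * 0.1        # toy dynamics parameters

policy = nn.Sequential(nn.Linear(state_dim, 32), nn.Tanh(), nn.Linear(32, act_dim))
q_net = nn.Sequential(nn.Linear(state_dim + act_dim, 32), nn.Tanh(), nn.Linear(32, 1))

def step(s, a):                      # stand-in for a differentiable simulator step
    return s + torch.tanh(a) @ W

def reward(s, a):                    # stand-in for a (partially) differentiable reward
    return -(s ** 2).sum(-1) - 0.01 * (a ** 2).sum(-1)

s0 = torch.randn(16, state_dim)      # batch of initial states
a0 = policy(s0)
zero_step = q_net(torch.cat([s0, a0], -1)).squeeze(-1)   # 0-step return: Q(s0, a0)

s, n_step, disc = s0, 0.0, 1.0
for _ in range(N):                   # BPTT through the N-step rollout
    a = policy(s)
    n_step = n_step + disc * reward(s, a)
    s = step(s, a)
    disc *= gamma
n_step = n_step + disc * q_net(torch.cat([s, policy(s)], -1)).squeeze(-1)

# Amended objective: mixing in the 0-step value gradient tempers the bias
# that non-differentiable reward terms inject into the pure BPTT return.
actor_loss = -((1 - alpha) * zero_step + alpha * n_step).mean()
actor_loss.backward()
```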
Related papers
- SALE-Based Offline Reinforcement Learning with Ensemble Q-Networks [0.0]
We propose a model-free actor-critic algorithm that integrates ensemble Q-networks and a gradient diversity penalty from EDAC.
Our algorithm achieves faster convergence, greater stability, and higher performance than existing methods.
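As a rough illustration of the ingredients this summary names, the sketch below builds an ensemble critic with an EDAC-style gradient diversity penalty (mean pairwise cosine similarity of the per-member action-gradients). The sizes, the penalty normalization, and the pessimistic min-aggregation are assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

state_dim, act_dim, M = 4, 2, 5      # M ensemble members
critics = nn.ModuleList([
    nn.Sequential(nn.Linear(state_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
    for _ in range(M)
])

s = torch.randn(32, state_dim)
a = torch.randn(32, act_dim, requires_grad=True)
qs = torch.stack([c(torch.cat([s, a], -1)).squeeze(-1) for c in critics])  # (M, B)

# Gradients of each member's Q w.r.t. the action, kept in the graph so the
# penalty itself can be backpropagated into the critics.
grads = torch.stack([torch.autograd.grad(q.sum(), a, create_graph=True)[0]
                     for q in qs])                                         # (M, B, act)
g = grads / (grads.norm(dim=-1, keepdim=True) + 1e-8)
cos = torch.einsum('mbd,nbd->bmn', g, g)          # pairwise cosine similarities
off_diag = cos - torch.eye(M).unsqueeze(0)        # drop self-similarity
diversity_penalty = off_diag.sum(dim=(1, 2)).mean() / (M * (M - 1))

# Critic training would add eta * diversity_penalty to the TD loss and use a
# pessimistic target such as the ensemble minimum:
q_pessimistic = qs.min(dim=0).values
```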
arXiv Detail & Related papers (2025-01-07T10:22:30Z)
- Fast and Unified Path Gradient Estimators for Normalizing Flows [5.64979077798699]
Path gradient estimators for normalizing flows have lower variance than standard estimators for variational inference.
We propose a fast path gradient estimator which improves computational efficiency significantly.
We empirically establish its superior performance and reduced variance for several natural sciences applications.
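The variance reduction rests on keeping only the "path" (reparameterized) term of the gradient. Below is a minimal sketch for a diagonal Gaussian, where detaching the parameters inside the density evaluation removes the noisy score term (which is zero in expectation); this shows the general mechanism only, not the paper's fast estimator for flows.

```python
import math
import torch

mu = torch.zeros(2, requires_grad=True)
log_sigma = torch.zeros(2, requires_grad=True)

def log_q(x, mu_, log_sigma_):       # diagonal-Gaussian log-density
    z = (x - mu_) / log_sigma_.exp()
    return (-0.5 * z ** 2 - log_sigma_ - 0.5 * math.log(2 * math.pi)).sum(-1)

def log_p(x):                        # unnormalised target density
    return -0.5 * ((x - 1.0) ** 2).sum(-1)

z = torch.randn(256, 2)
x = mu + log_sigma.exp() * z         # sample keeps its path dependence on the parameters
# Path gradient: detach the parameters inside the density evaluation so only
# the lower-variance path term contributes to the gradient estimate.
kl = (log_q(x, mu.detach(), log_sigma.detach()) - log_p(x)).mean()
kl.backward()
```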
arXiv Detail & Related papers (2024-03-23T16:21:22Z)
- Directly Attention Loss Adjusted Prioritized Experience Replay [0.07366405857677226]
Prioritized Experience Replay (PER) enables the model to learn more from relatively important samples by artificially changing their access frequencies.
We propose DALAP, which directly quantifies the extent of the distribution shift through a Parallel Self-Attention network.
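For context, a compact sketch of the proportional prioritized-replay mechanism that DALAP adjusts: priorities follow |TD error|, and importance-sampling weights correct the induced distribution shift. DALAP's Parallel Self-Attention component is not reproduced here, and the hyperparameters are illustrative.

```python
import numpy as np

class PrioritizedReplay:
    def __init__(self, capacity, alpha=0.6):
        self.capacity, self.alpha = capacity, alpha
        self.data, self.prios = [], []

    def add(self, transition):
        self.prios.append(max(self.prios, default=1.0))  # new samples get max priority
        self.data.append(transition)
        if len(self.data) > self.capacity:
            self.data.pop(0); self.prios.pop(0)

    def sample(self, batch_size, beta=0.4):
        p = np.asarray(self.prios) ** self.alpha
        p /= p.sum()
        idx = np.random.choice(len(self.data), batch_size, p=p)
        weights = (len(self.data) * p[idx]) ** (-beta)   # importance-sampling correction
        weights /= weights.max()
        return idx, [self.data[i] for i in idx], weights

    def update(self, idx, td_errors, eps=1e-6):
        for i, e in zip(idx, td_errors):
            self.prios[i] = abs(float(e)) + eps
```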
arXiv Detail & Related papers (2023-11-24T10:14:05Z)
- Sample-Efficient Optimisation with Probabilistic Transformer Surrogates [66.98962321504085]
This paper investigates the feasibility of employing state-of-the-art probabilistic transformers in Bayesian optimisation.
We observe two drawbacks stemming from their training procedure and loss definition, hindering their direct deployment as proxies in black-box optimisation.
We introduce two components: 1) a BO-tailored training prior supporting non-uniformly distributed points, and 2) a novel approximate posterior regulariser trading-off accuracy and input sensitivity to filter favourable stationary points for improved predictive performance.
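A generic Bayesian-optimisation loop, sketched to show where such a surrogate slots in: any model exposing a predictive mean and standard deviation can drive an expected-improvement acquisition step. The `surrogate.fit`/`surrogate.predict` interface is a hypothetical stand-in, not the paper's API.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best, xi=0.01):
    z = (mu - f_best - xi) / np.maximum(sigma, 1e-9)
    return (mu - f_best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

def bo_loop(objective, surrogate, candidates, n_iters=20):
    X, y = [candidates[0]], [objective(candidates[0])]
    for _ in range(n_iters):
        surrogate.fit(np.array(X), np.array(y))
        mu, sigma = surrogate.predict(candidates)   # predictive mean / std
        x_next = candidates[np.argmax(expected_improvement(mu, sigma, max(y)))]
        X.append(x_next); y.append(objective(x_next))
    return X[int(np.argmax(y))], max(y)
```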
arXiv Detail & Related papers (2022-05-27T11:13:17Z)
- Back to Basics: Efficient Network Compression via IMP [22.586474627159287]
Iterative Magnitude Pruning (IMP) is one of the most established approaches for network pruning.
It is often argued that IMP reaches suboptimal states by not incorporating sparsification into the training phase.
We find that IMP with SLR for retraining can outperform state-of-the-art pruning-during-training approaches.
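A compact sketch of the IMP loop itself, in its usual formulation: train, prune the smallest-magnitude weights, retrain the survivors, repeat. The retraining schedule (such as the SLR variant the summary mentions) is left to the caller.

```python
import torch
import torch.nn as nn

def magnitude_prune(model, fraction):
    """Zero the `fraction` of weights with smallest magnitude; return masks."""
    all_w = torch.cat([p.detach().abs().flatten() for p in model.parameters()])
    threshold = torch.quantile(all_w, fraction)
    masks = {}
    with torch.no_grad():
        for name, p in model.named_parameters():
            masks[name] = (p.abs() >= threshold).float()
            p.mul_(masks[name])
    return masks

def imp(model, train_fn, rounds=5, fraction_per_round=0.2):
    masks = None
    for _ in range(rounds):
        train_fn(model, masks)   # retraining must keep pruned weights at zero,
                                 # e.g. by re-applying the masks after each step
        masks = magnitude_prune(model, fraction_per_round)
    return model, masks
```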
arXiv Detail & Related papers (2021-11-01T11:23:44Z)
- Correcting Momentum in Temporal Difference Learning [95.62766731469671]
We argue that momentum in Temporal Difference (TD) learning accumulates gradients that become doubly stale.
We show that this phenomenon exists, and then propose a first-order correction term to momentum.
An important insight of this work is that deep RL methods are not always best served by directly importing techniques from the supervised setting.
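One generic way to realise such a first-order correction, sketched below: the stale momentum buffer is transported to the current iterate with a Hessian-vector product before the usual decay-and-accumulate update. This is an illustrative reading of "first-order correction", not necessarily the paper's exact term.

```python
import torch

def corrected_momentum_step(params, loss_fn, state, lr=1e-3, beta=0.9):
    g = torch.autograd.grad(loss_fn(params), params, create_graph=True)[0]
    buf = state.get("buf")
    if buf is not None:
        # Transport the stale momentum: buf <- buf + H(theta_t) @ delta,
        # a first-order Taylor adjustment of the accumulated gradients.
        hvp = torch.autograd.grad(g, params, grad_outputs=state["delta"],
                                  retain_graph=True)[0]
        buf = beta * (buf + hvp) + g
    else:
        buf = g
    new_params = (params - lr * buf).detach().requires_grad_(True)
    state["buf"] = buf.detach()
    state["delta"] = (new_params - params).detach()
    return new_params
```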
arXiv Detail & Related papers (2021-06-07T20:41:15Z)
- Experience Replay with Likelihood-free Importance Weights [123.52005591531194]
We propose to reweight experiences based on their likelihood under the stationary distribution of the current policy.
We apply the proposed approach empirically on two competitive methods, Soft Actor Critic (SAC) and Twin Delayed Deep Deterministic policy gradient (TD3).
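A hedged sketch of the likelihood-free ingredient: a small classifier is trained to distinguish fresh on-policy samples from replay samples, and its odds ratio estimates the density ratio used to reweight replayed losses. The network, the 6-dimensional state-action input, and the self-normalisation are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

disc = nn.Sequential(nn.Linear(6, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(disc.parameters(), lr=3e-4)

def train_ratio(on_policy_sa, replay_sa):
    """Logistic regression: label 1 for fresh on-policy data, 0 for replay."""
    logit_p, logit_q = disc(on_policy_sa), disc(replay_sa)
    loss = (F.binary_cross_entropy_with_logits(logit_p, torch.ones_like(logit_p))
            + F.binary_cross_entropy_with_logits(logit_q, torch.zeros_like(logit_q)))
    opt.zero_grad(); loss.backward(); opt.step()

def importance_weights(replay_sa):
    # For an optimal discriminator D, d_pi / d_replay = D / (1 - D) = exp(logit).
    with torch.no_grad():
        w = disc(replay_sa).squeeze(-1).exp()
    return w / w.mean()               # self-normalise for stability
```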
arXiv Detail & Related papers (2020-06-23T17:17:44Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of extrapolation variants can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
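A minimal extragradient-style extrapolation step of the kind such frameworks cover: evaluate the gradient at a trial point ahead of the current iterate, then apply it at the original point. Step sizes are illustrative.

```python
import torch

def extrapolation_step(params, loss_fn, lr=0.1, extrap_lr=0.1):
    g = torch.autograd.grad(loss_fn(params), params)[0]
    # Look ahead with a trial step, then use the gradient from there.
    lookahead = (params - extrap_lr * g).detach().requires_grad_(True)
    g_look = torch.autograd.grad(loss_fn(lookahead), lookahead)[0]
    with torch.no_grad():
        new_params = params - lr * g_look
    return new_params.requires_grad_(True)
```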
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
- Scaling Equilibrium Propagation to Deep ConvNets by Drastically Reducing its Gradient Estimator Bias [65.13042449121411]
In practice, training a network with the gradient estimates provided by Equilibrium Propagation (EP) does not scale to visual tasks harder than MNIST.
We show that a bias in the gradient estimate of EP, inherent in the use of finite nudging, is responsible for this phenomenon.
We apply these techniques to train an architecture with asymmetric forward and backward connections, yielding a 13.2% test error.
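The bias mechanism is that of a one-sided finite difference: estimating a derivative as (F(beta) - F(0)) / beta carries an O(beta) error, while a symmetric estimate cancels it to O(beta^2). A scalar illustration of this numerical point (a stand-in function, not EP itself):

```python
import math

F = math.exp                      # stand-in for EP's nudged objective
beta, exact = 0.1, 1.0            # d/dx exp(x) at x = 0 is 1

one_sided = (F(beta) - F(0.0)) / beta            # finite one-sided nudging
symmetric = (F(beta) - F(-beta)) / (2.0 * beta)  # symmetric nudging

print(abs(one_sided - exact))     # ~5.2e-2: bias linear in beta
print(abs(symmetric - exact))     # ~1.7e-3: bias quadratic in beta
```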
arXiv Detail & Related papers (2020-06-06T09:36:07Z)
- The Break-Even Point on Optimization Trajectories of Deep Neural Networks [64.7563588124004]
We argue for the existence of a "break-even" point on the optimization trajectory of deep neural networks.
We show that using a large learning rate in the initial phase of training reduces the variance of the gradient.
We also show that using a low learning rate results in bad conditioning of the loss surface even for a neural network with batch normalization layers.
arXiv Detail & Related papers (2020-02-21T22:55:51Z)