Dissecting Deep RL with High Update Ratios: Combatting Value Divergence
- URL: http://arxiv.org/abs/2403.05996v3
- Date: Mon, 5 Aug 2024 11:55:19 GMT
- Title: Dissecting Deep RL with High Update Ratios: Combatting Value Divergence
- Authors: Marcel Hussing, Claas Voelcker, Igor Gilitschenski, Amir-massoud Farahmand, Eric Eaton
- Abstract summary: We show that deep reinforcement learning algorithms can retain their ability to learn without resetting network parameters.
We employ a simple unit-ball normalization that enables learning under large update ratios.
- Score: 21.282292112642747
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract:   We show that deep reinforcement learning algorithms can retain their ability to learn without resetting network parameters in settings where the number of gradient updates greatly exceeds the number of environment samples by combatting value function divergence. Under large update-to-data ratios, a recent study by Nikishin et al. (2022) suggested the emergence of a primacy bias, in which agents overfit early interactions and downplay later experience, impairing their ability to learn. In this work, we investigate the phenomena leading to the primacy bias. We inspect the early stages of training that were conjectured to cause the failure to learn and find that one fundamental challenge is a long-standing acquaintance: value function divergence. Overinflated Q-values are found not only on out-of-distribution but also in-distribution data and can be linked to overestimation on unseen action prediction propelled by optimizer momentum. We employ a simple unit-ball normalization that enables learning under large update ratios, show its efficacy on the widely used dm_control suite, and obtain strong performance on the challenging dog tasks, competitive with model-based approaches. Our results question, in parts, the prior explanation for sub-optimal learning due to overfitting early data. 
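The abstract names a unit-ball normalization but does not say where it is applied. As a rough illustration only, the sketch below assumes the normalization rescales the features feeding a linear Q-value head to unit L2 norm; the function names, shapes, and the zero bias are hypothetical and not taken from the paper. Under that assumption, the predicted value is bounded by the norm of the head weights regardless of how large the raw features grow.

```python
import numpy as np


def unit_norm_features(features, eps=1e-8):
    """Rescale each feature vector (row) to unit L2 norm.

    Assumption: this is one plausible reading of "unit-ball normalization";
    the abstract does not specify the exact placement.
    """
    norms = np.linalg.norm(features, axis=-1, keepdims=True)
    return features / np.maximum(norms, eps)


def q_value(features, w, b=0.0):
    """Hypothetical linear Q-head applied to normalized features."""
    return unit_norm_features(features) @ w + b


rng = np.random.default_rng(0)
w = rng.normal(size=4)            # head weights (illustrative)
feats = rng.normal(size=(3, 4))   # a batch of three feature vectors

q_small = q_value(feats, w)
q_large = q_value(1e6 * feats, w)  # same directions, inflated magnitude

# By Cauchy-Schwarz, |Q| <= ||w|| when the features have unit norm and b = 0.
bound = np.linalg.norm(w)
```

In this sketch, inflating the feature magnitude leaves the Q-values unchanged, so any growth in the value estimate has to come from the head weights themselves.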
 
      
Related papers
- In-Context Linear Regression Demystified: Training Dynamics and Mechanistic Interpretability of Multi-Head Softmax Attention [52.159541540613915]
 We study how multi-head softmax attention models are trained to perform in-context learning on linear data.
Our results reveal that in-context learning ability emerges from the trained transformer as an aggregated effect of its architecture and the underlying data distribution.
 arXiv  Detail & Related papers  (2025-03-17T02:00:49Z)
- MAD-TD: Model-Augmented Data stabilizes High Update Ratio RL [20.22674077197914]
 Recent work has explored updating neural networks with large numbers of gradient steps for every new sample.
High update-to-data ratios introduce instability to the training process.
Our method, Model-Augmented Data for Temporal Difference learning (MAD-TD), uses small amounts of generated data to stabilize high UTD training.
 arXiv  Detail & Related papers  (2024-10-11T15:13:17Z)
- Temporal-Difference Variational Continual Learning [89.32940051152782]
 A crucial capability of Machine Learning models in real-world applications is the ability to continuously learn new tasks.
In Continual Learning settings, models often struggle to balance learning new tasks with retaining previous knowledge.
We propose new learning objectives that integrate the regularization effects of multiple previous posterior estimations.
 arXiv  Detail & Related papers  (2024-10-10T10:58:41Z)
- ReAugment: Model Zoo-Guided RL for Few-Shot Time Series Augmentation and Forecasting [74.00765474305288]
 We present a pilot study on using reinforcement learning (RL) for time series data augmentation.
Our method, ReAugment, tackles three critical questions: which parts of the training set should be augmented, how the augmentation should be performed, and what advantages RL brings to the process.
 arXiv  Detail & Related papers  (2024-09-10T07:34:19Z)
- Federated Class-Incremental Learning with Hierarchical Generative Prototypes [10.532838477096055]
 Federated Learning (FL) aims at unburdening the training of deep models by distributing computation across multiple devices (clients).
Our proposal constrains both biases in the last layer by efficiently finetuning a pre-trained backbone using learnable prompts.
Our method significantly improves the current State Of The Art, providing an average increase of +7.8% in accuracy.
 arXiv  Detail & Related papers  (2024-06-04T16:12:27Z)
- Enhancing Consistency and Mitigating Bias: A Data Replay Approach for Incremental Learning [100.7407460674153]
 Deep learning systems are prone to catastrophic forgetting when learning from a sequence of tasks.
To mitigate the problem, a line of methods propose to replay the data of experienced tasks when learning new tasks.
However, this is often impractical given memory constraints or data privacy issues.
As a replacement, data-free data replay methods are proposed by inverting samples from the classification model.
 arXiv  Detail & Related papers  (2024-01-12T12:51:12Z)
- Directly Attention Loss Adjusted Prioritized Experience Replay [0.07366405857677226]
 Prioritized Experience Replay (PER) enables the model to learn more about relatively important samples by artificially changing their accessed frequencies.
DALAP is proposed, which can directly quantify the changed extent of the shifted distribution through a Parallel Self-Attention network.
 arXiv  Detail & Related papers  (2023-11-24T10:14:05Z)
- Understanding, Predicting and Better Resolving Q-Value Divergence in Offline-RL [86.0987896274354]
 We first identify a fundamental pattern, self-excitation, as the primary cause of Q-value estimation divergence in offline RL.
We then propose a novel Self-Excite Eigenvalue Measure (SEEM) metric to measure the evolving property of Q-network at training.
For the first time, our theory can reliably decide whether the training will diverge at an early stage.
 arXiv  Detail & Related papers  (2023-10-06T17:57:44Z)
- Understanding and Mitigating the Label Noise in Pre-training on Downstream Tasks [91.15120211190519]
 This paper aims to understand the nature of noise in pre-training datasets and to mitigate its impact on downstream tasks.
We propose a light-weight black-box tuning method (NMTune) to affine the feature space to mitigate the malignant effect of noise.
 arXiv  Detail & Related papers  (2023-09-29T06:18:15Z)
- Examining the Effect of Pre-training on Time Series Classification [21.38211396933795]
 This study investigates the impact of pre-training followed by fine-tuning on the fine-tuning process.
We conducted a thorough examination of 150 classification datasets.
We find that pre-training can only help improve the optimization process for models that fit the data poorly.
Adding more pre-training data does not improve generalization, but it can strengthen the advantage of pre-training on the original data volume.
 arXiv  Detail & Related papers  (2023-09-11T06:26:57Z)
- Time Series Contrastive Learning with Information-Aware Augmentations [57.45139904366001]
 A key component of contrastive learning is to select appropriate augmentations imposing some priors to construct feasible positive samples.
How to find the desired augmentations of time series data that are meaningful for given contrastive learning tasks and datasets remains an open question.
We propose a new contrastive learning approach with information-aware augmentations, InfoTS, that adaptively selects optimal augmentations for time series representation learning.
 arXiv  Detail & Related papers  (2023-03-21T15:02:50Z)
- TWINS: A Fine-Tuning Framework for Improved Transferability of Adversarial Robustness and Generalization [89.54947228958494]
 This paper focuses on the fine-tuning of an adversarially pre-trained model in various classification tasks.
We propose a novel statistics-based approach, Two-WIng NormliSation (TWINS) fine-tuning framework.
TWINS is shown to be effective on a wide range of image classification datasets in terms of both generalization and robustness.
 arXiv  Detail & Related papers  (2023-03-20T14:12:55Z)
- SURF: Semi-supervised Reward Learning with Data Augmentation for Feedback-efficient Preference-based Reinforcement Learning [168.89470249446023]
 We present SURF, a semi-supervised reward learning framework that utilizes a large amount of unlabeled samples with data augmentation.
In order to leverage unlabeled samples for reward learning, we infer pseudo-labels of the unlabeled samples based on the confidence of the preference predictor.
Our experiments demonstrate that our approach significantly improves the feedback-efficiency of the preference-based method on a variety of locomotion and robotic manipulation tasks.
 arXiv  Detail & Related papers  (2022-03-18T16:50:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
       
     