Related papers: MAD-TD: Model-Augmented Data stabilizes High Update Ratio RL

MAD-TD: Model-Augmented Data stabilizes High Update Ratio RL

URL: http://arxiv.org/abs/2410.08896v1
Date: Fri, 11 Oct 2024 15:13:17 GMT
Title: MAD-TD: Model-Augmented Data stabilizes High Update Ratio RL
Authors: Claas A Voelcker, Marcel Hussing, Eric Eaton, Amir-massoud Farahmand, Igor Gilitschenski,
Abstract summary: Recent work has explored updating neural networks with large numbers of gradient steps for every new sample. High update-to-data ratios introduce instability to the training process. Our method, Model-Augmented Data for Temporal Difference learning (MAD-TD), uses small amounts of generated data to stabilize high UTD training.
Score: 20.22674077197914
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Building deep reinforcement learning (RL) agents that find a good policy with few samples has proven notoriously challenging. To achieve sample efficiency, recent work has explored updating neural networks with large numbers of gradient steps for every new sample. While such high update-to-data (UTD) ratios have shown strong empirical performance, they also introduce instability to the training process. Previous approaches need to rely on periodic neural network parameter resets to address this instability, but restarting the training process is infeasible in many real-world applications and requires tuning the resetting interval. In this paper, we focus on one of the core difficulties of stable training with limited samples: the inability of learned value functions to generalize to unobserved on-policy actions. We mitigate this issue directly by augmenting the off-policy RL training process with a small amount of data generated from a learned world model. Our method, Model-Augmented Data for Temporal Difference learning (MAD-TD) uses small amounts of generated data to stabilize high UTD training and achieve competitive performance on the most challenging tasks in the DeepMind control suite. Our experiments further highlight the importance of employing a good model to generate data, MAD-TD's ability to combat value overestimation, and its practical stability gains for continued learning.

Related papers

Towards High Data Efficiency in Reinforcement Learning with Verifiable Reward [54.708851958671794]
We propose a Data-Efficient Policy Optimization pipeline that combines optimized strategies for both offline and online data selection.<n>In offline phase, we curate a high-quality subset of training samples based on diversity, influence, and appropriate difficulty.<n>During online RLVR training, we introduce a sample-level explorability metric to dynamically filter samples with low exploration potential.
arXiv Detail & Related papers (2025-09-01T10:04:20Z)
An Active Learning-Based Streaming Pipeline for Reduced Data Training of Structure Finding Models in Neutron Diffractometry [1.3083205962260995]
We introduce a novel batch-mode active learning (AL) policy that uses uncertainty sampling to simulate training data drawn from a probability distribution.<n>We confirm its efficacy in training the same models with about 75% less training data while improving the accuracy.<n>We then discuss the design of an efficient stream-based training workflow that uses this AL policy and present a performance study on two heterogeneous platforms.
arXiv Detail & Related papers (2025-06-06T15:48:22Z)
SPEQ: Offline Stabilization Phases for Efficient Q-Learning in High Update-To-Data Ratio Reinforcement Learning [51.10866035483686]
High update-to-data (UTD) ratio algorithms in reinforcement learning (RL) improve sample efficiency but incur high computational costs, limiting real-world scalability. We propose Offline Stabilization Phases for Efficient Q-Learning (SPEQ), an RL algorithm that combines low-UTD online training with periodic offline stabilization phases. During these phases, Q-functions are fine-tuned with high UTD ratios on a fixed replay buffer, reducing redundant updates on suboptimal data.
arXiv Detail & Related papers (2025-01-15T09:04:19Z)
Continual Diffuser (CoD): Mastering Continual Offline Reinforcement Learning with Experience Rehearsal [54.93261535899478]
In real-world applications, such as robotic control of reinforcement learning, the tasks are changing, and new tasks arise in a sequential order. This situation poses the new challenge of plasticity-stability trade-off for training an agent who can adapt to task changes and retain acquired knowledge. We propose a rehearsal-based continual diffusion model, called Continual diffuser (CoD), to endow the diffuser with the capabilities of quick adaptation (plasticity) and lasting retention (stability)
arXiv Detail & Related papers (2024-09-04T08:21:47Z)
Controlling Forgetting with Test-Time Data in Continual Learning [15.455400390299593]
Ongoing Continual Learning research provides techniques to overcome catastrophic forgetting of previous information when new knowledge is acquired. We argue that test-time data hold great information that can be leveraged in a self supervised manner to refresh the model's memory of previous learned tasks.
arXiv Detail & Related papers (2024-06-19T15:56:21Z)
VIRL: Volume-Informed Representation Learning towards Few-shot Manufacturability Estimation [0.0]
This work introduces VIRL, a Volume-Informed Representation Learning approach to pre-train a 3D geometric encoder. The model pre-trained by VIRL shows substantial enhancements on demonstrating improved generalizability with limited data.
arXiv Detail & Related papers (2024-06-18T05:30:26Z)
Dissecting Deep RL with High Update Ratios: Combatting Value Divergence [21.282292112642747]
We show that deep reinforcement learning algorithms can retain their ability to learn without resetting network parameters. We employ a simple unit-ball normalization that enables learning under large update ratios.
arXiv Detail & Related papers (2024-03-09T19:56:40Z)
Take the Bull by the Horns: Hard Sample-Reweighted Continual Training Improves LLM Generalization [165.98557106089777]
A key challenge is to enhance the capabilities of large language models (LLMs) amid a looming shortage of high-quality training data. Our study starts from an empirical strategy for the light continual training of LLMs using their original pre-training data sets. We then formalize this strategy into a principled framework of Instance-Reweighted Distributionally Robust Optimization.
arXiv Detail & Related papers (2024-02-22T04:10:57Z)
Learning to Modulate pre-trained Models in RL [22.812215561012874]
Fine-tuning a pre-trained model often suffers from catastrophic forgetting. Our study shows that with most fine-tuning approaches, the performance on pre-training tasks deteriorates significantly. We propose a novel method, Learning-to-Modulate (L2M), that avoids the degradation of learned skills by modulating the information flow of the frozen pre-trained model.
arXiv Detail & Related papers (2023-06-26T17:53:05Z)
Robust Learning with Progressive Data Expansion Against Spurious Correlation [65.83104529677234]
We study the learning process of a two-layer nonlinear convolutional neural network in the presence of spurious features. Our analysis suggests that imbalanced data groups and easily learnable spurious features can lead to the dominance of spurious features during the learning process. We propose a new training algorithm called PDE that efficiently enhances the model's robustness for a better worst-group performance.
arXiv Detail & Related papers (2023-06-08T05:44:06Z)
Dataset Distillation: A Comprehensive Review [76.26276286545284]
dataset distillation (DD) aims to derive a much smaller dataset containing synthetic samples, based on which the trained models yield performance comparable with those trained on the original dataset. This paper gives a comprehensive review and summary of recent advances in DD and its application.
arXiv Detail & Related papers (2023-01-17T17:03:28Z)
Self-Damaging Contrastive Learning [92.34124578823977]
Unlabeled data in reality is commonly imbalanced and shows a long-tail distribution. This paper proposes a principled framework called Self-Damaging Contrastive Learning to automatically balance the representation learning without knowing the classes. Our experiments show that SDCLR significantly improves not only overall accuracies but also balancedness.
arXiv Detail & Related papers (2021-06-06T00:04:49Z)
Regularizing Generative Adversarial Networks under Limited Data [88.57330330305535]
This work proposes a regularization approach for training robust GAN models on limited data. We show a connection between the regularized loss and an f-divergence called LeCam-divergence, which we find is more robust under limited training data.
arXiv Detail & Related papers (2021-04-07T17:59:06Z)
Off-Policy Meta-Reinforcement Learning Based on Feature Embedding Spaces [14.029933823101084]
We propose a novel off-policy meta-RL method, embedding learning and evaluation of uncertainty (ELUE) ELUE learns a belief model over the embedding space and a belief-conditional policy and Q-function. We demonstrate that ELUE outperforms state-of-the-art meta RL methods through experiments on meta-RL benchmarks.
arXiv Detail & Related papers (2021-01-06T05:51:38Z)
Overcoming Model Bias for Robust Offline Deep Reinforcement Learning [3.1325640909772403]
MOOSE is an algorithm which ensures low model bias by keeping the policy within the support of the data. We compare MOOSE with state-of-the-art model-free, offline RL algorithms BRAC, BEAR and BCQ on the Industrial Benchmark and MuJoCo continuous control tasks in terms of robust performance.
arXiv Detail & Related papers (2020-08-12T19:08:55Z)
Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose. We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)

This list is automatically generated from the titles and abstracts of the papers in this site.