Efficient Transformers in Reinforcement Learning using Actor-Learner
Distillation
- URL: http://arxiv.org/abs/2104.01655v1
- Date: Sun, 4 Apr 2021 17:56:34 GMT
- Title: Efficient Transformers in Reinforcement Learning using Actor-Learner
Distillation
- Authors: Emilio Parisotto, Ruslan Salakhutdinov
- Abstract summary: "Actor-Learner Distillation" transfers learning progress from a large capacity learner model to a small capacity actor model.
We demonstrate in several challenging memory environments that using Actor-Learner Distillation recovers the clear sample-efficiency gains of the transformer learner model.
- Score: 91.05073136215886
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many real-world applications such as robotics provide hard constraints on
power and compute that limit the viable model complexity of Reinforcement
Learning (RL) agents. Similarly, in many distributed RL settings, acting is
done on un-accelerated hardware such as CPUs, which likewise restricts model
size to prevent intractable experiment run times. These "actor-latency"
constrained settings present a major obstruction to the scaling up of model
complexity that has recently been extremely successful in supervised learning.
To be able to utilize large model capacity while still operating within the
limits imposed by the system during acting, we develop an "Actor-Learner
Distillation" (ALD) procedure that leverages a continual form of distillation
that transfers learning progress from a large capacity learner model to a small
capacity actor model. As a case study, we develop this procedure in the context
of partially-observable environments, where transformer models have had large
improvements over LSTMs recently, at the cost of significantly higher
computational complexity. With transformer models as the learner and LSTMs as
the actor, we demonstrate in several challenging memory environments that using
Actor-Learner Distillation recovers the clear sample-efficiency gains of the
transformer learner model while maintaining the fast inference and reduced
total training time of the LSTM actor model.
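The distillation step described in the abstract can be illustrated with a minimal sketch. It assumes, as an illustration only, that the small actor is trained to match the large learner through a KL divergence on discrete action distributions plus a squared error on value estimates. The abstract does not specify the exact losses, so the function names, the `value_weight` parameter, and the form of the objective below are hypothetical and not the authors' implementation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of action logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def policy_distillation_loss(learner_logits, actor_logits):
    """KL(learner || actor): pulls the actor policy toward the learner policy."""
    log_p = log_softmax(learner_logits)  # large-capacity learner (e.g. transformer)
    log_q = log_softmax(actor_logits)    # small-capacity actor (e.g. LSTM)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def value_distillation_loss(learner_value, actor_value):
    """Squared error between learner and actor value estimates (assumed form)."""
    return 0.5 * (learner_value - actor_value) ** 2


def ald_actor_loss(learner_logits, actor_logits,
                   learner_value, actor_value, value_weight=1.0):
    """Hypothetical combined actor objective: policy KL plus weighted value error."""
    return (policy_distillation_loss(learner_logits, actor_logits)
            + value_weight * value_distillation_loss(learner_value, actor_value))


if __name__ == "__main__":
    # The loss is zero when the actor already matches the learner.
    print(ald_actor_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], 0.5, 0.5))
    # The loss is positive when the two models disagree.
    print(ald_actor_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0], 0.5, 1.5))
```

In a continual setting, a loss of this kind would be minimized on the actor throughout training, while the learner is updated with the usual RL objective on data that the actor collects.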
Related papers
- Bridging the Resource Gap: Deploying Advanced Imitation Learning Models onto Affordable Embedded Platforms [13.488752211167533]
We propose a pipeline that facilitates the migration of advanced imitation learning algorithms to edge devices.
To show the efficiency of the proposed pipeline, large-scale imitation learning models are trained on a server and deployed on an edge device to complete various manipulation tasks.
arXiv Detail & Related papers (2024-11-18T09:28:11Z)
- Re-Parameterization of Lightweight Transformer for On-Device Speech Emotion Recognition [10.302458835329539]
We introduce a new method, namely Transformer Re-parameterization, to boost the performance of lightweight Transformer models.
Experimental results show that our proposed method consistently improves the performance of lightweight Transformers, even making them comparable to large models.
arXiv Detail & Related papers (2024-11-14T10:36:19Z)
- Transformer Layer Injection: A Novel Approach for Efficient Upscaling of Large Language Models [0.0]
Transformer Layer Injection (TLI) is a novel method for efficiently upscaling large language models (LLMs).
Our approach improves upon the conventional Depth Up-Scaling (DUS) technique by injecting new layers into every set of K layers.
arXiv Detail & Related papers (2024-10-15T14:41:44Z)
- Diffusion-Based Neural Network Weights Generation [80.89706112736353]
D2NWG is a diffusion-based neural network weights generation technique that efficiently produces high-performing weights for transfer learning.
Our method extends generative hyper-representation learning to recast the latent diffusion paradigm for neural network weights generation.
Our approach is scalable to large architectures such as large language models (LLMs), overcoming the limitations of current parameter generation techniques.
arXiv Detail & Related papers (2024-02-28T08:34:23Z)
- One-Step Diffusion Distillation via Deep Equilibrium Models [64.11782639697883]
We introduce a simple yet effective means of distilling diffusion models directly from initial noise to the resulting image.
Our method enables fully offline training with just noise/image pairs from the diffusion model.
We demonstrate that the DEQ architecture is crucial to this capability, as GET matches a $5\times$ larger ViT in terms of FID scores.
arXiv Detail & Related papers (2023-12-12T07:28:40Z)
- Emergent Agentic Transformer from Chain of Hindsight Experience [96.56164427726203]
We show, for the first time, that a simple transformer-based model performs competitively with both temporal-difference and imitation-learning-based approaches.
arXiv Detail & Related papers (2023-05-26T00:43:02Z)
- Learning to Grow Pretrained Models for Efficient Transformer Training [72.20676008625641]
We learn to grow pretrained transformers, where we learn to linearly map the parameters of the smaller model to initialize the larger model.
Experiments across both language and vision transformers demonstrate that our learned Linear Growth Operator (LiGO) can save up to 50% computational cost of training from scratch.
arXiv Detail & Related papers (2023-03-02T05:21:18Z)
- Learning a model is paramount for sample efficiency in reinforcement learning control of PDEs [5.488334211013093]
We show that learning an actuated model in parallel to training the RL agent significantly reduces the total amount of required data sampled from the real system.
We also show that iteratively updating the model is of major importance to avoid biases in the RL training.
arXiv Detail & Related papers (2023-02-14T16:14:39Z)
- Unifying Synergies between Self-supervised Learning and Dynamic Computation [53.66628188936682]
We present a novel perspective on the interplay between SSL and DC paradigms.
We show that it is feasible to simultaneously learn a dense and gated sub-network from scratch in a SSL setting.
The co-evolution during pre-training of both dense and gated encoder offers a good accuracy-efficiency trade-off.
arXiv Detail & Related papers (2023-01-22T17:12:58Z)
- RLFlow: Optimising Neural Network Subgraph Transformation with World Models [0.0]
We propose a model-based agent which learns to optimise the architecture of neural networks by performing a sequence of subgraph transformations to reduce model runtime.
We show our approach can match state-of-the-art performance on common convolutional networks and outperform it by up to 5% on transformer-style architectures.
arXiv Detail & Related papers (2022-05-03T11:52:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.