Mixtures of Experts Unlock Parameter Scaling for Deep RL
- URL: http://arxiv.org/abs/2402.08609v3
- Date: Wed, 26 Jun 2024 16:50:01 GMT
- Title: Mixtures of Experts Unlock Parameter Scaling for Deep RL
- Authors: Johan Obando-Ceron, Ghada Sokar, Timon Willi, Clare Lyle, Jesse Farebrother, Jakob Foerster, Gintare Karolina Dziugaite, Doina Precup, Pablo Samuel Castro,
- Abstract summary: In this paper, we demonstrate that incorporating Mixture-of-Expert (MoE) modules into value-based networks results in more parameter-scalable models.
This work thus provides strong empirical evidence towards developing scaling laws for reinforcement learning.
- Score: 54.26191237981469
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The recent rapid progress in (self) supervised learning models is in large part predicted by empirical scaling laws: a model's performance scales proportionally to its size. Analogous scaling laws remain elusive for reinforcement learning domains, however, where increasing the parameter count of a model often hurts its final performance. In this paper, we demonstrate that incorporating Mixture-of-Expert (MoE) modules, and in particular Soft MoEs (Puigcerver et al., 2023), into value-based networks results in more parameter-scalable models, evidenced by substantial performance increases across a variety of training regimes and model sizes. This work thus provides strong empirical evidence towards developing scaling laws for reinforcement learning.
Related papers
- Joint MoE Scaling Laws: Mixture of Experts Can Be Memory Efficient [4.34286535607654]
We present joint scaling laws for dense and MoE models, incorporating key factors such as the number of active parameters, dataset size, and the number of experts.
Surprisingly, we show that MoE models can be more memory-efficient than dense models, contradicting conventional wisdom.
arXiv Detail & Related papers (2025-02-07T18:55:38Z) - Scalable Language Models with Posterior Inference of Latent Thought Vectors [52.63299874322121]
Latent-Thought Language Models (LTMs) incorporate explicit latent thought vectors that follow an explicit prior model in latent space.
LTMs possess additional scaling dimensions beyond traditional LLMs, yielding a structured design space.
LTMs significantly outperform conventional autoregressive models and discrete diffusion models in validation perplexity and zero-shot language modeling.
arXiv Detail & Related papers (2025-02-03T17:50:34Z) - Scaling Law for Language Models Training Considering Batch Size [17.09348741898811]
Large language models (LLMs) have made remarkable advances in recent years, with scaling laws playing a critical role in this rapid progress.
We empirically investigate how a critical hyper- parameter, i.e., the global batch size, influences the LLM training prdocess.
We establish a basic scaling law on model size and training data amount.
We then examine how varying batch sizes and learning rates affect the convergence and generalization of these models.
arXiv Detail & Related papers (2024-12-02T13:58:35Z) - A Hitchhiker's Guide to Scaling Law Estimation [56.06982415792523]
Scaling laws predict the loss of a target machine learning model by extrapolating from easier-to-train models with fewer parameters or smaller training sets.
We estimate more than 1000 scaling laws, then derive a set of best practices for estimating scaling laws in new model families.
arXiv Detail & Related papers (2024-10-15T17:59:10Z) - Scaling Laws Across Model Architectures: A Comparative Analysis of Dense and MoE Models in Large Language Models [34.79589443380606]
The scaling of large language models (LLMs) is a critical research area for the efficiency and effectiveness of model training and deployment.
Our work investigates the transferability and discrepancies of scaling laws between Dense Models and MoE models.
arXiv Detail & Related papers (2024-10-08T03:21:56Z) - Observational Scaling Laws and the Predictability of Language Model Performance [51.2336010244645]
We propose an observational approach that bypasses model training and instead builds scaling laws from 100 publically available models.
We show that several emergent phenomena follow a smooth, sigmoidal behavior and are predictable from small models.
We show how to predict the impact of post-training interventions like Chain-of-Thought and Self-Consistency as language model capabilities continue to improve.
arXiv Detail & Related papers (2024-05-17T17:49:44Z) - Scaling Laws For Dense Retrieval [22.76001461620846]
We investigate whether the performance of dense retrieval models follows the scaling law as other neural models.
Results indicate that, under our settings, the performance of dense retrieval models follows a precise power-law scaling related to the model size and the number of annotations.
arXiv Detail & Related papers (2024-03-27T15:27:36Z) - Scaling Laws for Fine-Grained Mixture of Experts [4.412803924115907]
Mixture of Experts (MoE) models have emerged as a primary solution for reducing the computational cost of Large Language Models.
In this work, we analyze their scaling properties, incorporating an expanded range of variables.
We establish scaling laws for fine-grained MoE, taking into account the number of training tokens, model size, and granularity.
arXiv Detail & Related papers (2024-02-12T18:33:47Z) - Retrieval-based Knowledge Transfer: An Effective Approach for Extreme
Large Language Model Compression [64.07696663255155]
Large-scale pre-trained language models (LLMs) have demonstrated exceptional performance in various natural language processing (NLP) tasks.
However, the massive size of these models poses huge challenges for their deployment in real-world applications.
We introduce a novel compression paradigm called Retrieval-based Knowledge Transfer (RetriKT) which effectively transfers the knowledge of LLMs to extremely small-scale models.
arXiv Detail & Related papers (2023-10-24T07:58:20Z) - Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.