Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models
- URL: http://arxiv.org/abs/2501.12370v2
- Date: Sat, 25 Jan 2025 02:41:24 GMT
- Title: Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models
- Authors: Samira Abnar, Harshay Shah, Dan Busbridge, Alaaeldin Mohamed Elnouby Ali, Josh Susskind, Vimal Thilak,
- Abstract summary: sparse Mixture-of-Experts (MoEs) allow scaling the number of parameters without proportionally increasing the FLOPs per example.
We investigate how varying the sparsity level, i.e., the fraction of inactive parameters, impacts model's performance during pretraining and downstream few-shot evaluation.
- Score: 10.517704202614091
- License:
- Abstract: Scaling the capacity of language models has consistently proven to be a reliable approach for improving performance and unlocking new capabilities. Capacity can be primarily defined by two dimensions: the number of model parameters and the compute per example. While scaling typically involves increasing both, the precise interplay between these factors and their combined contribution to overall capacity remains not fully understood. We explore this relationship in the context of sparse Mixture-of-Experts (MoEs), which allow scaling the number of parameters without proportionally increasing the FLOPs per example. We investigate how varying the sparsity level, i.e., the fraction of inactive parameters, impacts the model's performance during pretraining and downstream few-shot evaluation. We find that under different constraints (e.g., parameter size and total training compute), there is an optimal level of sparsity that improves both training efficiency and model performance. These results provide a better understanding of the impact of sparsity in scaling laws for MoEs and complement existing work in this area, offering insights for designing more efficient architectures.
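To make the parameters-vs-FLOPs trade-off described in the abstract concrete, here is a minimal Python sketch; the layer dimensions, the top-k routing value, and the 2-FLOPs-per-multiply-accumulate approximation are illustrative assumptions, not values from the paper. It shows how adding experts grows total parameters while FLOPs per token, set only by the number of active experts, stay flat.

```python
# Minimal sketch (assumed dimensions, not from the paper): how MoE sparsity
# decouples total parameter count from FLOPs per token in one feed-forward block.

def moe_ffn_stats(d_model, d_ff, num_experts, top_k):
    """Total vs. active parameters and approximate FLOPs for one MoE FFN block."""
    params_per_expert = 2 * d_model * d_ff          # up- and down-projection weights
    total_params = num_experts * params_per_expert  # grows with the expert count
    active_params = top_k * params_per_expert       # only routed experts run per token
    flops_per_token = 2 * active_params             # ~2 FLOPs per multiply-accumulate
    sparsity = 1 - active_params / total_params     # fraction of inactive parameters
    return total_params, active_params, flops_per_token, sparsity

# Adding experts at fixed top_k scales parameters while FLOPs per token stay flat.
for num_experts in (8, 32, 128):
    total, active, flops, sparsity = moe_ffn_stats(
        d_model=1024, d_ff=4096, num_experts=num_experts, top_k=2)
    print(f"experts={num_experts:4d}  total={total / 1e6:8.1f}M  "
          f"active={active / 1e6:6.1f}M  FLOPs/token={flops / 1e6:6.1f}M  "
          f"sparsity={sparsity:.3f}")
```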
Related papers
- Efficient Source-Free Time-Series Adaptation via Parameter Subspace Disentanglement [0.7558576228782637]
We propose a framework for efficient Source-Free Domain Adaptation (SFDA)
Our approach introduces an improved paradigm for source-model preparation and target-side adaptation.
We demonstrate that our framework is compatible with various SFDA methods and achieves significant computational efficiency.
arXiv Detail & Related papers (2024-10-03T02:12:03Z) - Observational Scaling Laws and the Predictability of Language Model Performance [51.2336010244645]
We propose an observational approach that bypasses model training and instead builds scaling laws from 100 publicly available models.
We show that several emergent phenomena follow a smooth, sigmoidal behavior and are predictable from small models.
We show how to predict the impact of post-training interventions like Chain-of-Thought and Self-Consistency as language model capabilities continue to improve.
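As a hedged illustration of the sigmoidal behavior described above, the sketch below fits a logistic curve to made-up (log-compute, accuracy) pairs with SciPy and extrapolates to a larger compute budget; the data points, parameterization, and extrapolation target are invented for this example and are not taken from the paper.

```python
# Illustrative only: fit a sigmoid to made-up (log-compute, accuracy) points,
# mimicking the smooth, predictable emergence behavior described above.
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(x, lower, upper, midpoint, slope):
    return lower + (upper - lower) / (1.0 + np.exp(-slope * (x - midpoint)))

log_compute = np.array([18.0, 19.0, 20.0, 21.0, 22.0, 23.0])   # hypothetical log10 FLOPs
accuracy    = np.array([0.26, 0.28, 0.35, 0.55, 0.72, 0.79])   # hypothetical benchmark scores

params, _ = curve_fit(sigmoid, log_compute, accuracy,
                      p0=[0.25, 0.85, 21.0, 1.0], maxfev=10_000)
print("fitted (lower, upper, midpoint, slope):", np.round(params, 3))
print("extrapolated accuracy at 10^24 FLOPs:", round(sigmoid(24.0, *params), 3))
```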
arXiv Detail & Related papers (2024-05-17T17:49:44Z) - Mixtures of Experts Unlock Parameter Scaling for Deep RL [54.26191237981469]
In this paper, we demonstrate that incorporating Mixture-of-Expert (MoE) modules into value-based networks results in more parameter-scalable models.
This work thus provides strong empirical evidence towards developing scaling laws for reinforcement learning.
arXiv Detail & Related papers (2024-02-13T17:18:56Z) - Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks [5.536630285985836]
We introduce parameter-efficient sparsity crafting (PESC)
PESC crafts dense models into sparse models using the mixture-of-experts (MoE) architecture.
Our best sparse model outperforms other sparse and dense models and exhibits superior general capabilities compared to GPT-3.5.
arXiv Detail & Related papers (2024-01-05T09:58:09Z) - Can pruning make Large Language Models more efficient? [0.0]
This paper investigates the application of weight pruning as an optimization strategy for Transformer architectures.
Our findings suggest that significant reductions in model size are attainable without considerable compromise on performance.
This work seeks to bridge the gap between model efficiency and performance, paving the way for more scalable and environmentally responsible deep learning applications.
arXiv Detail & Related papers (2023-10-06T20:28:32Z) - Scaling Laws for Sparsely-Connected Foundation Models [70.41266138010657]
We explore the impact of parameter sparsity on the scaling behavior of Transformers trained on massive datasets.
We identify the first scaling law describing the relationship between weight sparsity, number of non-zero parameters, and amount of training data.
arXiv Detail & Related papers (2023-09-15T16:29:27Z) - Understanding Parameter Sharing in Transformers [53.75988363281843]
Previous work on Transformers has focused on sharing parameters in different layers, which can improve the performance of models with limited parameters by increasing model depth.
We show that the success of this approach can be largely attributed to better convergence, with only a small part due to the increased model complexity.
Experiments on 8 machine translation tasks show that our model achieves competitive performance with only half the model complexity of parameter sharing models.
arXiv Detail & Related papers (2023-06-15T10:48:59Z) - A Generic Performance Model for Deep Learning in a Distributed Environment [0.7829352305480285]
We propose a generic performance model for an application in a distributed environment, with a general expression for the application execution time.
We have evaluated the proposed model on three deep learning frameworks, including MXNet and PyTorch.
arXiv Detail & Related papers (2023-05-19T13:30:34Z) - Scaling Pre-trained Language Models to Deeper via Parameter-efficient Architecture [68.13678918660872]
We design a more capable parameter-sharing architecture based on the matrix product operator (MPO).
MPO decomposition can reorganize and factorize the information of a parameter matrix into two parts.
Our architecture shares the central tensor across all layers for reducing the model size.
arXiv Detail & Related papers (2023-03-27T02:34:09Z) - MoEfication: Conditional Computation of Transformer Models for Efficient Inference [66.56994436947441]
Transformer-based pre-trained language models achieve superior performance on most NLP tasks thanks to their large parameter capacity, but this capacity also incurs a huge computation cost.
We explore accelerating large-model inference through conditional computation based on the sparse-activation phenomenon.
We propose to transform a large model into its mixture-of-experts (MoE) version of equal model size, namely MoEfication (a minimal illustrative sketch follows after this entry).
arXiv Detail & Related papers (2021-10-05T02:14:38Z)
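To make the MoEfication idea above concrete, here is a minimal NumPy sketch of the general approach: partition a dense FFN's hidden neurons into expert groups of equal total size, so the parameter count is unchanged, and run only one group per token. The shapes, the contiguous neuron split, and the oracle-style top-1 routing are simplifying assumptions for illustration; they are not the authors' neuron clustering or learned-router implementation.

```python
# Minimal NumPy sketch (illustrative, not the MoEfication authors' code):
# split a dense FFN's hidden units into expert groups of equal size so the
# total parameter count is unchanged, then run only one selected group per token.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, num_experts = 64, 256, 4
W_in = rng.standard_normal((d_model, d_ff)) * 0.02    # dense FFN up-projection
W_out = rng.standard_normal((d_ff, d_model)) * 0.02   # dense FFN down-projection

# "MoEfy": partition the d_ff hidden neurons into num_experts contiguous groups.
# (The paper clusters co-activated neurons; a contiguous split is a stand-in here.)
group = d_ff // num_experts
experts_in = [W_in[:, e * group:(e + 1) * group] for e in range(num_experts)]
experts_out = [W_out[e * group:(e + 1) * group, :] for e in range(num_experts)]

def moefied_ffn(x):
    """Route each token to the single expert whose neuron group responds most."""
    # Oracle routing via the full activation, for illustration only; the paper
    # instead trains a cheap router so the full FFN never has to run.
    hidden = np.maximum(x @ W_in, 0.0)
    scores = hidden.reshape(len(x), num_experts, group).sum(-1)
    chosen = scores.argmax(-1)                               # top-1 expert per token
    out = np.zeros((len(x), d_model))
    for i, e in enumerate(chosen):
        h = np.maximum(x[i] @ experts_in[e], 0.0)            # only one expert's slice computed
        out[i] = h @ experts_out[e]
    return out

tokens = rng.standard_normal((8, d_model))
print(moefied_ffn(tokens).shape)  # (8, 64): same output shape, ~1/4 of the FFN FLOPs per token
```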