Exploiting Activation Sparsity with Dense to Dynamic-k Mixture-of-Experts Conversion
- URL: http://arxiv.org/abs/2310.04361v4
- Date: Tue, 12 Nov 2024 13:35:37 GMT
- Title: Exploiting Activation Sparsity with Dense to Dynamic-k Mixture-of-Experts Conversion
- Authors: Filip Szatkowski, Bartosz Wójcik, Mikołaj Piórczyński, Simone Scardapane,
- Abstract summary: Transformer models can face practical limitations due to their high computational requirements.
Such models exhibit significant activation sparsity, which can be leveraged to reduce the inference cost by converting parts of the network into equivalent Mixture-of-Experts (MoE) layers.
We demonstrate that the efficiency of the conversion can be significantly enhanced by a proper regularization of the activation sparsity of the base model.
- Score: 4.716845031095804
- License:
- Abstract: Transformer models can face practical limitations due to their high computational requirements. At the same time, such models exhibit significant activation sparsity, which can be leveraged to reduce the inference cost by converting parts of the network into equivalent Mixture-of-Experts (MoE) layers. Despite the crucial role played by activation sparsity, its impact on this process remains unexplored. We demonstrate that the efficiency of the conversion can be significantly enhanced by a proper regularization of the activation sparsity of the base model. Moreover, motivated by the high variance of the number of activated neurons for different inputs, we introduce a more effective dynamic-$k$ expert selection rule that adjusts the number of executed experts on a per-token basis. To achieve further savings, we extend this approach to multi-head attention projections. Finally, we develop an efficient implementation that translates these computational savings into actual wall-clock speedup. The proposed method, Dense to Dynamic-$k$ Mixture-of-Experts (D2DMoE), outperforms existing approaches on common NLP and vision tasks, reducing inference cost by up to 60% without significantly impacting performance.
Related papers
- DSMoE: Matrix-Partitioned Experts with Dynamic Routing for Computation-Efficient Dense LLMs [70.91804882618243]
This paper proposes DSMoE, a novel approach that achieves sparsification by partitioning pre-trained FFN layers into computational blocks.
We implement adaptive expert routing using sigmoid activation and straight-through estimators, enabling tokens to flexibly access different aspects of model knowledge.
Experiments on LLaMA models demonstrate that under equivalent computational constraints, DSMoE achieves superior performance compared to existing pruning and MoE approaches.
arXiv Detail & Related papers (2025-02-18T02:37:26Z) - S$^2$-MAD: Breaking the Token Barrier to Enhance Multi-Agent Debate Efficiency [5.195584743414427]
Multi-agent Debate (MAD) has emerged as a viable approach for enhancing the reasoning capabilities of large language models (LLMs)
We introduce a novel sparsification strategy designed to reduce token costs within MAD.
This approach minimizes ineffective exchanges of information and unproductive discussions among agents, thereby enhancing the overall efficiency of the debate process.
arXiv Detail & Related papers (2025-02-07T09:49:56Z) - Mixture of Hidden-Dimensions Transformer [50.40325486463241]
We study hidden dimension sparsity and observe that trained Transformers utilize only a small fraction of token dimensions.
We propose MoHD (Mixture of Hidden Dimensions), a sparse conditional activation architecture.
It achieves 1.7% higher performance with 50% fewer activation parameters and 3.7% higher performance with a 3x parameter expansion at constant activation cost.
arXiv Detail & Related papers (2024-12-07T13:15:22Z) - CHESS: Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification [7.8430836312711465]
This paper reformulates the activation sparsification problem to explicitly capture the relationship between activation sparsity and model performance.
We propose CHESS, a general activation sparsification approach via CHannel-wise thrEsholding and Selective Sparsification.
Experimental results demonstrate that the proposed CHESS achieves lower performance degradation over eight downstream tasks while activating fewer parameters than existing methods.
arXiv Detail & Related papers (2024-09-02T16:41:44Z) - FactorLLM: Factorizing Knowledge via Mixture of Experts for Large Language Models [50.331708897857574]
We introduce FactorLLM, a novel approach that decomposes well-trained dense FFNs into sparse sub-networks without requiring any further modifications.
FactorLLM achieves comparable performance to the source model securing up to 85% model performance while obtaining over a 30% increase in inference speed.
arXiv Detail & Related papers (2024-08-15T16:45:16Z) - Dyn-Adapter: Towards Disentangled Representation for Efficient Visual Recognition [22.615830919860777]
This paper presents an efficient visual recognition paradigm, called Dynamic Adapter (Dyn-Adapter)
We devise a dynamic architecture with balanced early heads for multi-level feature extraction, along with adaptive training strategy.
We reduce FLOPs during inference by 50%, while maintaining or even yielding higher recognition accuracy.
arXiv Detail & Related papers (2024-07-19T13:33:38Z) - LD-Pruner: Efficient Pruning of Latent Diffusion Models using Task-Agnostic Insights [2.8461446020965435]
We introduce LD-Pruner, a novel performance-preserving structured pruning method for compressing Latent Diffusion Models.
We demonstrate the effectiveness of our approach on three different tasks: text-to-image (T2I) generation, Unconditional Image Generation (UIG) and Unconditional Audio Generation (UAG)
arXiv Detail & Related papers (2024-04-18T06:35:37Z) - Learning Better with Less: Effective Augmentation for Sample-Efficient
Visual Reinforcement Learning [57.83232242068982]
Data augmentation (DA) is a crucial technique for enhancing the sample efficiency of visual reinforcement learning (RL) algorithms.
It remains unclear which attributes of DA account for its effectiveness in achieving sample-efficient visual RL.
This work conducts comprehensive experiments to assess the impact of DA's attributes on its efficacy.
arXiv Detail & Related papers (2023-05-25T15:46:20Z) - Towards Practical Lipreading with Distilled and Efficient Models [57.41253104365274]
Lipreading has witnessed a lot of progress due to the resurgence of neural networks.
Recent works have placed emphasis on aspects such as improving performance by finding the optimal architecture or improving generalization.
There is still a significant gap between the current methodologies and the requirements for an effective deployment of lipreading in practical scenarios.
We propose a series of innovations that significantly bridge that gap: first, we raise the state-of-the-art performance by a wide margin on LRW and LRW-1000 to 88.5% and 46.6%, respectively using self-distillation.
arXiv Detail & Related papers (2020-07-13T16:56:27Z) - Momentum Improves Normalized SGD [51.27183254738711]
We show that adding momentum provably removes the need for large batch sizes on objectives.
We show that our method is effective when employed on popular large scale tasks such as ResNet-50 and BERT pretraining.
arXiv Detail & Related papers (2020-02-09T07:00:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.