MoE Jetpack: From Dense Checkpoints to Adaptive Mixture of Experts for Vision Tasks
- URL: http://arxiv.org/abs/2406.04801v1
- Date: Fri, 7 Jun 2024 10:05:42 GMT
- Title: MoE Jetpack: From Dense Checkpoints to Adaptive Mixture of Experts for Vision Tasks
- Authors: Xingkui Zhu, Yiran Guan, Dingkang Liang, Yuchao Chen, Yuliang Liu, Xiang Bai
- Abstract summary: Training MoE models from scratch requires extensive data and computational resources.
We introduce MoE Jetpack, an effective method for fine-tuning dense checkpoints into MoE models.
Our experiments on vision tasks demonstrate that MoE Jetpack significantly improves convergence speed and accuracy.
- Score: 58.075367597860044
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The sparsely activated mixture of experts (MoE) model presents a promising alternative to traditional densely activated (dense) models, enhancing both quality and computational efficiency. However, training MoE models from scratch demands extensive data and computational resources. Moreover, public repositories like timm mainly provide pre-trained dense checkpoints, lacking similar resources for MoE models, hindering their adoption. To bridge this gap, we introduce MoE Jetpack, an effective method for fine-tuning dense checkpoints into MoE models. MoE Jetpack incorporates two key techniques: (1) checkpoint recycling, which repurposes dense checkpoints as initial weights for MoE models, thereby accelerating convergence, enhancing accuracy, and alleviating the computational burden of pre-training; (2) hyperspherical adaptive MoE (SpheroMoE) layer, which optimizes the MoE architecture for better integration of dense checkpoints, enhancing fine-tuning performance. Our experiments on vision tasks demonstrate that MoE Jetpack significantly improves convergence speed and accuracy when fine-tuning dense checkpoints into MoE models. Our code will be publicly available at https://github.com/Adlith/MoE-Jetpack.
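To make the first technique concrete, below is a minimal sketch of the checkpoint-recycling idea: a pre-trained dense MLP block is copied into every expert of a sparse MoE layer before fine-tuning. The names (DenseMLP, MoELayer, recycle_dense_into_moe) and the top-1 router are illustrative assumptions, not the authors' released implementation, and the SpheroMoE layer is not sketched here.

```python
# Minimal sketch of checkpoint recycling: weights of a pre-trained dense MLP
# block are copied into every expert of a sparse MoE layer before fine-tuning.
# Class and function names are illustrative, not the authors' implementation.
import torch
import torch.nn as nn


class DenseMLP(nn.Module):
    """ViT-style feed-forward block standing in for a dense checkpoint."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))


class MoELayer(nn.Module):
    """Sparse MoE layer with a simple top-1 router."""
    def __init__(self, dim: int, hidden: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([DenseMLP(dim, hidden) for _ in range(num_experts)])

    def forward(self, x):                         # x: (tokens, dim)
        gates = self.router(x).softmax(dim=-1)
        top_w, top_idx = gates.max(dim=-1)        # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                out[mask] = top_w[mask, None] * expert(x[mask])
        return out


def recycle_dense_into_moe(dense: DenseMLP, moe: MoELayer) -> None:
    """Use the dense MLP weights as the starting point for every expert."""
    for expert in moe.experts:
        expert.load_state_dict(dense.state_dict())


if __name__ == "__main__":
    dense = DenseMLP(dim=192, hidden=768)               # stands in for a timm checkpoint
    moe = MoELayer(dim=192, hidden=768, num_experts=4)
    recycle_dense_into_moe(dense, moe)
    print(moe(torch.randn(16, 192)).shape)              # torch.Size([16, 192])
```

In this simplified version every expert starts from the same dense weights and is differentiated only by fine-tuning; the paper's checkpoint recycling explores more refined ways to derive expert initializations from the dense checkpoint.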
Related papers
- Attention Is All You Need For Mixture-of-Depths Routing [5.419910566904439]
We introduce A-MoD, a novel attention-based routing mechanism for Mixture-of-Depths models.
A-MoD allows for more efficient training as it introduces no additional trainable parameters.
It can increase the performance of the MoD model.
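As a rough, hedged sketch of how routing can be driven by attention rather than a learned router, the snippet below scores tokens by the mean attention they receive and sends only the top fraction through the block, leaving the rest unchanged. The scoring rule and capacity factor are assumptions for illustration, not necessarily A-MoD's exact formulation.

```python
# Rough sketch of attention-based Mixture-of-Depths routing: per-token importance
# is read off an existing attention map instead of a learned router, so routing
# adds no trainable parameters. The scoring rule (mean attention received per
# token) is an assumption for illustration only.
import torch


def route_with_attention(x, attn, block, capacity=0.5):
    """x: (batch, tokens, dim); attn: (batch, heads, tokens, tokens) from a previous layer."""
    b, t, d = x.shape
    k = max(1, int(capacity * t))
    # Mean attention each token receives, averaged over heads and query positions.
    importance = attn.mean(dim=1).mean(dim=1)            # (batch, tokens)
    top_idx = importance.topk(k, dim=-1).indices          # tokens that get the full block
    out = x.clone()                                        # unselected tokens skip the block
    for i in range(b):
        sel = top_idx[i]
        out[i, sel] = block(x[i, sel].unsqueeze(0)).squeeze(0)
    return out


if __name__ == "__main__":
    block = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.GELU(), torch.nn.Linear(64, 64))
    x = torch.randn(2, 16, 64)
    attn = torch.rand(2, 4, 16, 16).softmax(dim=-1)
    print(route_with_attention(x, attn, block).shape)      # torch.Size([2, 16, 64])
```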
arXiv Detail & Related papers (2024-12-30T11:25:54Z)
- EMOv2: Pushing 5M Vision Model Frontier [92.21687467702972]
We set a new frontier for 5M-magnitude lightweight models on various downstream tasks.
Our work rethinks the lightweight infrastructure of the efficient Inverted Residual Block (IRB) and practical components in the Transformer.
Considering the imperceptible latency for mobile users when downloading models over 4G/5G bandwidth, we investigate the performance upper limit of lightweight models at the 5M-parameter magnitude.
arXiv Detail & Related papers (2024-12-09T17:12:22Z)
- MoC-System: Efficient Fault Tolerance for Sparse Mixture-of-Experts Model Training [4.4345088842995395]
We propose the Mixture-of-Checkpoint System (MoC-System) to orchestrate the vast array of checkpoint shards produced in distributed training systems.
MoC-System features a novel Partial Experts Checkpointing (PEC) mechanism, an algorithm-system co-design that strategically saves a selected subset of experts.
We build MoC-System upon the Megatron-DeepSpeed framework, achieving up to a 98.9% reduction in overhead for each checkpointing process.
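The PEC idea can be pictured with a short, hedged sketch: each call saves all non-expert parameters but only a rotating subset of experts. The naming scheme (".experts.<i>."), the round-robin selection policy, and the function name are assumptions for illustration; MoC-System's actual algorithm-system co-design is more involved.

```python
# Hedged sketch of Partial Experts Checkpointing (PEC): every checkpoint keeps
# all non-expert parameters but only a rotating subset of the experts, shrinking
# checkpoint size. Selection policy and key-naming convention are assumptions.
import torch


def partial_expert_checkpoint(model, step, num_experts, experts_per_save, path):
    start = (step * experts_per_save) % num_experts
    keep = {(start + j) % num_experts for j in range(experts_per_save)}
    state = {}
    for name, tensor in model.state_dict().items():
        if ".experts." in name:
            expert_id = int(name.split(".experts.")[1].split(".")[0])
            if expert_id not in keep:
                continue                      # skip experts not scheduled this step
        state[name] = tensor
    torch.save({"step": step, "saved_experts": sorted(keep), "state": state}, path)
```

Across successive checkpoint steps the subsets rotate, so every expert is eventually saved while each individual checkpoint stays small.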
arXiv Detail & Related papers (2024-08-08T08:40:15Z)
- Examining Post-Training Quantization for Mixture-of-Experts: A Benchmark [46.72960840801211]
The Mixture-of-Experts (MoE) approach offers a promising way to scale Large Language Models (LLMs).
MoE suffers from significant memory overheads, necessitating model compression techniques.
This paper explores several MoE structure-aware quantizations, ranging from coarse to fine granularity, from MoE block to individual linear weight.
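As a hedged illustration of what granularity means here, the snippet below contrasts one absmax int8 scale shared across a whole set of expert weight matrices with one scale per individual expert matrix; the benchmark's actual quantizers and granularities may differ.

```python
# Illustrative sketch of MoE-aware post-training weight quantization at two
# granularities: a single scale shared by all experts in a block (coarse) versus
# one scale per expert weight matrix (fine). Plain absmax int8 quantization is
# used as a stand-in; the benchmark's methods may differ.
import torch


def absmax_quantize(w, bits=8):
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp((w / scale).round(), -qmax - 1, qmax).to(torch.int8)
    return q, scale


def quantize_moe_block(expert_weights, per_expert=True):
    """Fine granularity: per-expert scales. Coarse granularity: one shared scale."""
    if per_expert:
        return [absmax_quantize(w) for w in expert_weights]
    shared_scale = max(w.abs().max() for w in expert_weights).clamp(min=1e-8) / 127
    return [(torch.clamp((w / shared_scale).round(), -128, 127).to(torch.int8), shared_scale)
            for w in expert_weights]
```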
arXiv Detail & Related papers (2024-06-12T12:44:48Z)
- Revisiting MoE and Dense Speed-Accuracy Comparisons for LLM Training [45.97480866595295]
Mixture-of-Experts (MoE) enjoys performance gain by increasing model capacity while keeping computation cost constant.
We adopt a 3D sharding method that keeps the dense-to-MoE step time increase within a healthy range.
arXiv Detail & Related papers (2024-05-23T21:00:53Z)
- MoE-FFD: Mixture of Experts for Generalized and Parameter-Efficient Face Forgery Detection [54.545054873239295]
Deepfakes have recently raised significant trust issues and security concerns among the public.
ViT-based methods take advantage of the expressivity of transformers, achieving superior detection performance.
This work introduces Mixture-of-Experts modules for Face Forgery Detection (MoE-FFD), a generalized yet parameter-efficient ViT-based approach.
arXiv Detail & Related papers (2024-04-12T13:02:08Z)
- Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models [62.4691912312317]
Mixture-of-Experts (MoE) language models can reduce computational costs by 2-4$\times$ compared to dense models without sacrificing performance.
We propose a hybrid dense training and sparse inference framework for MoE models (DS-MoE) which achieves strong computation and parameter efficiency.
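A simplified reading of the dense-training / sparse-inference pattern is sketched below: in training mode every expert processes every token, weighted by the router softmax, while in eval mode only the top-k experts run per token. The class name and top_k value are illustrative, not DS-MoE's exact configuration.

```python
# Sketch of dense training, sparse inference for an MoE layer: all experts are
# active (softmax-weighted) during training; only the top-k experts run at
# inference. Names and top_k=2 are illustrative assumptions.
import torch
import torch.nn as nn


class DenseTrainSparseInferMoE(nn.Module):
    def __init__(self, dim, hidden, num_experts, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                   # x: (tokens, dim)
        gates = self.router(x).softmax(dim=-1)              # (tokens, experts)
        if self.training:                                    # dense: run every expert
            outs = torch.stack([e(x) for e in self.experts], dim=1)  # (tokens, E, dim)
            return (gates.unsqueeze(-1) * outs).sum(dim=1)
        topv, topi = gates.topk(self.top_k, dim=-1)           # sparse: top-k experts only
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = topi[:, slot] == e
                if mask.any():
                    out[mask] += topv[mask, slot, None] * expert(x[mask])
        return out
```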
arXiv Detail & Related papers (2024-04-08T14:39:49Z)
- TA-MoE: Topology-Aware Large Scale Mixture-of-Expert Training [18.68993910156101]
We propose TA-MoE, a topology-aware routing strategy for large-scale MoE training.
We show that TA-MoE can substantially outperform its counterparts on various hardware and model configurations.
arXiv Detail & Related papers (2023-02-20T11:18:24Z)
- When Liebig's Barrel Meets Facial Landmark Detection: A Practical Model [87.25037167380522]
We propose a model that is accurate, robust, efficient, generalizable, and end-to-end trainable.
To achieve better accuracy, we propose two lightweight modules.
DQInit dynamically initializes the decoder queries from the inputs, enabling the model to achieve accuracy comparable to models with multiple decoder layers.
QAMem is designed to enhance the discriminative ability of queries on low-resolution feature maps by assigning separate memory values to each query rather than a shared one.
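A very rough sketch of how the two modules could look, with heavy caveats: DQInit derives the decoder queries from the input features rather than fixed learned embeddings, and QAMem gives each query its own value ("memory") projection of the low-resolution features instead of one shared value tensor. All module names and projections below are guesses for illustration only.

```python
# Very rough sketch of the two modules as described above; not the paper's code.
import torch
import torch.nn as nn


class DQInit(nn.Module):
    """Decoder queries derived from the input feature map (assumed formulation)."""
    def __init__(self, dim, num_queries):
        super().__init__()
        self.proj = nn.Linear(dim, num_queries)

    def forward(self, feats):                               # feats: (batch, tokens, dim)
        weights = self.proj(feats).softmax(dim=1)            # (batch, tokens, queries)
        return torch.einsum("btq,btd->bqd", weights, feats)  # queries built from inputs


class QAMem(nn.Module):
    """Per-query memory values on low-resolution features (assumed formulation)."""
    def __init__(self, dim, num_queries):
        super().__init__()
        self.values = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_queries)])

    def forward(self, queries, feats):                       # queries: (batch, Q, dim)
        attn = torch.einsum("bqd,btd->bqt", queries, feats).softmax(dim=-1)
        mems = torch.stack([v(feats) for v in self.values], dim=1)  # (batch, Q, tokens, dim)
        return torch.einsum("bqt,bqtd->bqd", attn, mems)
```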
arXiv Detail & Related papers (2021-05-27T13:51:42Z)