MoE Jetpack: From Dense Checkpoints to Adaptive Mixture of Experts for Vision Tasks
- URL: http://arxiv.org/abs/2406.04801v1
- Date: Fri, 7 Jun 2024 10:05:42 GMT
- Title: MoE Jetpack: From Dense Checkpoints to Adaptive Mixture of Experts for Vision Tasks
- Authors: Xingkui Zhu, Yiran Guan, Dingkang Liang, Yuchao Chen, Yuliang Liu, Xiang Bai
- Abstract summary: Training MoE models from scratch requires extensive data and computational resources.
We introduce MoE Jetpack, an effective method for fine-tuning dense checkpoints into MoE models.
Our experiments on vision tasks demonstrate that MoE Jetpack significantly improves convergence speed and accuracy.
- Score: 58.075367597860044
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The sparsely activated mixture of experts (MoE) model presents a promising alternative to traditional densely activated (dense) models, enhancing both quality and computational efficiency. However, training MoE models from scratch demands extensive data and computational resources. Moreover, public repositories like timm mainly provide pre-trained dense checkpoints, lacking similar resources for MoE models, hindering their adoption. To bridge this gap, we introduce MoE Jetpack, an effective method for fine-tuning dense checkpoints into MoE models. MoE Jetpack incorporates two key techniques: (1) checkpoint recycling, which repurposes dense checkpoints as initial weights for MoE models, thereby accelerating convergence, enhancing accuracy, and alleviating the computational burden of pre-training; (2) hyperspherical adaptive MoE (SpheroMoE) layer, which optimizes the MoE architecture for better integration of dense checkpoints, enhancing fine-tuning performance. Our experiments on vision tasks demonstrate that MoE Jetpack significantly improves convergence speed and accuracy when fine-tuning dense checkpoints into MoE models. Our code will be publicly available at https://github.com/Adlith/MoE-Jetpack.
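The checkpoint-recycling step lends itself to a short sketch. The snippet below is only an illustration under stated assumptions, not the MoE Jetpack implementation: a PyTorch ViT-style FFN (Linear-GELU-Linear) is copied into every expert of a new MoE layer, and only the router starts from random weights. `SimpleMoE` and `recycle_dense_ffn` are hypothetical names, and the plain top-1 routing here stands in for the SpheroMoE layer.
```python
# Minimal sketch of checkpoint recycling: every expert of a fresh MoE layer is
# initialized from a pre-trained dense FFN, and only the router starts from
# scratch. Plain top-1 routing stands in for the paper's SpheroMoE layer.
import torch
import torch.nn as nn


class SimpleMoE(nn.Module):
    """A toy top-1 routed MoE layer whose experts share the dense FFN's shape."""

    def __init__(self, dim: int, hidden_dim: int, num_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, dim); send every token to its highest-scoring expert.
        scores = self.router(x).softmax(dim=-1)
        top1 = scores.argmax(dim=-1)
        out = torch.zeros_like(x)
        for idx, expert in enumerate(self.experts):
            mask = top1 == idx
            if mask.any():
                out[mask] = expert(x[mask]) * scores[mask, idx].unsqueeze(-1)
        return out


def recycle_dense_ffn(dense_ffn: nn.Sequential, num_experts: int = 4) -> SimpleMoE:
    """Copy a pre-trained dense FFN (Linear-GELU-Linear) into every expert."""
    dim, hidden_dim = dense_ffn[0].in_features, dense_ffn[0].out_features
    moe = SimpleMoE(dim, hidden_dim, num_experts)
    for expert in moe.experts:
        expert.load_state_dict(dense_ffn.state_dict())
    return moe


# Usage: pretend this FFN came from a timm ViT checkpoint.
dense_ffn = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))
moe_layer = recycle_dense_ffn(dense_ffn, num_experts=4)
print(moe_layer(torch.randn(16, 768)).shape)  # torch.Size([16, 768])
```
In this sketch all experts start identical, so any specialization has to emerge during fine-tuning; the SpheroMoE layer is what the paper adds on top of this kind of initialization.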
Related papers
- MoC-System: Efficient Fault Tolerance for Sparse Mixture-of-Experts Model Training [4.4345088842995395]
We propose the Mixture-of-Checkpoint System (MoC-System) to orchestrate the vast array of checkpoint shards produced in distributed training systems.
MoC-System features a novel Partial Experts Checkpointing (PEC) mechanism, an algorithm-system co-design that strategically saves a selected subset of experts.
We build MoC-System upon the Megatron-DeepSpeed framework, achieving up to a 98.9% reduction in overhead for each checkpointing process.
arXiv Detail & Related papers (2024-08-08T08:40:15Z)
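The PEC mechanism described above can be pictured with a small sketch. It assumes a checkpoint state dict whose expert parameters are named `<layer>.experts.<id>.<param>` (an assumption about naming, not MoC-System's actual layout): each save keeps every shared weight plus a rotating subset of experts, so consecutive saves together cover all experts.
```python
# Toy illustration of partial expert checkpointing: each save keeps all shared
# (non-expert) weights plus a rotating subset of experts, so per-step checkpoint
# volume shrinks while consecutive saves together cover every expert.
import torch


def partial_expert_checkpoint(model_state: dict, step: int,
                              num_experts: int, experts_per_save: int) -> dict:
    """Assumes expert parameters are named like '<layer>.experts.<id>.<param>'."""
    start = (step * experts_per_save) % num_experts
    keep = {(start + i) % num_experts for i in range(experts_per_save)}
    shard = {}
    for name, tensor in model_state.items():
        if ".experts." in name:
            expert_id = int(name.split(".experts.")[1].split(".")[0])
            if expert_id not in keep:
                continue  # this expert is scheduled for a later save
        shard[name] = tensor
    return shard


# Usage with a toy state dict: 8 experts, 2 of them saved per checkpoint step.
state = {f"layer0.experts.{i}.weight": torch.zeros(4, 4) for i in range(8)}
state["layer0.router.weight"] = torch.zeros(8, 4)
shard = partial_expert_checkpoint(state, step=0, num_experts=8, experts_per_save=2)
print(sorted(shard))  # router weight plus experts 0 and 1
```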
- LaDiMo: Layer-wise Distillation Inspired MoEfier [1.6199400106794555]
We propose a novel algorithm, LaDiMo, which efficiently converts a Transformer-based non-MoE model into a MoE model with minimal additional training cost.
We demonstrate the effectiveness of our method by converting the LLaMA2-7B model to a MoE model using only 100K tokens.
arXiv Detail & Related papers (2024-08-08T07:37:26Z)
- Examining Post-Training Quantization for Mixture-of-Experts: A Benchmark [46.72960840801211]
The Mixture-of-Experts (MoE) approach offers a promising way to scale Large Language Models (LLMs).
MoE suffers from significant memory overheads, necessitating model compression techniques.
This paper explores several MoE structure-aware quantization schemes, ranging from coarse to fine granularity: from the whole MoE block down to individual linear weights.
arXiv Detail & Related papers (2024-06-12T12:44:48Z)
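To make the granularity axis concrete, here is a hedged sketch that uses plain symmetric int8 quantization as a stand-in for the methods the benchmark actually evaluates; `quantize_expert` and its coarse/fine switch are illustrative names, not the paper's code.
```python
# Hedged sketch of quantization granularity for MoE weights: the scale can be
# shared across a whole expert (coarse) or computed per linear weight (fine).
# Plain symmetric int8 quantization stands in for the benchmarked methods.
import torch


def quantize_int8(w: torch.Tensor, scale: float):
    """Symmetric int8 quantization with a given scale."""
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale


def quantize_expert(weights: dict, per_weight: bool) -> dict:
    """Quantize one expert's linear weights at coarse or fine granularity."""
    if per_weight:
        # Fine granularity: an independent scale for every linear weight.
        return {k: quantize_int8(w, w.abs().max().item() / 127) for k, w in weights.items()}
    # Coarse granularity: a single scale shared by the whole expert.
    shared_scale = max(w.abs().max().item() for w in weights.values()) / 127
    return {k: quantize_int8(w, shared_scale) for k, w in weights.items()}


expert = {"fc1.weight": torch.randn(3072, 768), "fc2.weight": torch.randn(768, 3072)}
coarse = quantize_expert(expert, per_weight=False)
fine = quantize_expert(expert, per_weight=True)
print(coarse["fc1.weight"][1], fine["fc1.weight"][1])  # shared vs per-weight scale
```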
- A Provably Effective Method for Pruning Experts in Fine-tuned Sparse Mixture-of-Experts [49.394145046409044]
This paper provides the first provably efficient technique for pruning experts in fine-tuned MoE models.
We theoretically prove that prioritizing the pruning of experts with a smaller change of the router's l2 norm from the pre-trained model guarantees the preservation of test accuracy.
Although our theoretical analysis is centered on binary classification tasks with a simplified MoE architecture, our expert pruning method is verified on large vision MoE models.
arXiv Detail & Related papers (2024-05-26T17:52:58Z)
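One reading of that pruning criterion, not the authors' implementation, is sketched below: score each expert by the l2 norm of the change in its router row between the pre-trained and fine-tuned checkpoints, and prune the experts with the smallest change first.
```python
# One reading of the criterion above: score each expert by the l2 norm of the
# change in its router row between the pre-trained and fine-tuned checkpoints,
# then prune the experts whose rows changed the least.
import torch


def experts_to_prune(router_pretrained: torch.Tensor,
                     router_finetuned: torch.Tensor,
                     num_to_prune: int) -> list:
    """Router weights have shape (num_experts, dim), one row per expert."""
    change = (router_finetuned - router_pretrained).norm(dim=1)
    order = torch.argsort(change)  # smallest router change first
    return order[:num_to_prune].tolist()


# Usage with random weights: 8 experts, prune the 4 least-changed ones.
pre = torch.randn(8, 768)
post = pre + 0.01 * torch.randn(8, 768)
print(experts_to_prune(pre, post, num_to_prune=4))
```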
- Revisiting MoE and Dense Speed-Accuracy Comparisons for LLM Training [45.97480866595295]
Mixture-of-Experts (MoE) models gain performance by increasing model capacity while keeping computation cost constant.
We adopt a 3D sharding method that keeps the dense-to-MoE step time increase within a healthy range.
arXiv Detail & Related papers (2024-05-23T21:00:53Z)
- MoE-FFD: Mixture of Experts for Generalized and Parameter-Efficient Face Forgery Detection [54.545054873239295]
Deepfakes have recently raised significant trust issues and security concerns among the public.
ViT-based methods take advantage of the expressivity of transformers, achieving superior detection performance.
This work introduces Mixture-of-Experts modules for Face Forgery Detection (MoE-FFD), a generalized yet parameter-efficient ViT-based approach.
arXiv Detail & Related papers (2024-04-12T13:02:08Z)
- Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models [62.4691912312317]
Mixture-of-Experts (MoE) language models can reduce computational costs by 2-4× compared to dense models without sacrificing performance.
We propose a hybrid dense training and sparse inference framework for MoE models (DS-MoE) which achieves strong computation and parameter efficiency.
arXiv Detail & Related papers (2024-04-08T14:39:49Z)
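The dense-training / sparse-inference split can be written as a single toy layer: in training mode every expert processes every token and the outputs are mixed by the router, while in eval mode only the top-k experts chosen per token are executed. The class below is an illustration of that idea, not the DS-MoE architecture.
```python
# Toy layer illustrating the split: in training mode every expert processes
# every token (mixed by the router), while in eval mode only the top-k experts
# per token are executed. Not the DS-MoE architecture, just the idea.
import torch
import torch.nn as nn


class DenseTrainSparseInferMoE(nn.Module):
    def __init__(self, dim: int, hidden: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gates = self.router(x).softmax(dim=-1)  # (tokens, num_experts)
        if self.training:
            # Dense training: run all experts and mix their outputs by the gate.
            outs = torch.stack([expert(x) for expert in self.experts], dim=1)
            return (gates.unsqueeze(-1) * outs).sum(dim=1)
        # Sparse inference: evaluate only the top-k experts chosen per token.
        top_vals, top_idx = gates.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for expert_id in top_idx[:, slot].unique():
                mask = top_idx[:, slot] == expert_id
                out[mask] += top_vals[mask, slot].unsqueeze(-1) * self.experts[int(expert_id)](x[mask])
        return out


layer = DenseTrainSparseInferMoE(dim=64, hidden=256)
layer.eval()  # switch to sparse, top-k execution
print(layer(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```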
- TA-MoE: Topology-Aware Large Scale Mixture-of-Expert Training [18.68993910156101]
We propose TA-MoE, a topology-aware routing strategy for large-scale MoE training.
We show that TA-MoE can substantially outperform its counterparts on various hardware and model configurations.
arXiv Detail & Related papers (2023-02-20T11:18:24Z)
- When Liebig's Barrel Meets Facial Landmark Detection: A Practical Model [87.25037167380522]
We propose a model that is accurate, robust, efficient, generalizable, and end-to-end trainable.
To achieve better accuracy, we propose two lightweight modules.
DQInit dynamically initializes the decoder queries from the inputs, enabling the model to match the accuracy of models with multiple decoder layers.
QAMem is designed to enhance the discriminative ability of queries on low-resolution feature maps by assigning separate memory values to each query rather than a shared one.
arXiv Detail & Related papers (2021-05-27T13:51:42Z)
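Of the two modules, DQInit is the easier one to sketch. The module below is a guess at the general shape of such a head, with pooling and a single linear projection as assumptions: decoder queries are derived from the current image's encoder features rather than learned as input-independent embeddings. QAMem's per-query memory values are not shown.
```python
# A guess at the shape of a DQInit-style head: pool the encoder features of the
# current image and project them into the decoder's query embeddings, instead
# of using learned, input-independent queries. Shapes and names are illustrative.
import torch
import torch.nn as nn


class DynamicQueryInit(nn.Module):
    def __init__(self, feat_dim: int, num_queries: int, query_dim: int):
        super().__init__()
        self.proj = nn.Linear(feat_dim, num_queries * query_dim)
        self.num_queries = num_queries
        self.query_dim = query_dim

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, feat_dim, H, W) -> global average pool -> per-image queries
        pooled = feats.mean(dim=(2, 3))               # (batch, feat_dim)
        queries = self.proj(pooled)                   # (batch, num_queries * query_dim)
        return queries.view(-1, self.num_queries, self.query_dim)


# Usage: 68 queries, e.g. one per facial landmark.
dqinit = DynamicQueryInit(feat_dim=256, num_queries=68, query_dim=128)
print(dqinit(torch.randn(2, 256, 8, 8)).shape)  # torch.Size([2, 68, 128])
```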