MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts
- URL: http://arxiv.org/abs/2407.21770v3
- Date: Mon, 12 Aug 2024 16:20:37 GMT
- Title: MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts
- Authors: Xi Victoria Lin, Akshat Shrivastava, Liang Luo, Srinivasan Iyer, Mike Lewis, Gargi Ghosh, Luke Zettlemoyer, Armen Aghajanyan
- Abstract summary: MoMa is a modality-aware mixture-of-experts architecture for pre-training mixed-modal, early-fusion language models.
Under a 1-trillion-token training budget, the MoMa 1.4B model, featuring 4 text experts and 4 image experts, achieves 3.7x overall FLOPs savings (2.6x for text, 5.2x for image) compared to a compute-equivalent dense baseline.
- Score: 95.26323548734692
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce MoMa, a novel modality-aware mixture-of-experts (MoE) architecture designed for pre-training mixed-modal, early-fusion language models. MoMa processes images and text in arbitrary sequences by dividing expert modules into modality-specific groups. These groups exclusively process designated tokens while employing learned routing within each group to maintain semantically informed adaptivity. Our empirical results reveal substantial pre-training efficiency gains through this modality-specific parameter allocation. Under a 1-trillion-token training budget, the MoMa 1.4B model, featuring 4 text experts and 4 image experts, achieves impressive FLOPs savings: 3.7x overall, with 2.6x for text and 5.2x for image processing compared to a compute-equivalent dense baseline, measured by pre-training loss. This outperforms the standard expert-choice MoE with 8 mixed-modal experts, which achieves 3x overall FLOPs savings (3x for text, 2.8x for image). Combining MoMa with mixture-of-depths (MoD) further improves pre-training FLOPs savings to 4.2x overall (text: 3.4x, image: 5.3x), although this combination hurts performance in causal inference due to increased sensitivity to router accuracy. These results demonstrate MoMa's potential to significantly advance the efficiency of mixed-modal, early-fusion language model pre-training, paving the way for more resource-efficient and capable multimodal AI systems.
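To make the architecture concrete, below is a minimal PyTorch-style sketch of modality-aware expert groups: the experts are partitioned into a text group and an image group, and each token is routed only among the experts of its own modality. The class names, dimensions, and the simplified token-choice top-1 routing are illustrative assumptions; the paper uses expert-choice routing within each group, which is not reproduced here.
```python
# Minimal sketch of modality-aware expert groups (not the authors' code).
# Assumptions: tokens carry a modality id (0 = text, 1 = image); routing is
# simplified to token-choice top-1 inside each group.
import torch
import torch.nn as nn


class ExpertFFN(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                                 nn.Linear(d_hidden, d_model))

    def forward(self, x):
        return self.net(x)


class ModalityAwareMoE(nn.Module):
    """Experts are split into modality-specific groups; each token is routed
    only among the experts of its own modality."""

    def __init__(self, d_model=1024, d_hidden=4096, n_text_experts=4, n_image_experts=4):
        super().__init__()
        self.groups = nn.ModuleList([
            nn.ModuleList([ExpertFFN(d_model, d_hidden) for _ in range(n_text_experts)]),   # modality 0: text
            nn.ModuleList([ExpertFFN(d_model, d_hidden) for _ in range(n_image_experts)]),  # modality 1: image
        ])
        self.routers = nn.ModuleList([nn.Linear(d_model, n_text_experts),
                                      nn.Linear(d_model, n_image_experts)])

    def forward(self, x, modality):
        # x: (n_tokens, d_model); modality: (n_tokens,) long tensor with values in {0, 1}
        out = torch.zeros_like(x)
        for m, (experts, router) in enumerate(zip(self.groups, self.routers)):
            idx = (modality == m).nonzero(as_tuple=True)[0]   # tokens of this modality
            if idx.numel() == 0:
                continue
            tokens = x[idx]
            gates = router(tokens).softmax(dim=-1)            # (n_m, experts in group)
            top_gate, top_expert = gates.max(dim=-1)          # simplified top-1 routing
            for e, expert in enumerate(experts):
                sel = (top_expert == e).nonzero(as_tuple=True)[0]
                if sel.numel() > 0:
                    out[idx[sel]] = top_gate[sel, None] * expert(tokens[sel])
        return out


# Example usage:
# tokens = torch.randn(16, 1024); modality = torch.randint(0, 2, (16,))
# y = ModalityAwareMoE()(tokens, modality)
```
The point of the grouping is that a dense baseline would apply one shared FFN to every token, whereas here text and image tokens only ever activate experts specialized to their modality, which is where the modality-specific FLOPs savings come from.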
Related papers
- Mixture-of-Mamba: Enhancing Multi-Modal State-Space Models with Modality-Aware Sparsity [56.0251572416922]
State Space Models (SSMs) have emerged as efficient alternatives to Transformers for sequential modeling.
We propose a novel SSM architecture that introduces modality-aware sparsity through modality-specific parameterization of the Mamba block.
We evaluate Mixture-of-Mamba across three multi-modal pretraining settings.
arXiv Detail & Related papers (2025-01-27T18:35:05Z)
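The Mixture-of-Mamba entry above centers on modality-specific parameterization of the Mamba block. As a hedged illustration of that general idea only, the sketch below selects per-modality projection weights by token modality; the module name and the use of a plain linear projection in place of the full Mamba block are assumptions, not the paper's design.
```python
# Hedged sketch: per-modality parameterization of a sequence-mixing block.
# Mixture-of-Mamba specializes the Mamba block's projections per modality;
# here a single linear input projection stands in for those projections.
import torch
import torch.nn as nn


class ModalitySpecificProjection(nn.Module):
    def __init__(self, d_model: int, n_modalities: int = 2):
        super().__init__()
        # One projection per modality (e.g., 0 = text, 1 = image).
        self.proj = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_modalities)])

    def forward(self, x, modality):
        # x: (n_tokens, d_model); modality: (n_tokens,) of modality ids
        out = torch.zeros_like(x)
        for m, proj in enumerate(self.proj):
            mask = modality == m
            if mask.any():
                out[mask] = proj(x[mask])
        return out
```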
- FSMoE: A Flexible and Scalable Training System for Sparse Mixture-of-Experts Models [21.96960353910023]
We introduce FSMoE, a flexible training system optimizing task scheduling with three novel techniques.
We conduct extensive experiments with configured MoE layers and real-world MoE models on two GPU clusters.
FSMoE supports four popular types of MoE routing functions and is more efficient than existing implementations.
arXiv Detail & Related papers (2025-01-18T10:14:37Z)
- MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts [63.67734699877724]
MoE++ is a general and heterogeneous MoE framework that integrates both Feed-Forward Network (FFN) and zero-computation experts.
MoE++ achieves better performance while delivering 1.1-2.1x expert forward throughput compared to a vanilla MoE model of the same size.
arXiv Detail & Related papers (2024-10-09T18:01:27Z)
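The MoE++ entry above mixes standard FFN experts with zero-computation experts. The sketch below illustrates one way such a heterogeneous expert pool could look; the specific zero-computation variants (zero, identity, learned constant) and the top-1 routing are assumptions for illustration and may not match the paper's exact design.
```python
# Hedged sketch of mixing FFN experts with zero-computation experts.
# The variants shown (zero, copy, learned constant) are illustrative assumptions.
import torch
import torch.nn as nn


class ZeroExpert(nn.Module):
    def forward(self, x):            # contributes nothing for its tokens
        return torch.zeros_like(x)


class CopyExpert(nn.Module):
    def forward(self, x):            # passes the token through unchanged
        return x


class ConstantExpert(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.v = nn.Parameter(torch.zeros(d_model))

    def forward(self, x):            # replaces the token with a learned constant
        return self.v.expand_as(x)


class HeterogeneousMoE(nn.Module):
    def __init__(self, d_model=1024, d_hidden=4096, n_ffn_experts=4):
        super().__init__()
        def ffn():
            return nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                                 nn.Linear(d_hidden, d_model))
        self.experts = nn.ModuleList([ffn() for _ in range(n_ffn_experts)] +
                                     [ZeroExpert(), CopyExpert(), ConstantExpert(d_model)])
        self.router = nn.Linear(d_model, len(self.experts))

    def forward(self, x):            # x: (n_tokens, d_model)
        gates = self.router(x).softmax(dim=-1)
        top_gate, top_expert = gates.max(dim=-1)     # top-1 routing
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            sel = (top_expert == e).nonzero(as_tuple=True)[0]
            if sel.numel() > 0:
                out[sel] = top_gate[sel, None] * expert(x[sel])
        return out
```
Tokens routed to a zero-computation expert skip the FFN entirely, which is how a framework like this can raise expert forward throughput relative to a vanilla MoE of the same size.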
- MIO: A Foundation Model on Multimodal Tokens [74.85153216521945]
We introduce MIO, a novel foundation model built on multimodal tokens.
MIO is capable of understanding and generating speech, text, images, and videos in an end-to-end, autoregressive manner.
arXiv Detail & Related papers (2024-09-26T09:57:16Z)
- Compact Language Models via Pruning and Knowledge Distillation [61.56557874432008]
Minitron models exhibit up to a 16% improvement in MMLU scores compared to training from scratch.
Deriving 8B and 4B models from an already pretrained 15B model using our approach requires up to 40x fewer training tokens per model compared to training from scratch.
arXiv Detail & Related papers (2024-07-19T21:47:57Z)
- Expert-Token Resonance: Redefining MoE Routing through Affinity-Driven Active Selection [16.539855450082946]
We introduce a novel framework that redefines MoE routing through affinity-driven active selection.
Our theoretical analysis demonstrates that this approach mitigates expert homogenization while enabling substantial capacity boundary reduction.
After supervised fine-tuning, it exhibits performance gains of 9.7% to 14.1% across GDAD, C-Eval, and TeleQnA benchmarks.
arXiv Detail & Related papers (2024-05-24T02:50:44Z)
- Toward Inference-optimal Mixture-of-Expert Large Language Models [55.96674056805708]
We study the scaling law of MoE-based large language models (LLMs).
We find that MoEs with a few (4 or 8) experts are the most serving-efficient solution at the same performance, but cost 2.5-3.5x more to train.
We propose to amend the scaling law of MoE by introducing inference efficiency as another metric besides the validation loss.
arXiv Detail & Related papers (2024-04-03T16:33:42Z)
- Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference [3.217776693788795]
We propose a lightweight optimization technique called ExFlow to substantially accelerate the inference of pre-trained MoE models.
By exploiting the inter-layer expert affinity, our solution can be directly applied to pre-trained MoE models without any fine-tuning or accuracy degradation.
Our solution outperforms cutting-edge MoE implementations with 8 to 64 experts, achieving up to a 2.2x improvement in inference throughput.
arXiv Detail & Related papers (2024-01-16T14:16:47Z)
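The ExFlow entry above relies on inter-layer expert affinity: a token routed to a given expert at one layer tends to be routed to correlated experts at the next. The sketch below only shows how such an affinity matrix could be estimated from routing traces; the trace format and function name are assumptions, and ExFlow's actual placement and communication optimizations are not reproduced.
```python
# Hedged sketch: estimating inter-layer expert affinity from routing traces.
# trace[l][t] = expert id chosen for token t at layer l (token-choice, top-1 assumed).
import numpy as np


def inter_layer_affinity(trace: np.ndarray, n_experts: int) -> np.ndarray:
    """trace: (n_layers, n_tokens) int array. Returns (n_layers - 1, E, E) where
    entry [l, i, j] is the fraction of tokens routed to expert i at layer l that
    are routed to expert j at layer l + 1."""
    n_layers, _ = trace.shape
    aff = np.zeros((n_layers - 1, n_experts, n_experts))
    for l in range(n_layers - 1):
        np.add.at(aff[l], (trace[l], trace[l + 1]), 1.0)   # co-occurrence counts
        row_sums = aff[l].sum(axis=1, keepdims=True)
        aff[l] = np.divide(aff[l], row_sums, out=np.zeros_like(aff[l]), where=row_sums > 0)
    return aff


# A high aff[l, i, j] suggests co-locating expert i (layer l) and expert j
# (layer l + 1) on the same device to reduce all-to-all traffic.
```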