Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models
- URL: http://arxiv.org/abs/2411.04996v1
- Date: Thu, 07 Nov 2024 18:59:06 GMT
- Title: Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models
- Authors: Weixin Liang, Lili Yu, Liang Luo, Srinivasan Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen-tau Yih, Luke Zettlemoyer, Xi Victoria Lin
- Abstract summary: Mixture-of-Transformers (MoT) is a sparse multi-modal transformer architecture.
MoT decouples non-embedding parameters of the model by modality.
We evaluate MoT across multiple settings and model scales.
- Score: 111.97026994761254
- License:
- Abstract: The development of large language models (LLMs) has expanded to multi-modal systems capable of processing text, images, and speech within a unified framework. Training these models demands significantly larger datasets and computational resources compared to text-only LLMs. To address the scaling challenges, we introduce Mixture-of-Transformers (MoT), a sparse multi-modal transformer architecture that significantly reduces pretraining computational costs. MoT decouples non-embedding parameters of the model by modality -- including feed-forward networks, attention matrices, and layer normalization -- enabling modality-specific processing with global self-attention over the full input sequence. We evaluate MoT across multiple settings and model scales. In the Chameleon 7B setting (autoregressive text-and-image generation), MoT matches the dense baseline's performance using only 55.8% of the FLOPs. When extended to include speech, MoT reaches speech performance comparable to the dense baseline with only 37.2% of the FLOPs. In the Transfusion setting, where text and image are trained with different objectives, a 7B MoT model matches the image modality performance of the dense baseline with one third of the FLOPs, and a 760M MoT model outperforms a 1.4B dense baseline across key image generation metrics. System profiling further highlights MoT's practical benefits, achieving dense baseline image quality in 47.2% of the wall-clock time and text quality in 75.6% of the wall-clock time (measured on AWS p4de.24xlarge instances with NVIDIA A100 GPUs).
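To make the parameter decoupling concrete, below is a minimal PyTorch sketch of a modality-untied transformer block in the spirit of the abstract: each modality gets its own layer norms, attention projections, and feed-forward network, while a single self-attention operation still runs over the full mixed-modality sequence. This is an illustrative reconstruction, not the authors' released implementation; names such as ModalityUntiedBlock, _route, modality_ids, and num_modalities are assumptions introduced here.

```python
# Illustrative sketch only (not the released MoT code): modality-specific
# non-embedding parameters with global self-attention over the whole sequence.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityUntiedBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, num_modalities: int):
        super().__init__()
        self.d_model, self.n_heads = d_model, n_heads
        # One copy of every non-embedding parameter group per modality.
        self.norm1 = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(num_modalities))
        self.norm2 = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(num_modalities))
        self.qkv = nn.ModuleList(nn.Linear(d_model, 3 * d_model) for _ in range(num_modalities))
        self.proj = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(num_modalities))
        self.ffn = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_modalities))

    def _route(self, x, modality_ids, modules, out_dim):
        # Send each token through the module belonging to its own modality.
        out = x.new_zeros(*x.shape[:-1], out_dim)
        for m, module in enumerate(modules):
            mask = modality_ids == m          # (batch, seq) boolean mask
            if mask.any():
                out[mask] = module(x[mask])
        return out

    def forward(self, x, modality_ids):
        # x: (batch, seq, d_model); modality_ids: (batch, seq) int modality index per token.
        B, T, D = x.shape
        h = self._route(x, modality_ids, self.norm1, D)
        q, k, v = self._route(h, modality_ids, self.qkv, 3 * D).chunk(3, dim=-1)
        heads = lambda t: t.reshape(B, T, self.n_heads, D // self.n_heads).transpose(1, 2)
        # Global (causal) self-attention: tokens of all modalities attend to each
        # other even though their projections are modality-specific.
        attn = F.scaled_dot_product_attention(heads(q), heads(k), heads(v), is_causal=True)
        attn = attn.transpose(1, 2).reshape(B, T, D)
        x = x + self._route(attn, modality_ids, self.proj, D)
        h = self._route(x, modality_ids, self.norm2, D)
        return x + self._route(h, modality_ids, self.ffn, D)
```

With num_modalities=2 this mirrors the text-and-image (Chameleon-style) setting described above; adding speech would simply add a third parameter group, while the attention pattern over the full sequence stays global.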
Related papers
- LMFusion: Adapting Pretrained Language Models for Multimodal Generation [81.78257799283777]
We present LMFusion, a framework for empowering pretrained text-only large language models (LLMs) with multimodal generative capabilities.
Compared to methods that pretrain multimodal generative models from scratch, our experiments demonstrate that LMFusion improves image understanding by 20% and image generation by 3.6% using only 50% of the FLOPs.
arXiv Detail & Related papers (2024-12-19T18:56:24Z)
- Efficient Scaling of Diffusion Transformers for Text-to-Image Generation [105.7324182618969]
We study the scaling properties of various Diffusion Transformers (DiTs) for text-to-image generation by performing extensive and rigorous ablations.
We find that U-ViT, a pure self-attention based DiT model, provides a simpler design and scales more effectively than cross-attention based DiT variants.
arXiv Detail & Related papers (2024-12-16T22:59:26Z)
- LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via a Hybrid Architecture [18.459825048813336]
LongLLaVA is the first hybrid MLLM, achieving a better balance between efficiency and effectiveness.
It can process nearly a thousand images on a single A100 80GB GPU, showing promising application prospects for a wide range of tasks.
arXiv Detail & Related papers (2024-09-04T17:25:21Z)
- VimTS: A Unified Video and Image Text Spotter for Enhancing the Cross-domain Generalization [115.64739269488965]
VimTS enhances the generalization ability of the model by achieving better synergy among different tasks.
We propose a synthetic video text dataset (VTD-368k) by leveraging the Content Deformation Fields (CoDeF) algorithm.
For video-level cross-domain adaptation, our method even surpasses the previous end-to-end video spotting method on ICDAR2015 video and DSText v2.
arXiv Detail & Related papers (2024-04-30T15:49:03Z)
- On the Scalability of Diffusion-based Text-to-Image Generation [97.64837704129005]
We study scaling properties of diffusion based text-to-image (T2I) models.
For model scaling, we find that the location and amount of cross-attention distinguish the performance of existing UNet designs.
On the data scaling side, we show that the quality and diversity of the training set matter more than dataset size alone.
arXiv Detail & Related papers (2024-04-03T17:34:28Z)
- RingMo-lite: A Remote Sensing Multi-task Lightweight Network with CNN-Transformer Hybrid Framework [15.273362355253779]
This paper proposes RingMo-lite, an RS multi-task lightweight network with a CNN-Transformer hybrid framework to optimize the interpretation process.
The proposed RingMo-lite reduces parameters by over 60% across various RS image interpretation tasks, with average accuracy dropping by less than 2% in most scenes, and achieves SOTA performance compared to models of similar size.
arXiv Detail & Related papers (2023-09-16T14:15:59Z)
- Fusion-S2iGan: An Efficient and Effective Single-Stage Framework for Speech-to-Image Generation [8.26410341981427]
The goal of a speech-to-image transform is to produce a photo-realistic picture directly from a speech signal.
We propose a single-stage framework called Fusion-S2iGan to yield perceptually plausible and semantically consistent image samples.
arXiv Detail & Related papers (2023-05-17T11:12:07Z)
- MoMo: A shared encoder Model for text, image and multi-Modal representations [4.812718493682455]
We propose a self-supervised shared encoder model that achieves strong results on several visual, language and multimodal benchmarks.
We use a single transformer with all the encoder layers processing both the text and the image modalities.
arXiv Detail & Related papers (2023-04-11T22:26:10Z)
- DIME-FM: DIstilling Multimodal and Efficient Foundation Models [72.1900621000677]
Large Vision-Language Foundation Models (VLFM) are trained on large-scale datasets of image-caption pairs.
We introduce a new distillation mechanism (DIME-FM) that allows us to transfer the knowledge contained in large VLFMs to smaller, customized foundation models.
The resulting model "Distill-ViT-B/32" rivals the CLIP-ViT-B/32 model pre-trained on its private WiT dataset.
arXiv Detail & Related papers (2023-03-31T17:47:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.