Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models
- URL: http://arxiv.org/abs/2411.04996v1
- Date: Thu, 07 Nov 2024 18:59:06 GMT
- Title: Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models
- Authors: Weixin Liang, Lili Yu, Liang Luo, Srinivasan Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen-tau Yih, Luke Zettlemoyer, Xi Victoria Lin
- Abstract summary: Mixture-of-Transformers (MoT) is a sparse multi-modal transformer architecture.
MoT decouples non-embedding parameters of the model by modality.
We evaluate MoT across multiple settings and model scales.
- Score: 111.97026994761254
- Abstract: The development of large language models (LLMs) has expanded to multi-modal systems capable of processing text, images, and speech within a unified framework. Training these models demands significantly larger datasets and computational resources compared to text-only LLMs. To address the scaling challenges, we introduce Mixture-of-Transformers (MoT), a sparse multi-modal transformer architecture that significantly reduces pretraining computational costs. MoT decouples non-embedding parameters of the model by modality -- including feed-forward networks, attention matrices, and layer normalization -- enabling modality-specific processing with global self-attention over the full input sequence. We evaluate MoT across multiple settings and model scales. In the Chameleon 7B setting (autoregressive text-and-image generation), MoT matches the dense baseline's performance using only 55.8% of the FLOPs. When extended to include speech, MoT reaches speech performance comparable to the dense baseline with only 37.2% of the FLOPs. In the Transfusion setting, where text and image are trained with different objectives, a 7B MoT model matches the image modality performance of the dense baseline with one third of the FLOPs, and a 760M MoT model outperforms a 1.4B dense baseline across key image generation metrics. System profiling further highlights MoT's practical benefits, achieving dense baseline image quality in 47.2% of the wall-clock time and text quality in 75.6% of the wall-clock time (measured on AWS p4de.24xlarge instances with NVIDIA A100 GPUs).
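The core mechanism described in the abstract lends itself to a short sketch: each transformer block keeps one copy of every non-embedding parameter group (attention projections, feed-forward network, layer norms) per modality and routes each token through its modality's copy, while self-attention itself runs globally over the full mixed-modality sequence. The PyTorch code below is a minimal, hypothetical illustration of that idea under these assumptions; the class and variable names (MoTBlock, modality_ids, the masked routing helper) are ours, not from the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoTBlock(nn.Module):
    """Illustrative Mixture-of-Transformers block (sketch, not the paper's code).

    Every non-embedding parameter group (QKV projections, output projection,
    layer norms, feed-forward network) has a separate copy per modality;
    self-attention is still computed globally over all tokens.
    """

    def __init__(self, d_model: int = 256, n_heads: int = 8, n_modalities: int = 3):
        super().__init__()
        self.d_model, self.n_heads = d_model, n_heads

        def per_modality(factory):
            return nn.ModuleList([factory() for _ in range(n_modalities)])

        self.qkv = per_modality(lambda: nn.Linear(d_model, 3 * d_model))
        self.proj = per_modality(lambda: nn.Linear(d_model, d_model))
        self.norm1 = per_modality(lambda: nn.LayerNorm(d_model))
        self.norm2 = per_modality(lambda: nn.LayerNorm(d_model))
        self.ffn = per_modality(lambda: nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model)))

    def _route(self, experts, x, modality_ids, out_dim):
        # Apply each modality's parameters and keep only that modality's positions.
        # (Simple masked version for clarity; a real implementation would gather/scatter.)
        out = x.new_zeros(*x.shape[:-1], out_dim)
        for m, expert in enumerate(experts):
            mask = (modality_ids == m).unsqueeze(-1)
            out = torch.where(mask, expert(x), out)
        return out

    def forward(self, x, modality_ids):
        # x: (B, T, d_model); modality_ids: (B, T) integer modality of each token.
        B, T, D = x.shape
        h = self._route(self.norm1, x, modality_ids, D)
        q, k, v = self._route(self.qkv, h, modality_ids, 3 * D).chunk(3, dim=-1)

        def heads(t):
            return t.view(B, T, self.n_heads, -1).transpose(1, 2)

        # Global (here: causal) self-attention over the full mixed-modality sequence.
        attn = F.scaled_dot_product_attention(heads(q), heads(k), heads(v), is_causal=True)
        attn = attn.transpose(1, 2).reshape(B, T, D)
        x = x + self._route(self.proj, attn, modality_ids, D)
        h = self._route(self.norm2, x, modality_ids, D)
        x = x + self._route(self.ffn, h, modality_ids, D)
        return x


# Toy usage: 3 modalities (e.g. text=0, image=1, speech=2) interleaved in one sequence.
block = MoTBlock(d_model=256, n_heads=8, n_modalities=3)
tokens = torch.randn(2, 16, 256)
modality_ids = torch.randint(0, 3, (2, 16))
out = block(tokens, modality_ids)  # shape: (2, 16, 256)
```

Relative to a dense baseline, only the token-to-parameter routing changes; because each token still activates one parameter set of the same size, the per-token FLOPs stay roughly those of the dense model while the parameters specialize by modality.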
Related papers
- LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via a Hybrid Architecture [18.459825048813336]
LongLLaVA is the first hybrid MLLM, achieving a better balance between efficiency and effectiveness.
It can process nearly a thousand images on a single A100 80GB GPU, showing promising application prospects for a wide range of tasks.
arXiv Detail & Related papers (2024-09-04T17:25:21Z)
- VimTS: A Unified Video and Image Text Spotter for Enhancing the Cross-domain Generalization [115.64739269488965]
VimTS enhances the generalization ability of the model by achieving better synergy among different tasks.
We propose a synthetic video text dataset (VTD-368k) by leveraging the Content Deformation Fields (CoDeF) algorithm.
For video-level cross-domain adaptation, our method even surpasses the previous end-to-end video spotting method on ICDAR2015 video and DSText v2.
arXiv Detail & Related papers (2024-04-30T15:49:03Z)
- On the Scalability of Diffusion-based Text-to-Image Generation [97.64837704129005]
We study the scaling properties of diffusion-based text-to-image (T2I) models.
For model scaling, we find that the location and amount of cross-attention distinguishes the performance of existing UNet designs.
On the data scaling side, we show that the quality and diversity of the training set matter more than dataset size alone.
arXiv Detail & Related papers (2024-04-03T17:34:28Z)
- MatFormer: Nested Transformer for Elastic Inference [94.1789252941718]
MatFormer is a nested Transformer architecture designed to offer elasticity under a variety of deployment constraints.
We show that a 2.6B decoder-only MatFormer language model (MatLM) allows us to extract smaller models spanning 1.5B to 2.6B parameters.
We also observe that smaller encoders extracted from a universal MatFormer-based ViT (MatViT) encoder preserve the metric-space structure for adaptive large-scale retrieval.
arXiv Detail & Related papers (2023-10-11T17:57:14Z)
- RingMo-lite: A Remote Sensing Multi-task Lightweight Network with CNN-Transformer Hybrid Framework [15.273362355253779]
This paper proposes RingMo-lite, an RS multi-task lightweight network with a CNN-Transformer hybrid framework to optimize the interpretation process.
RingMo-lite reduces parameters by over 60% across various RS image interpretation tasks, with an average accuracy drop of less than 2% in most scenes, and achieves SOTA performance compared to models of similar size.
arXiv Detail & Related papers (2023-09-16T14:15:59Z)
- Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning [115.50132185963139]
CM3Leon is a decoder-only multi-modal language model capable of generating and infilling both text and images.
It is the first multi-modal model trained with a recipe adapted from text-only language models.
CM3Leon achieves state-of-the-art performance in text-to-image generation with 5x less training compute than comparable methods.
arXiv Detail & Related papers (2023-09-05T21:27:27Z)
- Fusion-S2iGan: An Efficient and Effective Single-Stage Framework for Speech-to-Image Generation [8.26410341981427]
The goal of a speech-to-image transform is to produce a photo-realistic picture directly from a speech signal.
We propose a single-stage framework called Fusion-S2iGan to yield perceptually plausible and semantically consistent image samples.
arXiv Detail & Related papers (2023-05-17T11:12:07Z)
- MoMo: A shared encoder Model for text, image and multi-Modal representations [4.812718493682455]
We propose a self-supervised shared encoder model that achieves strong results on several visual, language and multimodal benchmarks.
We use a single transformer with all encoder layers processing both the text and the image modalities.
arXiv Detail & Related papers (2023-04-11T22:26:10Z)
- DIME-FM: DIstilling Multimodal and Efficient Foundation Models [72.1900621000677]
Large Vision-Language Foundation Models (VLFMs) are trained on large-scale datasets of image-caption pairs.
We introduce a new distillation mechanism (DIME-FM) that transfers the knowledge contained in large VLFMs to smaller, customized foundation models.
The resulting model, "Distill-ViT-B/32", rivals the CLIP-ViT-B/32 model pre-trained on its private WiT dataset.
arXiv Detail & Related papers (2023-03-31T17:47:23Z)
- Efficient Context Integration through Factorized Pyramidal Learning for Ultra-Lightweight Semantic Segmentation [1.0499611180329804]
We propose a novel Factorized Pyramidal Learning (FPL) module to aggregate rich contextual information in an efficient manner.
We decompose the spatial pyramid into two stages, enabling simple and efficient feature fusion within the module to solve the notorious checkerboard effect.
Based on the FPL module and FIR unit, we propose an ultra-lightweight real-time network, FPLNet, which achieves a state-of-the-art accuracy-efficiency trade-off.
arXiv Detail & Related papers (2023-02-23T05:34:51Z)