MatFormer: Nested Transformer for Elastic Inference
- URL: http://arxiv.org/abs/2310.07707v1
- Date: Wed, 11 Oct 2023 17:57:14 GMT
- Title: MatFormer: Nested Transformer for Elastic Inference
- Authors: Devvrit, Sneha Kudugunta, Aditya Kusupati, Tim Dettmers, Kaifeng Chen,
Inderjit Dhillon, Yulia Tsvetkov, Hannaneh Hajishirzi, Sham Kakade, Ali
Farhadi, Prateek Jain
- Abstract summary: MatFormer is a nested Transformer architecture designed to offer elasticity in a variety of deployment constraints.
We show that a 2.6B decoder-only MatFormer language model (MatLM) allows us to extract smaller models spanning from 1.5B to 2.6B.
We also observe that smaller encoders extracted from a universal MatFormer-based ViT (MatViT) encoder preserve the metric-space structure for adaptive large-scale retrieval.
- Score: 94.1789252941718
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer models are deployed in a wide range of settings, from
multi-accelerator clusters to standalone mobile phones. The diverse inference
constraints in these scenarios require practitioners to train foundation
models such as PaLM 2, Llama, & ViTs as a series of models of varying sizes.
Due to significant training costs, only a select few model sizes are trained
and supported, limiting more fine-grained control over relevant tradeoffs,
including latency, cost, and accuracy. This work introduces MatFormer, a nested
Transformer architecture designed to offer elasticity in a variety of
deployment constraints. Each Feed Forward Network (FFN) block of a MatFormer
model is jointly optimized with a few nested smaller FFN blocks. This training
procedure allows for the Mix'n'Match of model granularities across layers --
i.e., a trained universal MatFormer model enables extraction of hundreds of
accurate smaller models, which were never explicitly optimized. We empirically
demonstrate MatFormer's effectiveness across different model classes (decoders
& encoders), modalities (language & vision), and scales (up to 2.6B
parameters). We find that a 2.6B decoder-only MatFormer language model (MatLM)
allows us to extract smaller models spanning from 1.5B to 2.6B, each exhibiting
comparable validation loss and one-shot downstream evaluations to their
independently trained counterparts. Furthermore, we observe that smaller
encoders extracted from a universal MatFormer-based ViT (MatViT) encoder
preserve the metric-space structure for adaptive large-scale retrieval.
Finally, we showcase that speculative decoding with the accurate and consistent
submodels extracted from MatFormer can further reduce inference latency.
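As a rough illustration of the nested FFN design described in the abstract, the PyTorch sketch below slices the first g hidden units of a standard two-layer FFN to form each smaller granularity and averages the loss over all granularities during training. The names (NestedFFN, joint_ffn_loss), the specific granularity values, and the GELU activation are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NestedFFN(nn.Module):
    """Two-layer FFN whose hidden width can be sliced to nested sub-widths.

    Minimal sketch of the nested-FFN idea from the abstract; names,
    granularities, and activation are assumptions, not the paper's code.
    """

    def __init__(self, d_model=512, d_ff=2048, granularities=(256, 512, 1024, 2048)):
        super().__init__()
        assert granularities[-1] == d_ff
        self.granularities = granularities
        self.w_in = nn.Linear(d_model, d_ff)
        self.w_out = nn.Linear(d_ff, d_model)

    def forward(self, x, g=None):
        # g picks how many hidden units to use; the first g units are
        # shared by every larger granularity (Matryoshka-style nesting).
        g = g if g is not None else self.granularities[-1]
        h = F.gelu(F.linear(x, self.w_in.weight[:g], self.w_in.bias[:g]))
        return F.linear(h, self.w_out.weight[:, :g], self.w_out.bias)

def joint_ffn_loss(model, x, target, loss_fn):
    """Jointly optimize all nested granularities by averaging their losses."""
    losses = [loss_fn(model(x, g=g), target) for g in model.granularities]
    return sum(losses) / len(losses)

# Mix'n'Match (sketch): once trained, each layer can run at its own
# granularity, so many submodels can be sliced out without retraining.
layers = [NestedFFN() for _ in range(4)]
per_layer_g = [512, 1024, 256, 2048]   # one of many possible extracted submodels
x = torch.randn(8, 512)
for layer, g in zip(layers, per_layer_g):
    x = x + layer(x, g=g)              # residual connection, as in a Transformer block
```

Because each layer can be sliced independently at inference time, a single trained model yields many Mix'n'Match submodels without further optimization, which is the elasticity property the abstract highlights.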
Related papers
- MatMamba: A Matryoshka State Space Model [24.85566171753877]
MatMamba is a state space model which combines Matryoshka-style learning with Mamba2.
MatMamba allows for efficient and adaptive deployment across various model sizes.
We train language and image models at a variety of parameter sizes from 35M to 1.4B.
arXiv Detail & Related papers (2024-10-09T09:41:34Z)
- Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Models [92.36510016591782]
We present a method that distills a pretrained Transformer architecture into alternative architectures such as state space models (SSMs).
Our method, called MOHAWK, is able to distill a Mamba-2 variant based on the Phi-1.5 architecture (Phi-Mamba) using only 3B tokens, and a hybrid version (Hybrid Phi-Mamba) using 5B tokens.
Despite using less than 1% of the training data typically used to train models from scratch, Phi-Mamba boasts substantially stronger performance compared to all past open-source non-Transformer models.
arXiv Detail & Related papers (2024-08-19T17:48:11Z)
- XMoE: Sparse Models with Fine-grained and Adaptive Expert Selection [30.687511115573038]
XMoE is a novel MoE design that enhances both the efficacy and efficiency of sparse MoE models.
XMoE can improve model quality while decreasing the computation load at MoE layers by over 50% without sacrificing performance.
arXiv Detail & Related papers (2024-02-27T08:18:02Z) - Quantized Transformer Language Model Implementations on Edge Devices [1.2979415757860164]
Large-scale transformer-based models like the Bidirectional Representations from Transformers (BERT) are widely used for Natural Language Processing (NLP) applications.
These models are initially pre-trained with a large corpus with millions of parameters and then fine-tuned for a downstream NLP task.
One of the major limitations of these large-scale models is that they cannot be deployed on resource-constrained devices due to their large model size and increased inference latency.
arXiv Detail & Related papers (2023-10-06T01:59:19Z) - Fourier Transformer: Fast Long Range Modeling by Removing Sequence
Redundancy with FFT Operator [24.690247474891958]
Fourier Transformer is able to significantly reduce computational costs while retaining the ability to inherit from various large pretrained models.
Our model achieves state-of-the-art performance among all transformer-based models on the long-range modeling benchmark LRA.
For generative seq-to-seq tasks including CNN/DailyMail and ELI5, our model outperforms the standard BART by inheriting the BART weights.
arXiv Detail & Related papers (2023-05-24T12:33:06Z) - eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models and to augment Language Models with perception.
Existing approaches for adapting pretrained models to vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning (a rough sketch of this setup appears after this list).
arXiv Detail & Related papers (2023-03-20T19:20:34Z) - Bilaterally Slimmable Transformer for Elastic and Efficient Visual
Question Answering [75.86788916930377]
Bilaterally slimmable Transformer (BST) is a framework that can be integrated into arbitrary Transformer-based VQA models.
One slimmed MCAN-BST submodel achieves accuracy comparable to the full model on VQA-v2.
The smallest MCAN-BST submodel has 9M parameters and 0.16G FLOPs during inference.
arXiv Detail & Related papers (2022-03-24T02:26:04Z) - Ensemble Distillation for Robust Model Fusion in Federated Learning [72.61259487233214]
Federated Learning (FL) is a machine learning setting where many devices collaboratively train a machine learning model.
In most current training schemes, the central model is refined by averaging the parameters of the server model with the updated parameters from the client side.
We propose ensemble distillation for model fusion, i.e. training the central classifier through unlabeled data on the outputs of the models from the clients.
arXiv Detail & Related papers (2020-06-12T14:49:47Z) - Model Fusion via Optimal Transport [64.13185244219353]
We present a layer-wise model fusion algorithm for neural networks.
We show that this can successfully yield "one-shot" knowledge transfer between neural networks trained on heterogeneous non-i.i.d. data.
arXiv Detail & Related papers (2019-10-12T22:07:15Z)
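The eP-ALM entry above describes a very lightweight adaptation recipe: freeze both backbones and train only one linear projection and one prepended token. The sketch below shows that general pattern under stated assumptions; PerceptualPrefix is a hypothetical wrapper name, the vision encoder is assumed to return a (batch, tokens, feature) tensor, and the language model is assumed to accept embedding inputs directly. It is not the authors' implementation.

```python
import torch
import torch.nn as nn

class PerceptualPrefix(nn.Module):
    """Hypothetical eP-ALM-style wrapper: freeze both backbones and train only
    a linear projection plus one soft token that prefix the text embeddings."""

    def __init__(self, vision_encoder, language_model, vis_dim, lm_dim):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.language_model = language_model
        # Freeze both backbones, i.e. more than 99% of all parameters.
        for p in self.vision_encoder.parameters():
            p.requires_grad = False
        for p in self.language_model.parameters():
            p.requires_grad = False
        # The only trainable pieces: one linear projection and one soft token.
        self.proj = nn.Linear(vis_dim, lm_dim)
        self.soft_token = nn.Parameter(torch.zeros(1, 1, lm_dim))

    def forward(self, image, text_embeddings):
        # Assumption: the vision encoder returns (batch, tokens, vis_dim).
        vis = self.proj(self.vision_encoder(image))
        batch = text_embeddings.size(0)
        prefix = torch.cat([self.soft_token.expand(batch, -1, -1), vis], dim=1)
        # Assumption: the language model accepts embedding inputs directly.
        return self.language_model(torch.cat([prefix, text_embeddings], dim=1))
```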