MatFormer: Nested Transformer for Elastic Inference
- URL: http://arxiv.org/abs/2310.07707v1
- Date: Wed, 11 Oct 2023 17:57:14 GMT
- Title: MatFormer: Nested Transformer for Elastic Inference
- Authors: Devvrit, Sneha Kudugunta, Aditya Kusupati, Tim Dettmers, Kaifeng Chen,
Inderjit Dhillon, Yulia Tsvetkov, Hannaneh Hajishirzi, Sham Kakade, Ali
Farhadi, Prateek Jain
- Abstract summary: MatFormer is a nested Transformer architecture designed to offer elasticity in a variety of deployment constraints.
We show that a 2.6B decoder-only MatFormer language model (MatLM) allows us to extract smaller models spanning from 1.5B to 2.6B.
We also observe that smaller encoders extracted from a universal MatFormer-based ViT (MatViT) encoder preserve the metric-space structure for adaptive large-scale retrieval.
- Score: 94.1789252941718
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer models are deployed in a wide range of settings, from
multi-accelerator clusters to standalone mobile phones. The diverse inference
constraints in these scenarios require practitioners to train foundation
models such as PaLM 2, Llama, & ViTs as a series of models of varying sizes.
Due to significant training costs, only a select few model sizes are trained
and supported, limiting more fine-grained control over relevant tradeoffs,
including latency, cost, and accuracy. This work introduces MatFormer, a nested
Transformer architecture designed to offer elasticity in a variety of
deployment constraints. Each Feed Forward Network (FFN) block of a MatFormer
model is jointly optimized with a few nested smaller FFN blocks. This training
procedure allows for the Mix'n'Match of model granularities across layers --
i.e., a trained universal MatFormer model enables extraction of hundreds of
accurate smaller models, which were never explicitly optimized. We empirically
demonstrate MatFormer's effectiveness across different model classes (decoders
& encoders), modalities (language & vision), and scales (up to 2.6B
parameters). We find that a 2.6B decoder-only MatFormer language model (MatLM)
allows us to extract smaller models spanning from 1.5B to 2.6B, each exhibiting
comparable validation loss and one-shot downstream evaluations to their
independently trained counterparts. Furthermore, we observe that smaller
encoders extracted from a universal MatFormer-based ViT (MatViT) encoder
preserve the metric-space structure for adaptive large-scale retrieval.
Finally, we showcase that speculative decoding with the accurate and consistent
submodels extracted from MatFormer can further reduce inference latency.
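As a rough illustration of the nested FFN design described in the abstract, the PyTorch sketch below slices the first g hidden units of a standard two-layer FFN to form each smaller granularity and averages the loss over all granularities during training. The names (NestedFFN, joint_ffn_loss), the specific granularity values, and the GELU activation are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NestedFFN(nn.Module):
    """Two-layer FFN whose hidden width can be sliced to nested sub-widths.

    Minimal sketch of the nested-FFN idea from the abstract; names,
    granularities, and activation are assumptions, not the paper's code.
    """

    def __init__(self, d_model=512, d_ff=2048, granularities=(256, 512, 1024, 2048)):
        super().__init__()
        assert granularities[-1] == d_ff
        self.granularities = granularities
        self.w_in = nn.Linear(d_model, d_ff)
        self.w_out = nn.Linear(d_ff, d_model)

    def forward(self, x, g=None):
        # g picks how many hidden units to use; the first g units are
        # shared by every larger granularity (Matryoshka-style nesting).
        g = g if g is not None else self.granularities[-1]
        h = F.gelu(F.linear(x, self.w_in.weight[:g], self.w_in.bias[:g]))
        return F.linear(h, self.w_out.weight[:, :g], self.w_out.bias)

def joint_ffn_loss(model, x, target, loss_fn):
    """Jointly optimize all nested granularities by averaging their losses."""
    losses = [loss_fn(model(x, g=g), target) for g in model.granularities]
    return sum(losses) / len(losses)

# Mix'n'Match (sketch): once trained, each layer can run at its own
# granularity, so many submodels can be sliced out without retraining.
layers = [NestedFFN() for _ in range(4)]
per_layer_g = [512, 1024, 256, 2048]   # one of many possible extracted submodels
x = torch.randn(8, 512)
for layer, g in zip(layers, per_layer_g):
    x = x + layer(x, g=g)              # residual connection, as in a Transformer block
```

Because each layer can be sliced independently at inference time, a single trained model yields many Mix'n'Match submodels without further optimization, which is the elasticity property the abstract highlights.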
Related papers
- MatMamba: A Matryoshka State Space Model [24.85566171753877]
MatMamba is a state space model which combines Matryoshka-style learning with Mamba2.
MatMamba allows for efficient and adaptive deployment across various model sizes.
We train language and image models at a variety of parameter sizes from 35M to 1.4B.
arXiv Detail & Related papers (2024-10-09T09:41:34Z)
- Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Models [92.36510016591782]
We present a method that distills a pretrained Transformer architecture into alternative architectures such as state space models (SSMs).
Our method, called MOHAWK, is able to distill a Mamba-2 variant based on the Phi-1.5 architecture (Phi-Mamba) using only 3B tokens, and a hybrid version (Hybrid Phi-Mamba) using 5B tokens.
Despite using less than 1% of the training data typically used to train models from scratch, Phi-Mamba boasts substantially stronger performance compared to all past open-source non-Transformer models.
arXiv Detail & Related papers (2024-08-19T17:48:11Z)
- XMoE: Sparse Models with Fine-grained and Adaptive Expert Selection [30.687511115573038]
XMoE is a novel MoE design that enhances both the efficacy and efficiency of sparse MoE models.
XMoE can improve model quality while decreasing the computation load at MoE layers by over 50% without sacrificing performance.
arXiv Detail & Related papers (2024-02-27T08:18:02Z) - Quantized Transformer Language Model Implementations on Edge Devices [1.2979415757860164]
Large-scale transformer-based models like the Bidirectional Representations from Transformers (BERT) are widely used for Natural Language Processing (NLP) applications.
These models are initially pre-trained with a large corpus with millions of parameters and then fine-tuned for a downstream NLP task.
One of the major limitations of these large-scale models is that they cannot be deployed on resource-constrained devices due to their large model size and increased inference latency.
arXiv Detail & Related papers (2023-10-06T01:59:19Z) - Fourier Transformer: Fast Long Range Modeling by Removing Sequence
Redundancy with FFT Operator [24.690247474891958]
Fourier Transformer is able to significantly reduce computational costs while retaining the ability to inherit from various large pretrained models.
Our model achieves state-of-the-art performance among all transformer-based models on the long-range modeling benchmark LRA.
For generative seq-to-seq tasks including CNN/DailyMail and ELI5, our model outperforms the standard BART by inheriting the BART weights.
arXiv Detail & Related papers (2023-05-24T12:33:06Z) - eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models and to augment Language Models with perception.
Existing approaches for adapting pretrained models to vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning (a rough sketch of this setup appears after this list).
arXiv Detail & Related papers (2023-03-20T19:20:34Z) - Bilaterally Slimmable Transformer for Elastic and Efficient Visual
Question Answering [75.86788916930377]
Bilaterally slimmable Transformer (BST) is a framework that can be integrated into arbitrary Transformer-based VQA models.
One slimmed MCAN-BST submodel achieves accuracy comparable to the full model on VQA-v2.
The smallest MCAN-BST submodel has 9M parameters and 0.16G FLOPs during inference.
arXiv Detail & Related papers (2022-03-24T02:26:04Z) - Ensemble Distillation for Robust Model Fusion in Federated Learning [72.61259487233214]
Federated Learning (FL) is a machine learning setting where many devices collaboratively train a machine learning model.
In most current training schemes, the central model is refined by averaging the parameters of the server model with the updated parameters from the client side.
We propose ensemble distillation for model fusion, i.e. training the central classifier through unlabeled data on the outputs of the models from the clients.
arXiv Detail & Related papers (2020-06-12T14:49:47Z) - Model Fusion via Optimal Transport [64.13185244219353]
We present a layer-wise model fusion algorithm for neural networks.
We show that this can successfully yield "one-shot" knowledge transfer between neural networks trained on heterogeneous non-i.i.d. data.
arXiv Detail & Related papers (2019-10-12T22:07:15Z)
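The eP-ALM entry above describes a very lightweight adaptation recipe: freeze both backbones and train only one linear projection and one prepended token. The sketch below shows that general pattern under stated assumptions; PerceptualPrefix is a hypothetical wrapper name, the vision encoder is assumed to return a (batch, tokens, feature) tensor, and the language model is assumed to accept embedding inputs directly. It is not the authors' implementation.

```python
import torch
import torch.nn as nn

class PerceptualPrefix(nn.Module):
    """Hypothetical eP-ALM-style wrapper: freeze both backbones and train only
    a linear projection plus one soft token that prefix the text embeddings."""

    def __init__(self, vision_encoder, language_model, vis_dim, lm_dim):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.language_model = language_model
        # Freeze both backbones, i.e. more than 99% of all parameters.
        for p in self.vision_encoder.parameters():
            p.requires_grad = False
        for p in self.language_model.parameters():
            p.requires_grad = False
        # The only trainable pieces: one linear projection and one soft token.
        self.proj = nn.Linear(vis_dim, lm_dim)
        self.soft_token = nn.Parameter(torch.zeros(1, 1, lm_dim))

    def forward(self, image, text_embeddings):
        # Assumption: the vision encoder returns (batch, tokens, vis_dim).
        vis = self.proj(self.vision_encoder(image))
        batch = text_embeddings.size(0)
        prefix = torch.cat([self.soft_token.expand(batch, -1, -1), vis], dim=1)
        # Assumption: the language model accepts embedding inputs directly.
        return self.language_model(torch.cat([prefix, text_embeddings], dim=1))
```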