Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture
- URL: http://arxiv.org/abs/2310.12109v1
- Date: Wed, 18 Oct 2023 17:06:22 GMT
- Title: Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture
- Authors: Daniel Y. Fu, Simran Arora, Jessica Grogan, Isys Johnson, Sabri
Eyuboglu, Armin W. Thomas, Benjamin Spector, Michael Poli, Atri Rudra,
Christopher Ré
- Abstract summary: We introduce Monarch Mixer (M2), a new architecture that uses the same sub-quadratic primitive along both sequence length and model dimension.
As a proof of concept, we explore the performance of M2 in three domains: non-causal BERT-style language modeling, ViT-style classification, and causal GPT-style language modeling.
For non-causal BERT-style modeling, M2 matches BERT-base and BERT-large in GLUE quality with up to 27% fewer parameters, and up to 9.1$\times$ higher throughput at sequence length 4K.
- Score: 31.763186154430347
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Machine learning models are increasingly being scaled in both sequence length
and model dimension to reach longer contexts and better performance. However,
existing architectures such as Transformers scale quadratically along both
these axes. We ask: are there performant architectures that can scale
sub-quadratically along sequence length and model dimension? We introduce
Monarch Mixer (M2), a new architecture that uses the same sub-quadratic
primitive along both sequence length and model dimension: Monarch matrices, a
simple class of expressive structured matrices that captures many linear
transforms, achieves high hardware efficiency on GPUs, and scales
sub-quadratically. As a proof of concept, we explore the performance of M2 in
three domains: non-causal BERT-style language modeling, ViT-style image
classification, and causal GPT-style language modeling. For non-causal
BERT-style modeling, M2 matches BERT-base and BERT-large in downstream GLUE
quality with up to 27% fewer parameters, and achieves up to 9.1$\times$ higher
throughput at sequence length 4K. On ImageNet, M2 outperforms ViT-b by 1% in
accuracy, with only half the parameters. Causal GPT-style models introduce a
technical challenge: enforcing causality via masking introduces a quadratic
bottleneck. To alleviate this bottleneck, we develop a novel theoretical view
of Monarch matrices based on multivariate polynomial evaluation and
interpolation, which lets us parameterize M2 to be causal while remaining
sub-quadratic. Using this parameterization, M2 matches GPT-style Transformers
at 360M parameters in pretraining perplexity on The PILE, showing for the first
time that it may be possible to match Transformer quality without attention or
MLPs.
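For intuition, below is a minimal sketch of the kind of GEMM-based Monarch multiply the abstract alludes to: a length-n input is reshaped into a sqrt(n) x sqrt(n) grid, multiplied by a block-diagonal factor, permuted by a fixed reshape-transpose, and multiplied by a second block-diagonal factor, for O(n^1.5) rather than O(n^2) cost. The square-case shapes, the permutation convention, and the function name are illustrative assumptions, not the authors' released implementation.
```python
# Hypothetical sketch of a Monarch-style multiply (square case n = b * b).
# Shapes and permutation convention are assumptions, not the paper's exact code.
import torch


def monarch_multiply(x: torch.Tensor, blocks1: torch.Tensor, blocks2: torch.Tensor) -> torch.Tensor:
    """Multiply x (..., n) by a Monarch-style structured matrix using only
    batched GEMMs plus reshapes/transposes, costing O(n**1.5) instead of O(n**2).

    blocks1 and blocks2 each hold b dense blocks of size b x b (shape (b, b, b)).
    """
    b = blocks1.shape[0]
    assert blocks1.shape == blocks2.shape == (b, b, b)
    assert x.shape[-1] == b * b

    # View the length-n axis as a b x b grid.
    x = x.reshape(*x.shape[:-1], b, b)
    # First block-diagonal factor: block i acts on row i of the grid (a batched GEMM).
    x = torch.einsum("ikj,...ij->...ik", blocks1, x)
    # Fixed reshape-transpose permutation between the two factors.
    x = x.transpose(-2, -1)
    # Second block-diagonal factor (another batched GEMM).
    x = torch.einsum("ikj,...ij->...ik", blocks2, x)
    # Undo the permutation and flatten back to length n.
    return x.transpose(-2, -1).reshape(*x.shape[:-2], b * b)


if __name__ == "__main__":
    b = 64                                   # n = b * b = 4096, i.e. sequence length 4K
    x = torch.randn(8, b * b)                # batch of 8 length-4096 inputs
    blocks1 = torch.randn(b, b, b)
    blocks2 = torch.randn(b, b, b)
    print(monarch_multiply(x, blocks1, blocks2).shape)  # torch.Size([8, 4096])
```
In M2, factors of this kind would be applied along both the sequence axis and the model-dimension axis; the sketch above only shows the single-axis primitive.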
Related papers
- Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Models [92.36510016591782]
We present a method that distills a pretrained Transformer architecture into alternative architectures such as state space models (SSMs).
Our method, called MOHAWK, is able to distill a Mamba-2 variant based on the Phi-1.5 architecture using only 3B tokens and a hybrid version (Hybrid Phi-Mamba) using 5B tokens.
Despite using less than 1% of the training data typically used to train models from scratch, Phi-Mamba boasts substantially stronger performance compared to all past open-source non-Transformer models.
arXiv Detail & Related papers (2024-08-19T17:48:11Z)
- Mixture-of-Modules: Reinventing Transformers as Dynamic Assemblies of Modules [96.21649779507831]
We propose a novel architecture dubbed mixture-of-modules (MoM).
MoM is motivated by an intuition that any layer, regardless of its position, can be used to compute a token.
We show that MoM provides not only a unified framework for Transformers but also a flexible and learnable approach for reducing redundancy.
arXiv Detail & Related papers (2024-07-09T08:50:18Z)
- MambaMixer: Efficient Selective State Space Models with Dual Token and Channel Selection [5.37935922811333]
MambaMixer is a new architecture with data-dependent weights that uses a dual selection mechanism across tokens and channels.
As a proof of concept, we design Vision MambaMixer (ViM2) and Time Series MambaMixer (TSM2) architectures based on the MambaMixer block.
arXiv Detail & Related papers (2024-03-29T00:05:13Z)
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces [31.985243136674146]
Foundation models are almost universally based on the Transformer architecture and its core attention module.
We identify that a key weakness of such models is their inability to perform content-based reasoning.
We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba).
As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics.
arXiv Detail & Related papers (2023-12-01T18:01:34Z)
- MatFormer: Nested Transformer for Elastic Inference [94.1789252941718]
MatFormer is a nested Transformer architecture designed to offer elasticity in a variety of deployment constraints.
We show that a 2.6B decoder-only MatFormer language model (MatLM) allows us to extract smaller models spanning from 1.5B to 2.6B.
We also observe that smaller encoders extracted from a universal MatFormer-based ViT (MatViT) encoder preserve the metric-space structure for adaptive large-scale retrieval.
arXiv Detail & Related papers (2023-10-11T17:57:14Z)
- Bilaterally Slimmable Transformer for Elastic and Efficient Visual Question Answering [75.86788916930377]
A bilaterally slimmable Transformer (BST) can be integrated into arbitrary Transformer-based VQA models.
One slimmed MCAN-BST submodel achieves comparable accuracy on VQA-v2.
The smallest MCAN-BST submodel has 9M parameters and 0.16G FLOPs during inference.
arXiv Detail & Related papers (2022-03-24T02:26:04Z)
- Ensemble Transformer for Efficient and Accurate Ranking Tasks: an Application to Question Answering Systems [99.13795374152997]
We propose a neural network designed to distill an ensemble of large transformers into a single smaller model.
An MHS model consists of two components: a stack of transformer layers that is used to encode inputs, and a set of ranking heads.
Unlike traditional distillation techniques, our approach leverages individual models in ensemble as teachers in a way that preserves the diversity of the ensemble members.
arXiv Detail & Related papers (2022-01-15T06:21:01Z)
- Long-Short Transformer: Efficient Transformers for Language and Vision [97.2850205384295]
Long-Short Transformer (Transformer-LS) is an efficient self-attention mechanism for modeling long sequences with linear complexity for both language and vision tasks.
It aggregates a novel long-range attention with dynamic projection to model distant correlations and a short-term attention to capture fine-grained local correlations.
Our method outperforms the state-of-the-art models on multiple tasks in language and vision domains, including the Long Range Arena benchmark, autoregressive language modeling, and ImageNet classification.
arXiv Detail & Related papers (2021-07-05T18:00:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.