MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers
- URL: http://arxiv.org/abs/2305.07185v2
- Date: Fri, 19 May 2023 21:09:11 GMT
- Title: MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers
- Authors: Lili Yu, Dániel Simig, Colin Flaherty, Armen Aghajanyan, Luke Zettlemoyer, Mike Lewis
- Abstract summary: Megabyte is a multi-scale decoder architecture that enables end-to-end differentiable modeling of sequences of over one million bytes.
Experiments show that Megabyte allows byte-level models to perform competitively with subword models on long context language modeling.
Results establish the viability of tokenization-free autoregressive sequence modeling at scale.
- Score: 78.85346970193518
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Autoregressive transformers are spectacular models for short sequences but
scale poorly to long sequences such as high-resolution images, podcasts, code,
or books. We propose Megabyte, a multi-scale decoder architecture that enables
end-to-end differentiable modeling of sequences of over one million bytes.
Megabyte segments sequences into patches and uses a local submodel within
patches and a global model between patches. This enables sub-quadratic
self-attention, much larger feedforward layers for the same compute, and
improved parallelism during decoding -- unlocking better performance at reduced
cost for both training and generation. Extensive experiments show that Megabyte
allows byte-level models to perform competitively with subword models on long
context language modeling, achieve state-of-the-art density estimation on
ImageNet, and model audio from raw files. Together, these results establish the
viability of tokenization-free autoregressive sequence modeling at scale.
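The patch idea in the abstract can be illustrated with a rough sketch. It is not the authors' implementation: the function names, the zero padding, the patch size of 100, and the simple count of pairwise attention scores (ignoring causal masking and feedforward cost) are all assumptions made for illustration. It only shows why a global model over patches plus a local model within patches needs far fewer attention scores than full attention over every byte.

```python
# Illustrative sketch only: names, padding, patch size, and the cost model
# are assumptions for this example, not details taken from the paper.

def split_into_patches(data: bytes, patch_size: int, pad: int = 0) -> list[list[int]]:
    """Split a byte sequence into fixed-size patches, padding the last one."""
    values = list(data)
    remainder = len(values) % patch_size
    if remainder:
        values += [pad] * (patch_size - remainder)
    return [values[i:i + patch_size] for i in range(0, len(values), patch_size)]


def attention_pairs_flat(seq_len: int) -> int:
    """Pairwise attention scores for full self-attention over every byte."""
    return seq_len * seq_len


def attention_pairs_patched(seq_len: int, patch_size: int) -> int:
    """Global attention across patches plus local attention inside each patch."""
    num_patches = -(-seq_len // patch_size)  # ceiling division
    global_pairs = num_patches * num_patches
    local_pairs = num_patches * patch_size * patch_size
    return global_pairs + local_pairs


patches = split_into_patches(b"hello world", patch_size=4)
flat = attention_pairs_flat(1_000_000)
patched = attention_pairs_patched(1_000_000, patch_size=100)
print(len(patches), flat // patched)
```

Under these assumptions, a one-million-byte sequence with patches of 100 bytes needs 2 * 10^8 pairwise scores instead of 10^12. The patch size that minimizes this count depends on the sequence length, so 100 is only an example value.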
Related papers
- Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis [62.06970466554273]
We present Meissonic, which elevates non-autoregressive masked image modeling (MIM) text-to-image to a level comparable with state-of-the-art diffusion models like SDXL.
We leverage high-quality training data, integrate micro-conditions informed by human preference scores, and employ feature compression layers to further enhance image fidelity and resolution.
Our model not only matches but often exceeds the performance of existing models like SDXL in generating high-quality, high-resolution images.
arXiv Detail & Related papers (2024-10-10T17:59:17Z)
- Translatotron-V(ison): An End-to-End Model for In-Image Machine Translation [81.45400849638347]
In-image machine translation (IIMT) aims to translate an image containing texts in source language into an image containing translations in target language.
In this paper, we propose an end-to-end IIMT model consisting of four modules.
Our model achieves competitive performance compared to cascaded models with only 70.9% of parameters, and significantly outperforms the pixel-level end-to-end IIMT model.
arXiv Detail & Related papers (2024-07-03T08:15:39Z)
- Transformers Get Stable: An End-to-End Signal Propagation Theory for Language Models [6.809572275782338]
We develop a unified signal propagation theory and provide formulae that govern the moments of the forward and backward signal through the transformer model.
Our framework can be used to understand and mitigate vanishing/exploding gradients, rank collapse, and instability associated with high attention scores.
arXiv Detail & Related papers (2024-03-14T17:59:14Z)
- MambaByte: Token-free Selective State Space Model [71.90159903595514]
MambaByte is a token-free adaptation of the Mamba SSM trained autoregressively on byte sequences.
We show MambaByte to be competitive with, and even to outperform, state-of-the-art subword Transformers on language modeling tasks.
arXiv Detail & Related papers (2024-01-24T18:53:53Z)
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces [31.985243136674146]
Foundation models are almost universally based on the Transformer architecture and its core attention module.
We identify that a key weakness of subquadratic-time alternatives to attention, such as structured state space models (SSMs), is their inability to perform content-based reasoning.
We integrate selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba).
As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics.
arXiv Detail & Related papers (2023-12-01T18:01:34Z)
- Toeplitz Neural Network for Sequence Modeling [46.04964190407727]
We show that a Toeplitz matrix-vector product trick can reduce the space-time complexity of sequence modeling to log-linear.
A lightweight sub-network called relative position encoder is proposed to generate relative position coefficients with a fixed budget of parameters.
Despite being trained on 512-token sequences, our model can extrapolate input sequence length up to 14K tokens in inference with consistent performance.
arXiv Detail & Related papers (2023-05-08T14:49:01Z)
- Long-Short Transformer: Efficient Transformers for Language and Vision [97.2850205384295]
Long-Short Transformer (Transformer-LS) is an efficient self-attention mechanism for modeling long sequences with linear complexity for both language and vision tasks.
It aggregates a novel long-range attention with dynamic projection to model distant correlations and a short-term attention to capture fine-grained local correlations.
Our method outperforms the state-of-the-art models on multiple tasks in language and vision domains, including the Long Range Arena benchmark, autoregressive language modeling, and ImageNet classification.
arXiv Detail & Related papers (2021-07-05T18:00:14Z)
- Length-Controllable Image Captioning [67.2079793803317]
We propose a simple length-level embedding that lets existing image captioning models control the length of their captions.
Due to their autoregressive nature, the computational complexity of existing models increases linearly as the length of the generated captions grows.
We further devise a non-autoregressive image captioning approach that can generate captions in a length-irrelevant complexity.
arXiv Detail & Related papers (2020-07-19T03:40:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.