ByT5: Towards a token-free future with pre-trained byte-to-byte models
- URL: http://arxiv.org/abs/2105.13626v1
- Date: Fri, 28 May 2021 07:03:22 GMT
- Title: ByT5: Towards a token-free future with pre-trained byte-to-byte models
- Authors: Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang,
Mihir Kale, Adam Roberts, Colin Raffel
- Abstract summary: Most widely-used pre-trained language models operate on sequences of tokens corresponding to word or subword units.
We show that a standard Transformer architecture can be used with minimal modifications to process byte sequences.
We also demonstrate that byte-level models are significantly more robust to noise and perform better on tasks that are sensitive to spelling and pronunciation.
- Score: 23.532359202069063
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most widely-used pre-trained language models operate on sequences of tokens
corresponding to word or subword units. Encoding text as a sequence of tokens
requires a tokenizer, which is typically created as an independent artifact
from the model. Token-free models that instead operate directly on raw text
(bytes or characters) have many benefits: they can process text in any language
out of the box, they are more robust to noise, and they minimize technical debt
by removing complex and error-prone text preprocessing pipelines. Since byte or
character sequences are longer than token sequences, past work on token-free
models has often introduced new model architectures designed to amortize the
cost of operating directly on raw text. In this paper, we show that a standard
Transformer architecture can be used with minimal modifications to process byte
sequences. We carefully characterize the trade-offs in terms of parameter
count, training FLOPs, and inference speed, and show that byte-level models are
competitive with their token-level counterparts. We also demonstrate that
byte-level models are significantly more robust to noise and perform better on
tasks that are sensitive to spelling and pronunciation. As part of our
contribution, we release a new set of pre-trained byte-level Transformer models
based on the T5 architecture, as well as all code and data used in our
experiments.
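The token-free setup described above needs no learned vocabulary: UTF-8 itself plays the role of the tokenizer, and the model's input ids are essentially raw byte values plus a few reserved special ids. The sketch below is a minimal illustration in plain Python, not the paper's released code; the +3 offset reserving ids for padding, end-of-sequence, and unknown follows the convention of the released ByT5 tokenizer, but the exact id layout should be treated as an assumption of this sketch.
```python
# Minimal sketch of byte-level "tokenization" as described in the abstract:
# text is mapped to its UTF-8 bytes, so any language works out of the box and
# no tokenizer artifact has to ship with the model. The +3 offset (ids 0-2
# reserved for pad / eos / unk) mirrors the released ByT5 convention and is an
# assumption of this sketch, not a specification.

SPECIAL_OFFSET = 3  # ids 0, 1, 2 assumed reserved for pad, eos, unk


def text_to_byte_ids(text: str) -> list[int]:
    """Encode text as byte-level input ids."""
    return [b + SPECIAL_OFFSET for b in text.encode("utf-8")]


def byte_ids_to_text(ids: list[int]) -> str:
    """Decode byte-level ids back to text, dropping reserved special ids."""
    raw = bytes(i - SPECIAL_OFFSET for i in ids if i >= SPECIAL_OFFSET)
    return raw.decode("utf-8", errors="ignore")


if __name__ == "__main__":
    ids = text_to_byte_ids("Héllo, wörld")  # non-ASCII text needs no special handling
    print(ids)                              # e.g. [75, 198, 172, ...]
    print(byte_ids_to_text(ids))            # "Héllo, wörld"
```
With this mapping the vocabulary is at most 256 byte values plus the reserved ids, so byte-level models spend almost nothing on embedding tables and can reallocate that capacity to the Transformer layers, one of the trade-offs the paper characterizes.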
Related papers
- MrT5: Dynamic Token Merging for Efficient Byte-level Language Models [50.46453950887946]
This work introduces MrT5 (MergeT5), a more efficient variant of ByT5.
MrT5 integrates a token deletion mechanism in its encoder to dynamically shorten the input sequence length.
When trained on English text, MrT5 transfers its deletion mechanism zero-shot to several other languages.
arXiv Detail & Related papers (2024-10-28T06:14:12Z)
- CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens [49.569695524535454]
We propose to represent speech with supervised semantic tokens, which are derived from a multilingual speech recognition model by inserting vector quantization into the encoder.
Based on the tokens, we further propose a scalable zero-shot TTS synthesizer, CosyVoice, which consists of an LLM for text-to-token generation and a conditional flow matching model for token-to-speech synthesis.
arXiv Detail & Related papers (2024-07-07T15:16:19Z)
- Understanding and Mitigating Tokenization Bias in Language Models [6.418593476658017]
State-of-the-art language models are autoregressive and operate on subword units known as tokens.
We show that popular encoding schemes induce a sampling bias that cannot be mitigated with more training or data.
We propose a novel algorithm to obtain unbiased estimates from any language model trained on tokenized data.
arXiv Detail & Related papers (2024-06-24T17:38:02Z)
- Sentence Bottleneck Autoencoders from Transformer Language Models [53.350633961266375]
We build a sentence-level autoencoder from a pretrained, frozen transformer language model.
We adapt the masked language modeling objective as a generative, denoising one, while only training a sentence bottleneck and a single-layer modified transformer decoder.
We demonstrate that the sentence representations discovered by our model achieve better quality than previous methods that extract representations from pretrained transformers on text similarity tasks, style transfer, and single-sentence classification tasks in the GLUE benchmark, while using fewer parameters than large pretrained models.
arXiv Detail & Related papers (2021-08-31T19:39:55Z)
- Learning to Look Inside: Augmenting Token-Based Encoders with Character-Level Information [29.633735942273997]
XRayEmb is a method for retrofitting existing token-based models with character-level information.
We show that incorporating XRayEmb's learned vectors into sequences of pre-trained token embeddings helps performance on both autoregressive and masked pre-trained transformer architectures.
arXiv Detail & Related papers (2021-08-01T08:09:26Z)
- Charformer: Fast Character Transformers via Gradient-based Subword Tokenization [50.16128796194463]
We propose a new model inductive bias that learns a subword tokenization end-to-end as part of the model.
We introduce a soft gradient-based subword tokenization module (GBST) that automatically learns latent subword representations from characters.
We additionally introduce Charformer, a deep Transformer model that integrates GBST and operates on the byte level (a simplified sketch of the soft tokenization idea follows the list below).
arXiv Detail & Related papers (2021-06-23T22:24:14Z)
- Exploring Unsupervised Pretraining Objectives for Machine Translation [99.5441395624651]
Unsupervised cross-lingual pretraining has achieved strong results in neural machine translation (NMT).
Most approaches adapt masked-language modeling (MLM) to sequence-to-sequence architectures, by masking parts of the input and reconstructing them in the decoder.
We compare masking with alternative objectives that produce inputs resembling real (full) sentences, by reordering and replacing words based on their context.
arXiv Detail & Related papers (2021-06-10T10:18:23Z)
- CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation [12.005340904206697]
CANINE is a neural encoder that operates directly on character sequences without explicit tokenization or vocabulary.
CANINE outperforms a comparable mBERT model by at least 1 F1 on TyDi QA, a challenging multilingual benchmark.
arXiv Detail & Related papers (2021-03-11T18:57:44Z)
- Neural Machine Translation without Embeddings [44.129310924201604]
Many NLP models operate over sequences of subword tokens produced by hand-crafted tokenization rules and subword induction algorithms.
A simple universal alternative is to represent text as a sequence of bytes via UTF-8.
Experiments on byte-to-byte machine translation from English to 10 different languages show a consistent improvement in BLEU, rivaling character-level and even standard subword-level models.
arXiv Detail & Related papers (2020-08-21T09:54:11Z)
- Towards Reasonably-Sized Character-Level Transformer NMT by Finetuning Subword Systems [78.80826533405019]
We show that we can obtain a neural machine translation model that works at the character level without requiring token segmentation.
Our study is a significant step towards high-performance and easy-to-train character-based models that are not extremely large.
arXiv Detail & Related papers (2020-04-29T15:56:02Z)
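The Charformer entry above describes learning subword tokenization end-to-end with a soft gradient-based module. The sketch below is a simplified, GBST-like illustration, not the authors' implementation: candidate blocks of sizes 1 to 4 are mean-pooled, scored with a single learned projection, softly mixed per position, and then downsampled with stride 2. The block sizes, the scoring head, and the downsampling rate are all assumptions made for illustration.
```python
# Simplified GBST-like soft subword pooling, in the spirit of the Charformer
# summary above. Illustrative sketch under stated assumptions; not a
# reference implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SoftSubwordPooling(nn.Module):
    def __init__(self, d_model: int, max_block: int = 4, downsample: int = 2):
        super().__init__()
        self.max_block = max_block
        self.downsample = downsample
        self.score = nn.Linear(d_model, 1)  # learned score for each candidate block

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, length, d_model) character/byte embeddings
        b, length, d = x.shape
        candidates = []
        for size in range(1, self.max_block + 1):
            # Mean-pool non-overlapping blocks of this size, then repeat each
            # block representation so it aligns back to per-position length.
            pad = (-length) % size
            xp = F.pad(x, (0, 0, 0, pad))
            blocks = xp.view(b, -1, size, d).mean(dim=2)                  # (b, L/size, d)
            per_pos = blocks.repeat_interleave(size, dim=1)[:, :length]   # (b, L, d)
            candidates.append(per_pos)
        cand = torch.stack(candidates, dim=2)                             # (b, L, M, d)
        weights = F.softmax(self.score(cand).squeeze(-1), dim=-1)         # (b, L, M)
        mixed = (weights.unsqueeze(-1) * cand).sum(dim=2)                 # (b, L, d)
        # Downsample so the Transformer stack runs on a shorter sequence.
        pad = (-length) % self.downsample
        mixed = F.pad(mixed, (0, 0, 0, pad))
        return mixed.view(b, -1, self.downsample, d).mean(dim=2)


if __name__ == "__main__":
    module = SoftSubwordPooling(d_model=64)
    out = module(torch.randn(2, 50, 64))
    print(out.shape)  # torch.Size([2, 25, 64])
```
Because the block-size weights come from a softmax, the choice of "subword" granularity stays differentiable and can be trained jointly with the rest of the model, which is the core idea the Charformer summary points at.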