MrT5: Dynamic Token Merging for Efficient Byte-level Language Models
- URL: http://arxiv.org/abs/2410.20771v1
- Date: Mon, 28 Oct 2024 06:14:12 GMT
- Title: MrT5: Dynamic Token Merging for Efficient Byte-level Language Models
- Authors: Julie Kallini, Shikhar Murty, Christopher D. Manning, Christopher Potts, Róbert Csordás
- Abstract summary: This work introduces MrT5 (MergeT5), a more efficient variant of ByT5.
MrT5 integrates a token deletion mechanism in its encoder to dynamically shorten the input sequence length.
When trained on English text, MrT5 demonstrates the capability to transfer its deletion feature zero-shot across several languages.
- Abstract: Models that rely on subword tokenization have significant drawbacks, such as sensitivity to character-level noise like spelling errors and inconsistent compression rates across different languages and scripts. While character- or byte-level models like ByT5 attempt to address these concerns, they have not gained widespread adoption -- processing raw byte streams without tokenization results in significantly longer sequence lengths, making training and inference inefficient. This work introduces MrT5 (MergeT5), a more efficient variant of ByT5 that integrates a token deletion mechanism in its encoder to dynamically shorten the input sequence length. After processing through a fixed number of encoder layers, a learned delete gate determines which tokens are to be removed and which are to be retained for subsequent layers. MrT5 effectively "merges" critical information from deleted tokens into a more compact sequence, leveraging contextual information from the remaining tokens. In continued pre-training experiments, we find that MrT5 can achieve significant gains in inference runtime with minimal effect on performance. When trained on English text, MrT5 demonstrates the capability to transfer its deletion feature zero-shot across several languages, with significant additional improvements following multilingual training. Furthermore, MrT5 shows comparable accuracy to ByT5 on downstream evaluations such as XNLI and character-level tasks while reducing sequence lengths by up to 80%. Our approach presents a solution to the practical limitations of existing byte-level models.
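To make the gating idea concrete, here is a minimal PyTorch-style sketch of an encoder that scores each token after a few layers and drops low-scoring positions before the remaining layers run on the shorter sequence. It illustrates the general mechanism only and is not the MrT5 implementation: the gate parameterization, the layer index, the hard 0.5 threshold, and the single-example gather are all simplifying assumptions.

```python
# Sketch of a "delete gate" inside a Transformer encoder. Not the MrT5
# implementation: gate parameterization, layer split, and the hard 0.5
# threshold are assumptions made for readability.
import torch
import torch.nn as nn


class GatedEncoder(nn.Module):
    def __init__(self, d_model=64, n_layers=6, gate_layer=3):
        super().__init__()
        self.early = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            for _ in range(gate_layer)
        )
        self.late = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            for _ in range(n_layers - gate_layer)
        )
        self.delete_gate = nn.Linear(d_model, 1)  # one keep/delete score per token

    def forward(self, x):                          # x: (batch=1, seq_len, d_model)
        for layer in self.early:
            x = layer(x)
        keep = torch.sigmoid(self.delete_gate(x)).squeeze(-1) > 0.5  # (1, seq_len)
        # Drop deleted positions (single example for simplicity; a batched
        # version would pad each shortened sequence to a common length).
        x = x[:, keep[0], :]
        for layer in self.late:                    # later layers see a shorter sequence
            x = layer(x)
        return x


enc = GatedEncoder()
byte_embeddings = torch.randn(1, 128, 64)          # e.g., 128 byte-level embeddings
out = enc(byte_embeddings)
print(byte_embeddings.shape[1], "->", out.shape[1], "positions after deletion")
```

During training, a soft version of this decision (and some control over the overall deletion rate) would typically be needed so that gradients can flow through the gate; the hard threshold above is only an inference-style illustration.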
Related papers
- BTR: Binary Token Representations for Efficient Retrieval Augmented Language Models [77.0501668780182]
Retrieval augmentation addresses many critical problems in large language models.
Running retrieval-augmented language models (LMs) is slow and difficult to scale due to processing large amounts of retrieved text.
We introduce binary token representations (BTR), which use 1-bit vectors to precompute every token in passages.
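As a rough picture of what the 1-bit representations above buy, the sketch below binarizes dense token vectors with a sign test and packs them into bytes. It is a generic illustration of binary packing and its storage saving, not the BTR method itself; the sizes are arbitrary.

```python
# Generic sketch of 1-bit token representations: binarize dense token vectors
# with a sign test and pack them into bytes. Illustrates the storage saving
# only; it is not the BTR training or calibration procedure.
import numpy as np

rng = np.random.default_rng(0)
tokens = rng.standard_normal((512, 768)).astype(np.float32)  # 512 tokens, d=768

bits = tokens > 0                        # 1 bit per dimension
packed = np.packbits(bits, axis=-1)      # 768 bits -> 96 bytes per token

dense_bytes = tokens.nbytes              # 512 * 768 * 4 bytes
binary_bytes = packed.nbytes             # 512 * 96 bytes
print(f"dense: {dense_bytes} B, 1-bit: {binary_bytes} B "
      f"({dense_bytes / binary_bytes:.0f}x smaller)")
```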
arXiv Detail & Related papers (2023-10-02T16:48:47Z)
- mmT5: Modular Multilingual Pre-Training Solves Source Language Hallucinations [54.42422445568523]
mmT5 is a modular multilingual sequence-to-sequence model.
It disentangles language-specific information from language-agnostic information.
Compared to mT5, mmT5 raises the rate of generating text in the correct language under zero-shot settings from 7% to 99%.
arXiv Detail & Related papers (2023-05-23T16:38:01Z)
- Reducing Sequence Length by Predicting Edit Operations with Large Language Models [50.66922361766939]
This paper proposes predicting edit spans for the source text for local sequence transduction tasks.
We apply instruction tuning for Large Language Models on the supervision data of edit spans.
Experiments show that the proposed method achieves comparable performance to the baseline in four tasks.
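To see why emitting edits rather than the full target shortens generation, the sketch below diffs a noisy source against its correction and keeps only the changed spans. The (start, end, replacement) encoding here is an arbitrary choice for illustration, not necessarily the span format used in the paper.

```python
# Generic illustration of edit-span prediction: instead of generating the
# whole corrected sentence, emit only the spans that change.
from difflib import SequenceMatcher

source = "She have went to the libary yesterday ."
target = "She had gone to the library yesterday ."

edits = []
matcher = SequenceMatcher(None, source.split(), target.split())
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag != "equal":                   # keep only the parts that change
        edits.append((i1, i2, " ".join(target.split()[j1:j2])))

print(edits)                             # e.g. [(1, 3, 'had gone'), (5, 6, 'library')]
print(len(target.split()), "target tokens vs", len(edits), "edit spans")
```

The saving grows with input length: for mostly correct text, the edit spans stay short while a full rewrite scales with the whole sentence.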
arXiv Detail & Related papers (2023-05-19T17:51:05Z)
- Evaluating Byte and Wordpiece Level Models for Massively Multilingual Semantic Parsing [3.431659287330068]
We compare a byte-level (ByT5) and a wordpiece based (mT5) sequence to sequence model on the 51 languages of the MASSIVE multilingual semantic parsing dataset.
We are able to reduce the gap in exact match accuracy to only 5 points with respect to a model trained on gold data from all the languages.
arXiv Detail & Related papers (2022-12-14T13:48:32Z)
- EdiT5: Semi-Autoregressive Text-Editing with T5 Warm-Start [21.4394742421462]
EdiT5 is a novel semi-autoregressive text-editing approach.
It combines the strengths of non-autoregressive text-editing and autoregressive decoding.
arXiv Detail & Related papers (2022-05-24T17:13:22Z)
- Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models [10.645591218689058]
We provide the first exploration of text-to-text transformers (T5) sentence embeddings.
We investigate three methods for extracting T5 sentence embeddings.
Our encoder-only models outperform BERT-based sentence embeddings on both transfer tasks and semantic textual similarity.
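For intuition, one common way to pool encoder token outputs into a single sentence vector is mean-pooling over non-padding positions, as in the sketch below; random tensors stand in for real T5 encoder outputs, and the paper itself compares several extraction strategies rather than prescribing this one.

```python
# Mean-pooling sketch: average encoder token outputs over non-padding
# positions to get one vector per sentence. Random tensors stand in for
# real T5 encoder outputs.
import torch

hidden = torch.randn(2, 16, 512)                   # (batch, seq_len, d_model)
mask = torch.ones(2, 16)
mask[1, 10:] = 0                                   # second sentence has 10 real tokens

summed = (hidden * mask.unsqueeze(-1)).sum(dim=1)  # zero out padding, sum over tokens
sentence_emb = summed / mask.sum(dim=1, keepdim=True)
print(sentence_emb.shape)                          # torch.Size([2, 512])
```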
arXiv Detail & Related papers (2021-08-19T18:58:02Z)
- ByT5: Towards a token-free future with pre-trained byte-to-byte models [23.532359202069063]
Most widely-used pre-trained language models operate on sequences of tokens corresponding to word or subword units.
We show that a standard Transformer architecture can be used with minimal modifications to process byte sequences.
We also demonstrate that byte-level models are significantly more robust to noise and perform better on tasks that are sensitive to spelling and pronunciation.
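The cost of going token-free is easy to see directly: each character takes one to four UTF-8 bytes, so byte-level inputs grow quickly for non-Latin scripts. The snippet below uses plain Python byte encoding; the actual ByT5 vocabulary additionally reserves a few IDs for special tokens, which is omitted here.

```python
# Why byte-level sequences are long: a byte-level model consumes raw UTF-8
# bytes, so every character costs 1-4 input positions. (Special-token ID
# offsets used by ByT5 are omitted here.)
for text in ["hello world", "héllo wörld", "こんにちは世界"]:
    byte_ids = list(text.encode("utf-8"))
    print(f"{text!r}: {len(text)} chars -> {len(byte_ids)} byte tokens")
```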
arXiv Detail & Related papers (2021-05-28T07:03:22Z)
- mT6: Multilingual Pretrained Text-to-Text Transformer with Translation Pairs [51.67970832510462]
We improve the multilingual text-to-text transfer Transformer with translation pairs (mT6).
We explore three cross-lingual text-to-text pre-training tasks, namely, machine translation, translation pair span corruption, and translation span corruption.
Experimental results show that the proposed mT6 improves cross-lingual transferability over mT5.
arXiv Detail & Related papers (2021-04-18T03:24:07Z)
- mT5: A massively multilingual pre-trained text-to-text transformer [60.0210636815514]
"Text-to-Text Transfer Transformer" (T5) leveraged a unified text-to-text format and scale to attain state-of-the-art results on English-language NLP tasks.
We introduce mT5, a multilingual variant of T5 that was pre-trained on a new Common Crawl-based dataset covering 101 languages.
arXiv Detail & Related papers (2020-10-22T17:58:14Z)