Related papers: SpaceByte: Towards Deleting Tokenization from Large Language Modeling

SpaceByte: Towards Deleting Tokenization from Large Language Modeling

URL: http://arxiv.org/abs/2404.14408v3
Date: Sun, 06 Oct 2024 02:17:26 GMT
Title: SpaceByte: Towards Deleting Tokenization from Large Language Modeling
Authors: Kevin Slagle,
Abstract summary: Tokenization is widely used in large language models because it significantly improves performance. We propose SpaceByte, a novel byte-level decoder architecture that closes the performance gap between byte-level and subword autoregressive language modeling.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Tokenization is widely used in large language models because it significantly improves performance. However, tokenization imposes several disadvantages, such as performance biases, increased adversarial vulnerability, decreased character-level modeling performance, and increased modeling complexity. To address these disadvantages without sacrificing performance, we propose SpaceByte, a novel byte-level decoder architecture that closes the performance gap between byte-level and subword autoregressive language modeling. SpaceByte consists of a byte-level Transformer model, but with extra larger transformer blocks inserted in the middle of the layers. We find that performance is significantly improved by applying these larger blocks only after certain bytes, such as space characters, which typically denote word boundaries. Our experiments show that for a fixed training and inference compute budget, SpaceByte outperforms other byte-level architectures and roughly matches the performance of tokenized Transformer architectures.

Related papers

Sliding Window Attention Training for Efficient Large Language Models [55.56483740523027]
We introduce SWAT, which enables efficient long-context handling via Sliding Window Attention Training.<n>This paper first attributes the inefficiency of Transformers to the attention sink phenomenon.<n>We replace softmax with the sigmoid function and utilize a balanced ALiBi and Rotary Position Embedding for efficient information compression and retention.
arXiv Detail & Related papers (2025-02-26T05:31:44Z)
Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling [10.985444895887207]
We introduce Over-Tokenized Transformers, a framework that decouples input and output vocabularies to improve language modeling performance. We uncover a log-linear relationship between input vocabulary size and training loss, demonstrating that larger input vocabularies consistently enhance model performance. Our findings highlight the importance of tokenization in scaling laws and provide practical insight for tokenizer design.
arXiv Detail & Related papers (2025-01-28T14:15:42Z)
Hierarchical Autoregressive Transformers: Combining Byte- and Word-Level Processing for Robust, Adaptable Language Models [3.382910438968506]
Tokenization is a fundamental step in natural language processing, breaking text into units that computational models can process. We investigate a hierarchical architecture for autoregressive language modelling that combines character-level and word-level processing. We demonstrate, at scales up to 7 billion parameters, that hierarchical transformers match the downstream task performance of subword-tokenizer-based models.
arXiv Detail & Related papers (2025-01-17T17:51:53Z)
Byte Latent Transformer: Patches Scale Better Than Tokens [101.10994909832063]
Byte Latent Transformer (BLT) encodes bytes into dynamically sized patches, which serve as the primary units of computation. For fixed inference costs, BLT shows significantly better scaling than tokenization-based models, by simultaneously growing both patch and model size.
arXiv Detail & Related papers (2024-12-13T05:33:32Z)
Layer-Condensed KV Cache for Efficient Inference of Large Language Models [44.24593677113768]
We propose a novel method that only computes and caches the KVs of a small number of layers. Our method achieves up to 26$times$ higher throughput than standard transformers.
arXiv Detail & Related papers (2024-05-17T08:59:46Z)
An Analysis of BPE Vocabulary Trimming in Neural Machine Translation [56.383793805299234]
vocabulary trimming is a postprocessing step that replaces rare subwords with their component subwords. We show that vocabulary trimming fails to improve performance and is even prone to incurring heavy degradation.
arXiv Detail & Related papers (2024-03-30T15:29:49Z)
Linear Attention via Orthogonal Memory [46.14195464583495]
We propose textbfLinear textbfAttention textbfVia textbfOrthogonal memory(shortname) to address these limitations. shortname achieves strong performance while maintaining linear complexity.
arXiv Detail & Related papers (2023-12-18T12:26:27Z)
BTR: Binary Token Representations for Efficient Retrieval Augmented Language Models [77.0501668780182]
Retrieval augmentation addresses many critical problems in large language models. Running retrieval-augmented language models (LMs) is slow and difficult to scale due to processing large amounts of retrieved text. We introduce binary token representations (BTR), which use 1-bit vectors to precompute every token in passages.
arXiv Detail & Related papers (2023-10-02T16:48:47Z)
Towards A Unified View of Sparse Feed-Forward Network in Pretraining Large Language Model [58.9100867327305]
Large and sparse feed-forward layers (S-FFN) have proven effective in scaling up Transformers model size for textitpretraining large language models. We analyzed two major design choices of S-FFN: the memory block (a.k.a. expert) size and the memory block selection method. We found a simpler selection method -- textbftextttAvg-K that selects blocks through their mean aggregated hidden states, achieving lower perplexity in language model pretraining.
arXiv Detail & Related papers (2023-05-23T12:28:37Z)
Infor-Coef: Information Bottleneck-based Dynamic Token Downsampling for Compact and Efficient language model [0.0]
Excessive overhead leads to large latency and computational costs. We propose a model accelaration approaches for large language models. Our model achieves an 18x FLOPs speedup with an accuracy degradation of less than 8% compared to BERT.
arXiv Detail & Related papers (2023-05-21T13:30:56Z)
MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers [78.85346970193518]
Megabyte is a multi-scale decoder architecture that enables end-to-end differentiable modeling of sequences of over one million bytes. Experiments show that Megabyte allows byte-level models to perform competitively with subword models on long context language modeling. Results establish the viability of tokenization-free autoregressive sequence modeling at scale.
arXiv Detail & Related papers (2023-05-12T00:55:41Z)
GroupBERT: Enhanced Transformer Architecture with Efficient Grouped Structures [57.46093180685175]
We demonstrate a set of modifications to the structure of a Transformer layer, producing a more efficient architecture. We add a convolutional module to complement the self-attention module, decoupling the learning of local and global interactions. We apply the resulting architecture to language representation learning and demonstrate its superior performance compared to BERT models of different scales.
arXiv Detail & Related papers (2021-06-10T15:41:53Z)
ByT5: Towards a token-free future with pre-trained byte-to-byte models [23.532359202069063]
Most widely-used pre-trained language models operate on sequences of tokens corresponding to word or subword units. We show that a standard Transformer architecture can be used with minimal modifications to process byte sequences. We also demonstrate that byte-level models are significantly more robust to noise and perform better on tasks that are sensitive to spelling and pronunciation.
arXiv Detail & Related papers (2021-05-28T07:03:22Z)

This list is automatically generated from the titles and abstracts of the papers in this site.