Related papers: LBPE: Long-token-first Tokenization to Improve Large Language Models

URL: http://arxiv.org/abs/2411.05504v1
Date: Fri, 08 Nov 2024 12:03:36 GMT
Title: LBPE: Long-token-first Tokenization to Improve Large Language Models
Authors: Haoran Lian, Yizhe Xiong, Zijia Lin, Jianwei Niu, Shasha Mo, Hui Chen, Peng Liu, Guiguang Ding
Abstract summary: Long tokens, rich in semantic information, have fewer occurrences in tokenized datasets compared to short tokens. We propose LBPE, which prioritizes long tokens during the encoding process. Experiments across diverse language modeling tasks demonstrate that LBPE consistently outperforms the original BPE.
Score: 26.3619552256488
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The prevalent use of Byte Pair Encoding (BPE) in Large Language Models (LLMs) facilitates robust handling of subword units and avoids out-of-vocabulary issues. Despite its success, a critical challenge persists: long tokens, rich in semantic information, occur less often in tokenized datasets than short tokens, which can lead to imbalanced learning across tokens. To address this, we propose LBPE, which prioritizes long tokens during the encoding process. LBPE generates tokens according to their reverse ranks of token length rather than their ranks in the vocabulary, granting longer tokens higher priority during encoding. Consequently, LBPE smooths the frequency differences between short and long tokens, and thus mitigates the learning imbalance. Extensive experiments across diverse language modeling tasks show that LBPE consistently outperforms the original BPE, demonstrating its effectiveness.
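The difference between rank-ordered and length-ordered encoding described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `encode`, the `long_first` flag, the tie-breaking rule, and the toy vocabulary are all assumptions made for illustration, and it reads "reverse ranks of token length" simply as "prefer the merge that produces the longest token".

```python
def encode(word, vocab_rank, long_first=False):
    """Greedy pair-merging encoder over a toy vocabulary (illustrative only).

    vocab_rank maps each merged token to its rank in the vocabulary
    (lower rank = learned earlier during BPE training).
    long_first=False picks the merge with the lowest vocabulary rank (BPE-style).
    long_first=True picks the merge that yields the longest token, using the
    vocabulary rank only to break ties (an assumed reading of LBPE).
    """
    def priority(candidate):
        index, merged = candidate
        if long_first:
            return (-len(merged), vocab_rank[merged], index)
        return (vocab_rank[merged], index)

    parts = list(word)
    while len(parts) > 1:
        # Adjacent pairs whose concatenation is a known token.
        candidates = [
            (i, parts[i] + parts[i + 1])
            for i in range(len(parts) - 1)
            if parts[i] + parts[i + 1] in vocab_rank
        ]
        if not candidates:
            break
        index, merged = min(candidates, key=priority)
        parts[index:index + 2] = [merged]
    return parts


# Hypothetical toy vocabulary: "abc" was learned after "cd".
toy_vocab = {"ab": 0, "cd": 1, "abc": 2}

print(encode("abcd", toy_vocab))                   # ['ab', 'cd']
print(encode("abcd", toy_vocab, long_first=True))  # ['abc', 'd']
```

In this toy case, rank-ordered merging never emits the longer token "abc" because the earlier-ranked merge "cd" consumes the "c" first, while the length-ordered variant does emit it. That is the kind of frequency shift toward long tokens the abstract describes.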

Related papers

MorphTok: Morphologically Grounded Tokenization for Indian Languages [23.58043476541051]
Tokenization is a crucial step in NLP, especially with the rise of large language models (LLMs). We propose morphology-aware segmentation as a pre-tokenization step prior to applying subword tokenization. We also introduce Constrained BPE, an extension to the traditional BPE algorithm that incorporates script-specific constraints.
arXiv Detail & Related papers (2025-04-14T15:44:45Z)
SuperBPE: Space Travel for Language Models [112.64910939119056]
We introduce a "superword" tokenizer, SuperBPE, which incorporates a simple pretokenization curriculum into the byte-pair encoding (BPE) algorithm. SuperBPE encodes a fixed piece of text with up to 33% fewer tokens than BPE on average. Our model trained with SuperBPE achieves an average +4.0% absolute improvement over the BPE baseline across 30 downstream tasks.
arXiv Detail & Related papers (2025-03-17T17:53:23Z)
Beyond Literal Token Overlap: Token Alignability for Multilinguality [53.680462160878925]
We propose subword token alignability as a new way to understand the impact and quality of multilingual tokenisation. In particular, this metric predicts multilinguality much better when scripts are disparate and the overlap of literal tokens is low. We recommend our subword token alignability metric for identifying optimal language pairs for cross-lingual transfer.
arXiv Detail & Related papers (2025-02-10T13:50:12Z)
Improbable Bigrams Expose Vulnerabilities of Incomplete Tokens in Byte-Level Tokenizers [32.274579719726546]
Tokenization is a crucial step that bridges human-readable text with model-readable discrete tokens. Recent studies have revealed that tokenizers can be exploited to elicit unwanted model behaviors. We investigate incomplete tokens, i.e., undecodable tokens with stray bytes resulting from byte-level byte-pair encoding (BPE) tokenization.
arXiv Detail & Related papers (2024-10-31T07:19:44Z)
Batching BPE Tokenization Merges [55.2480439325792]
BatchBPE is an open-source, pure Python implementation of the Byte Pair Encoding algorithm. It can be used to train a high-quality tokenizer on a basic laptop.
arXiv Detail & Related papers (2024-08-05T09:37:21Z)
Scaffold-BPE: Enhancing Byte Pair Encoding for Large Language Models with Simple and Effective Scaffold Token Removal [58.29382184006158]
We propose Scaffold-BPE, which incorporates a dynamic scaffold token removal mechanism through parameter-free, computation-light, and easy-to-implement modifications to the original BPE method. In extensive experiments across language modeling and even machine translation, Scaffold-BPE consistently outperforms the original BPE.
arXiv Detail & Related papers (2024-04-27T07:12:07Z)
Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration [54.897493351694195]
We propose a novel parallel decoding approach, namely hidden transfer, which decodes multiple successive tokens simultaneously in a single forward pass. In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
arXiv Detail & Related papers (2024-04-18T09:17:06Z)
Tokenization Is More Than Compression [14.939912120571728]
Existing tokenization approaches like Byte-Pair Encoding (BPE) originate from the field of data compression. We introduce PathPiece, a new tokenizer that segments a document's text into the minimum number of tokens for a given vocabulary.
arXiv Detail & Related papers (2024-02-28T14:52:15Z)
Toucan: Token-Aware Character Level Language Modeling [44.85590844938571]
Toucan is an augmentation to character-level models to make them "token-aware". We show significant speed-ups in character generation without a loss in language modeling performance. Our approach leads to a greater amount of longer sequences tokenized as single items.
arXiv Detail & Related papers (2023-11-15T00:57:51Z)
Memory Augmented Lookup Dictionary based Language Modeling for Automatic Speech Recognition [20.926163659469587]
We propose a new memory augmented lookup dictionary based Transformer architecture for LM. The newly introduced lookup dictionary incorporates rich contextual information from the training set, which is vital to correctly predicting long-tail tokens. Our proposed method is shown to outperform the baseline Transformer LM by a great margin on both word/character error rate and tail-token error rate.
arXiv Detail & Related papers (2022-12-30T22:26:57Z)
Fast End-to-End Speech Recognition via a Non-Autoregressive Model and Cross-Modal Knowledge Transferring from BERT [72.93855288283059]
We propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once). The model consists of an encoder, a decoder, and a position dependent summarizer (PDS).
arXiv Detail & Related papers (2021-02-15T15:18:59Z)
Token-level Adaptive Training for Neural Machine Translation [84.69646428587548]
There exists a token imbalance phenomenon in natural language, as different tokens appear with different frequencies. The vanilla NMT model usually adopts trivial equal-weighted objectives for target tokens with different frequencies. Low-frequency tokens may carry critical semantic information that will affect the translation quality once they are neglected.
arXiv Detail & Related papers (2020-10-09T05:55:05Z)
Byte Pair Encoding is Suboptimal for Language Model Pretraining [49.30780227162387]
We analyze differences between unigram LM tokenization and byte-pair encoding (BPE). We find that the unigram LM tokenization method matches or outperforms BPE across downstream tasks and two languages. We hope that developers of future pretrained LMs will consider adopting the unigram LM method over the more prevalent BPE.
arXiv Detail & Related papers (2020-04-07T21:21:06Z)

This list is automatically generated from the titles and abstracts of the papers on this site.