Batching BPE Tokenization Merges
- URL: http://arxiv.org/abs/2408.04653v1
- Date: Mon, 5 Aug 2024 09:37:21 GMT
- Title: Batching BPE Tokenization Merges
- Authors: Alexander P. Morgan,
- Abstract summary: BatchBPE is an open-source pure Python implementation of the Byte Pair algorithm.
It is used to train a high quality tokenizer on a basic laptop.
- Score: 55.2480439325792
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Byte Pair Encoding algorithm can be safely batched to merge hundreds of pairs of tokens at a time when building up a tokenizer's vocabulary. This technique combined with reducing the memory footprint of text used in vocabulary training make it feasible to train a high quality tokenizer on a basic laptop. This paper presents BatchBPE, an open-source pure Python implementation of these concepts, with the goal of making experimenting with new tokenization strategies more accessible especially in compute- and memory-constrained contexts. BatchBPE's usefulness and malleability are demonstrated through the training of several token vocabularies to explore the batch merging process and experiment with preprocessing a stop word list and ignoring the least common text chunks in a dataset. Resultant encoded lengths of texts are used as a basic evaluation metric.
Related papers
- LiteToken: Removing Intermediate Merge Residues From BPE Tokenizers [76.59130257385826]
Intermediate merge residues in BPE vocabularies are frequent during merge learning so that retained in the final vocabulary, but are mostly further merged and rarely emitted when tokenizing the corpus during tokenizer usage.<n>We present a systematic empirical characterization of this phenomenon across commonly used tokenizers and introduce LiteToken, a simple method for removing residue tokens.<n>Experiments show that LiteToken reduces token fragmentation, reduces parameters, and improves robustness to noisy or misspelled inputs, while preserving overall performance.
arXiv Detail & Related papers (2026-02-04T16:19:05Z) - AdaptBPE: From General Purpose to Specialized Tokenizers [18.70903226766322]
We propose a post-training adaptation strategy that selectively replaces low-utility tokens with more relevant ones based on their frequency in an adaptation corpus.<n>Our algorithm identifies the token inventory that most effectively encodes the adaptation corpus for a given target vocabulary size.<n>This method serves as a lightweight adaptation mechanism, akin to a vocabulary fine-tuning process, enabling optimized tokenization for specific domains or tasks.
arXiv Detail & Related papers (2026-01-29T12:59:40Z) - A Survey on Deep Text Hashing: Efficient Semantic Text Retrieval with Binary Representation [69.50397417361351]
Text hashing projects original texts into compact binary hash codes.<n>Deep text hashing has demonstrated significant advantages over traditional, data-independent hashing techniques.<n>This survey investigates current deep text hashing methods by categorizing them based on their core components.
arXiv Detail & Related papers (2025-10-31T06:51:37Z) - Sampling from Your Language Model One Byte at a Time [82.71473348639489]
Tokenization can introduce distortion into the model's generations, known as the Prompt Boundary Problem (PBP)<n>We present an inference-time method to convert any autore LM with a BPE tokenizer into a character-level or byte-level LM.<n>Our method efficiently solves the PBP and is also able to unify the vocabularies of language models with different tokenizers.
arXiv Detail & Related papers (2025-06-17T02:37:04Z) - MorphTok: Morphologically Grounded Tokenization for Indian Languages [23.58043476541051]
Tokenization is a crucial step in NLP, especially with the rise of large language models (LLMs)
We propose morphology-aware segmentation as a pre-tokenization step prior to applying subword tokenization.
We also introduce Constrained BPE, an extension to the traditional BPE algorithm that incorporates script-specific constraints.
arXiv Detail & Related papers (2025-04-14T15:44:45Z) - Boundless Byte Pair Encoding: Breaking the Pre-tokenization Barrier [4.300681074103876]
Pre-tokenization causes the distribution of tokens in a corpus to skew towards common, full-length words.
We propose BoundlessB, a modified BPE algorithm that relaxes the pretoken boundary constraint.
Our approach selectively merges two complete pretokens into a larger unit we term a superword.
arXiv Detail & Related papers (2025-03-31T19:36:29Z) - Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning [53.57895922042783]
Large Language Models (LLMs) excel at reasoning and planning when trained on chainof-thought (CoT) data.<n>We propose a hybrid representation of the reasoning process, where we partially abstract away the initial reasoning steps using latent discrete tokens.
arXiv Detail & Related papers (2025-02-05T15:33:00Z) - Tokenization as Finite-State Transduction [24.19959327497118]
We introduce a finite-state framework which can efficiently encode all possible tokenizations of a regular language.
We show that Byte-Pair.
Match (BPE) and MaxPiece (WordPiece) fit within this framework.
An application of this is to guided generation, where the outputs of a language model are constrained to match some pattern.
arXiv Detail & Related papers (2024-10-21T07:10:07Z) - BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training [8.012203293561196]
Picky BPE is a modified BPE algorithm that carries out vocabulary refinement during tokenizer training.
Our method improves vocabulary efficiency, eliminates under-trained tokens, and does not compromise text compression.
arXiv Detail & Related papers (2024-09-06T20:12:34Z) - SEP: Self-Enhanced Prompt Tuning for Visual-Language Model [93.94454894142413]
We introduce a novel approach named Self-Enhanced Prompt Tuning (SEP)
SEP explicitly incorporates discriminative prior knowledge to enhance both textual-level and visual-level embeddings.
Comprehensive evaluations across various benchmarks and tasks confirm SEP's efficacy in prompt tuning.
arXiv Detail & Related papers (2024-05-24T13:35:56Z) - A cost minimization approach to fix the vocabulary size in a tokenizer for an End-to-End ASR system [10.70500939394669]
tokenization algorithms like Byte Pair Piece (BPE) and WordPiece are popular in identifying the tokens that are used in the overall training process of the speech recognition system.
We show through experiments on LibriSpeech 100 hour set that the performance of an end-to-end ASR system improves when the number of tokens are chosen carefully.
arXiv Detail & Related papers (2024-04-29T12:16:21Z) - Scaffold-BPE: Enhancing Byte Pair Encoding for Large Language Models with Simple and Effective Scaffold Token Removal [58.29382184006158]
We propose Scaffold-BPE, which incorporates a dynamic scaffold token removal mechanism by parameter-free, computation-light, and easy-to-implement modifications to the original BPE method.
On extensive experiments across language modeling and even machine translation, Scaffold-BPE consistently outperforms the original BPE.
arXiv Detail & Related papers (2024-04-27T07:12:07Z) - Training LLMs over Neurally Compressed Text [55.11828645767342]
This paper explores the idea of training large language models (LLMs) over highly compressed text.
We propose Equal-Info Windows, a novel compression technique whereby text is segmented into blocks that each compress to the same bit length.
We demonstrate effective learning over neurally compressed text that improves with scale, and outperforms byte-level baselines by a wide margin on perplexity and inference speed benchmarks.
arXiv Detail & Related papers (2024-04-04T17:48:28Z) - Tokenization Is More Than Compression [14.939912120571728]
Existing tokenization approaches like Byte-Pair.
(BPE) originate from the field of data compression.
We introduce PathPiece, a new tokenizer that segments a document's text into the minimum number of tokens for a given vocabulary.
arXiv Detail & Related papers (2024-02-28T14:52:15Z) - TagCLIP: Improving Discrimination Ability of Open-Vocabulary Semantic Segmentation [53.974228542090046]
Contrastive Language-Image Pre-training (CLIP) has recently shown great promise in pixel-level zero-shot learning tasks.
Existing approaches utilizing CLIP's text and patch embeddings to generate semantic masks often misidentify input pixels from unseen classes.
We propose TagCLIP (Trusty-aware guided CLIP) to address this issue.
arXiv Detail & Related papers (2023-04-15T12:52:23Z) - Byte Pair Encoding is Suboptimal for Language Model Pretraining [49.30780227162387]
We analyze differences between unigram LM tokenization and byte-pair encoding (BPE)
We find that the unigram LM tokenization method matches or outperforms BPE across downstream tasks and two languages.
We hope that developers of future pretrained LMs will consider adopting the unigram LM method over the more prevalent BPE.
arXiv Detail & Related papers (2020-04-07T21:21:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.