Batching BPE Tokenization Merges
- URL: http://arxiv.org/abs/2408.04653v1
- Date: Mon, 5 Aug 2024 09:37:21 GMT
- Title: Batching BPE Tokenization Merges
- Authors: Alexander P. Morgan
- Abstract summary: BatchBPE is an open-source, pure-Python implementation of the Byte Pair Encoding algorithm.
It can be used to train a high-quality tokenizer on a basic laptop.
- Score: 55.2480439325792
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Byte Pair Encoding algorithm can be safely batched to merge hundreds of pairs of tokens at a time when building up a tokenizer's vocabulary. This technique, combined with reducing the memory footprint of the text used in vocabulary training, makes it feasible to train a high-quality tokenizer on a basic laptop. This paper presents BatchBPE, an open-source, pure-Python implementation of these concepts, with the goal of making experimentation with new tokenization strategies more accessible, especially in compute- and memory-constrained contexts. BatchBPE's usefulness and malleability are demonstrated by training several token vocabularies to explore the batch-merging process and to experiment with preprocessing a stop-word list and ignoring the least common text chunks in a dataset. The resulting encoded lengths of texts are used as a basic evaluation metric.
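A minimal sketch of the batch-merging idea, assuming a byte-level base vocabulary and a chunk-frequency dictionary, is shown below. The function and parameter names (train_batched, select_batch, batch_size, and so on) are illustrative only and are not BatchBPE's actual API. The sketch counts adjacent token pairs weighted by chunk frequency, selects the most frequent pairs that share no tokens (so the merges within one batch cannot interfere with each other), and applies the whole batch in a single pass; representing the training text as a dictionary from distinct chunks to their counts is also one simple way to realize the memory-footprint reduction mentioned in the abstract.

```python
# Illustrative sketch of batched BPE merges (not BatchBPE's actual API).
from collections import Counter

def count_pairs(chunks):
    """Count adjacent token pairs, weighted by how often each chunk occurs."""
    pairs = Counter()
    for tokens, freq in chunks.items():
        for a, b in zip(tokens, tokens[1:]):
            pairs[(a, b)] += freq
    return pairs

def select_batch(pairs, batch_size):
    """Pick the most frequent pairs that share no tokens, so the merges in
    one batch cannot interfere with each other."""
    chosen, used = [], set()
    for pair, _ in pairs.most_common():
        if pair[0] in used or pair[1] in used:
            continue
        chosen.append(pair)
        used.update(pair)
        if len(chosen) == batch_size:
            break
    return chosen

def apply_merges(chunks, merge_ids):
    """Replace every occurrence of each batched pair with its new token id."""
    new_chunks = {}
    for tokens, freq in chunks.items():
        out, i = [], 0
        while i < len(tokens):
            pair = (tokens[i], tokens[i + 1]) if i + 1 < len(tokens) else None
            if pair in merge_ids:
                out.append(merge_ids[pair])
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        key = tuple(out)
        new_chunks[key] = new_chunks.get(key, 0) + freq
    return new_chunks

def train_batched(chunks, vocab_size, batch_size=100):
    """chunks maps a tuple of byte ids (one distinct text chunk) to its count."""
    merges, next_id = {}, 256  # ids 0-255 form the byte-level base vocabulary
    while next_id < vocab_size:
        pairs = count_pairs(chunks)
        if not pairs:
            break
        batch = select_batch(pairs, min(batch_size, vocab_size - next_id))
        merge_ids = {}
        for pair in batch:
            merge_ids[pair] = merges[pair] = next_id
            next_id += 1
        chunks = apply_merges(chunks, merge_ids)
    return merges
```

With a batch size of one this sketch reduces to ordinary BPE; larger batches select many pairs from the same frequency snapshot, trading a small amount of merge-order fidelity for far fewer passes over the data.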
Related papers
- Tokenization as Finite-State Transduction [24.19959327497118]
We introduce a finite-state framework which can efficiently encode all possible tokenizations of a regular language.
We show that Byte-Pair Encoding (BPE) and MaxMatch (WordPiece) fit within this framework.
An application of this is to guided generation, where the outputs of a language model are constrained to match some pattern.
arXiv Detail & Related papers (2024-10-21T07:10:07Z) - BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training [8.012203293561196]
Picky BPE is a modified BPE algorithm that carries out vocabulary refinement during tokenizer training.
Our method improves vocabulary efficiency, eliminates under-trained tokens, and does not compromise text compression.
arXiv Detail & Related papers (2024-09-06T20:12:34Z) - SEP: Self-Enhanced Prompt Tuning for Visual-Language Model [68.68025991850115]
We introduce a novel approach named Self-Enhanced Prompt Tuning (SEP).
SEP explicitly incorporates discriminative prior knowledge to enhance both textual-level and visual-level embeddings.
Comprehensive evaluations across various benchmarks and tasks confirm SEP's efficacy in prompt tuning.
arXiv Detail & Related papers (2024-05-24T13:35:56Z) - A cost minimization approach to fix the vocabulary size in a tokenizer for an End-to-End ASR system [10.70500939394669]
Tokenization algorithms like Byte Pair Encoding (BPE) and WordPiece are popular in identifying the tokens used in the overall training process of the speech recognition system.
We show through experiments on the LibriSpeech 100-hour set that the performance of an end-to-end ASR system improves when the number of tokens is chosen carefully.
arXiv Detail & Related papers (2024-04-29T12:16:21Z) - Scaffold-BPE: Enhancing Byte Pair Encoding with Simple and Effective Scaffold Token Removal [25.406520591282366]
We propose Scaffold-BPE, which incorporates a dynamic scaffold token removal mechanism via parameter-free, computation-light, and easy-to-implement modifications to the original BPE.
In extensive experiments across language modeling and machine translation tasks, Scaffold-BPE consistently outperforms the original BPE.
arXiv Detail & Related papers (2024-04-27T07:12:07Z) - Training LLMs over Neurally Compressed Text [55.11828645767342]
This paper explores the idea of training large language models (LLMs) over highly compressed text.
We propose Equal-Info Windows, a novel compression technique whereby text is segmented into blocks that each compress to the same bit length.
We demonstrate effective learning over neurally compressed text that improves with scale, and outperforms byte-level baselines by a wide margin on perplexity and inference speed benchmarks.
arXiv Detail & Related papers (2024-04-04T17:48:28Z) - Tokenization Is More Than Compression [14.939912120571728]
Existing tokenization approaches like Byte-Pair Encoding (BPE) originate from the field of data compression.
We introduce PathPiece, a new tokenizer that segments a document's text into the minimum number of tokens for a given vocabulary (see the sketch after this list).
arXiv Detail & Related papers (2024-02-28T14:52:15Z) - Object Recognition as Next Token Prediction [99.40793702627396]
We present an approach to pose object recognition as next token prediction.
The idea is to apply a language decoder that auto-regressively predicts the text tokens from image embeddings to form labels.
arXiv Detail & Related papers (2023-12-04T18:58:40Z) - TagCLIP: Improving Discrimination Ability of Open-Vocabulary Semantic Segmentation [53.974228542090046]
Contrastive Language-Image Pre-training (CLIP) has recently shown great promise in pixel-level zero-shot learning tasks.
Existing approaches utilizing CLIP's text and patch embeddings to generate semantic masks often misidentify input pixels from unseen classes.
We propose TagCLIP (Trusty-aware guided CLIP) to address this issue.
arXiv Detail & Related papers (2023-04-15T12:52:23Z) - Byte Pair Encoding is Suboptimal for Language Model Pretraining [49.30780227162387]
We analyze differences between unigram LM tokenization and byte-pair encoding (BPE).
We find that the unigram LM tokenization method matches or outperforms BPE across downstream tasks and two languages.
We hope that developers of future pretrained LMs will consider adopting the unigram LM method over the more prevalent BPE.
arXiv Detail & Related papers (2020-04-07T21:21:06Z)
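To complement the PathPiece entry above, here is a generic dynamic-programming sketch of minimum-token segmentation: given a fixed vocabulary, it splits a string into the fewest tokens that cover it. This illustrates the general idea only, not the paper's implementation; min_token_segmentation, max_token_len, and the toy vocabulary in the usage example are invented for illustration.

```python
def min_token_segmentation(text, vocab, max_token_len=16):
    """Split text into the fewest vocabulary tokens via dynamic programming
    (an illustration of minimum-token segmentation, not PathPiece itself)."""
    n = len(text)
    INF = float("inf")
    best = [INF] * (n + 1)   # best[i] = fewest tokens covering text[:i]
    back = [0] * (n + 1)     # back[i] = start index of the last token used
    best[0] = 0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_token_len), i):
            if best[j] + 1 < best[i] and text[j:i] in vocab:
                best[i] = best[j] + 1
                back[i] = j
    if best[n] == INF:
        return None  # the vocabulary cannot cover this text
    tokens, i = [], n
    while i > 0:
        j = back[i]
        tokens.append(text[j:i])
        i = j
    return tokens[::-1]

# Toy usage: prints a minimum 2-token segmentation, here ['lo', 'west'].
print(min_token_segmentation("lowest", {"low", "est", "lo", "west",
                                        "l", "o", "w", "e", "s", "t"}))
```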
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.