A partition cover approach to tokenization
- URL: http://arxiv.org/abs/2501.06246v2
- Date: Sun, 25 May 2025 15:39:07 GMT
- Title: A partition cover approach to tokenization
- Authors: Jia Peng Lim, Shawn Tan, Davin Choo, Hady W. Lauw
- Abstract summary: Tokenization is the process of encoding strings into tokens of a fixed vocabulary size. Byte-Pair Encoding (BPE) formulates the tokenization problem as a compression problem and tackles it by performing sequences of merges. We show that GreedTok outperforms BPE and Unigram on compression and achieves a covering score comparable to GreedWMC.
- Score: 27.78022124795594
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Tokenization is the process of encoding strings into tokens of a fixed vocabulary size, and is widely utilized in Natural Language Processing applications. The leading tokenization algorithm today is Byte-Pair Encoding (BPE), which formulates the tokenization problem as a compression problem and tackles it by performing sequences of merges. In this work, we formulate tokenization as an optimization objective, show that it is NP-hard via a simple reduction from vertex cover, and propose a polynomial-time greedy algorithm GreedTok. Our formulation naturally relaxes to the well-studied weighted maximum coverage problem, which has a simple $(1 - 1/e)$-approximation algorithm GreedWMC. Through empirical evaluations on real-world corpora, we show that GreedTok outperforms BPE and Unigram on compression and achieves a covering score comparable to GreedWMC. Finally, our extensive pre-training of two transformer-based language models with 1 billion parameters, comparing BPE and GreedTok as the tokenizer, shows that GreedTok achieves lower bits per byte even when we control for either the total dataset proportion or total training tokens.
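To make the covering formulation concrete, the sketch below implements the generic greedy weighted-maximum-coverage heuristic that GreedWMC refers to, on a toy token-selection problem: each candidate token "covers" the character positions it can tile inside each word, weighted by word counts, and each greedy step picks the token with the largest marginal covered weight. The data structures and scoring here are illustrative assumptions, not the authors' GreedTok implementation, which additionally has to ensure the selected tokens induce a valid tokenization.

```python
# Minimal sketch of greedy weighted maximum coverage (GreedWMC-style)
# token selection. Illustrative assumptions only -- NOT the paper's
# GreedTok: a token "covers" the character positions of its
# non-overlapping occurrences inside each word, weighted by word count.
from collections import Counter

def covered_positions(word: str, token: str) -> set:
    """Positions of `word` covered by non-overlapping occurrences of `token`."""
    covered, i = set(), 0
    while (j := word.find(token, i)) != -1:
        covered.update(range(j, j + len(token)))
        i = j + len(token)
    return covered

def greedy_wmc(word_counts: Counter, candidates: list, k: int) -> list:
    """Greedily pick k tokens, maximizing newly covered weight at each step."""
    covered = {w: set() for w in word_counts}  # positions covered so far
    chosen = []
    for _ in range(k):
        if not candidates:
            break
        def gain(tok):
            return sum(cnt * len(covered_positions(w, tok) - covered[w])
                       for w, cnt in word_counts.items())
        best = max(candidates, key=gain)
        if gain(best) == 0:
            break  # nothing left to cover
        chosen.append(best)
        for w in covered:
            covered[w] |= covered_positions(w, best)
        candidates = [t for t in candidates if t != best]
    return chosen

words = Counter({"lower": 5, "lowest": 3, "newer": 4})
print(greedy_wmc(words, candidates=["low", "er", "est", "new", "we"], k=3))
# -> ['low', 'er', 'new'] on this toy data
```

Because the covered weight is a monotone submodular function, this greedy rule inherits the classic $(1 - 1/e)$ guarantee for weighted maximum coverage, which is exactly the relaxation the abstract invokes.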
Related papers
- Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization [62.35048154917945]
Tokenization is the first -- and often least scrutinized -- step of most NLP pipelines. Standard algorithms for learning tokenizers rely on frequency-based objectives. We introduce Parity-aware Byte-Pair Encoding. We find empirically that Parity-aware BPE leads to more equitable token counts across languages.
arXiv Detail & Related papers (2025-08-06T18:14:43Z)
- Single-pass Adaptive Image Tokenization for Minimum Program Search [75.59409288259151]
We propose a single-pass adaptive tokenizer, KARL, which predicts the appropriate number of tokens for an image in a single forward pass. KARL matches the performance of recent adaptive tokenizers while operating in a single pass.
arXiv Detail & Related papers (2025-07-10T17:59:53Z)
- Low-Complexity Semantic Packet Aggregation for Token Communication via Lookahead Search [32.63323958382152]
This paper focuses on token packetization to maximize the average token similarity (ATS) between the original and received token channels. To address this, we propose a novel framework of semantic packet aggregation with lookahead search (SemPA-Look). SemPA-Look applies a lookahead search-inspired algorithm that samples intra-packet token candidates without replacement.
arXiv Detail & Related papers (2025-06-24T09:25:44Z)
- Training-Free Tokenizer Transplantation via Orthogonal Matching Pursuit [45.18582668677648]
We present a training-free method to transplant tokenizers in large language models. We approximate each out-of-vocabulary token as a sparse linear combination of shared tokens via Orthogonal Matching Pursuit (OMP). We show that OMP achieves the best zero-shot preservation of the base model's performance.
arXiv Detail & Related papers (2025-06-07T00:51:27Z)
- Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling [90.86991492288487]
Evaluating the constraint on every token can be prohibitively expensive. Locally constrained decoding (LCD) can distort the global distribution over strings, sampling tokens based only on local information.
We show that our approach is superior to state-of-the-art baselines.
arXiv Detail & Related papers (2025-04-07T18:30:18Z)
- Boundless Byte Pair Encoding: Breaking the Pre-tokenization Barrier [4.300681074103876]
Pre-tokenization causes the distribution of tokens in a corpus to skew towards common, full-length words. We propose BoundlessBPE, a modified BPE algorithm that relaxes the pretoken boundary constraint. Our approach selectively merges two complete pretokens into a larger unit we term a superword.
arXiv Detail & Related papers (2025-03-31T19:36:29Z)
- Theoretical Analysis of Byte-Pair Encoding [0.8655526882770742]
Byte-Pair Encoding (BPE) is a widely used method for subword tokenization (a minimal merge-loop sketch appears after this list).
We show that BPE approximates the compression utility of the optimal pair encoding to a worst-case factor.
arXiv Detail & Related papers (2024-11-13T15:04:02Z)
- Tokenization as Finite-State Transduction [24.19959327497118]
We introduce a finite-state framework which can efficiently encode all possible tokenizations of a regular language.
We show that Byte-Pair Encoding (BPE) and MaxMatch (WordPiece) fit within this framework.
An application of this is to guided generation, where the outputs of a language model are constrained to match some pattern.
arXiv Detail & Related papers (2024-10-21T07:10:07Z)
- Batching BPE Tokenization Merges [55.2480439325792]
BatchBPE is an open-source pure-Python implementation of the Byte Pair Encoding algorithm. It can be used to train a high-quality tokenizer on a basic laptop.
arXiv Detail & Related papers (2024-08-05T09:37:21Z)
- Perturb-and-Project: Differentially Private Similarities and Marginals [73.98880839337873]
We revisit the input perturbations framework for differential privacy, where noise is added to the input $A \in \mathcal{S}$.
We first design novel efficient algorithms to privately release pair-wise cosine similarities.
We derive a novel algorithm to compute $k$-way marginal queries over $n$ features.
arXiv Detail & Related papers (2024-06-07T12:07:16Z)
- Indexed Minimum Empirical Divergence-Based Algorithms for Linear Bandits [55.938644481736446]
Indexed Minimum Empirical Divergence (IMED) is a highly effective approach to the multi-armed bandit problem.
It has been observed to empirically outperform UCB-based algorithms and Thompson Sampling.
We present novel linear versions of the IMED algorithm, which we call the family of LinIMED algorithms.
arXiv Detail & Related papers (2024-05-24T04:11:58Z)
- DISCO: efficient unsupervised decoding for discrete natural language problems via convex relaxation [1.370633147306388]
We study test-time decoding, a ubiquitous step in almost all sequential text generation tasks, spanning a wide array of natural language processing (NLP) problems.
Our main contribution is to develop a continuous relaxation framework for the NP-hard decoding problem and propose Disco, an efficient algorithm based on standard first-order gradient methods.
arXiv Detail & Related papers (2021-07-07T00:40:25Z)
- Adaptive Sampling for Best Policy Identification in Markov Decision Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z)
- Revisiting Modified Greedy Algorithm for Monotone Submodular Maximization with a Knapsack Constraint [75.85952446237599]
We show that a modified greedy algorithm can achieve an approximation factor of $0.305$.
We derive a data-dependent upper bound on the optimum.
It can also be used to significantly improve the efficiency of such algorithms as branch and bound.
arXiv Detail & Related papers (2020-08-12T15:40:21Z)
- Accelerated Message Passing for Entropy-Regularized MAP Inference [89.15658822319928]
Maximum a posteriori (MAP) inference in discrete-valued random fields is a fundamental problem in machine learning.
Due to the difficulty of this problem, linear programming (LP) relaxations are commonly used to derive specialized message passing algorithms.
We present randomized methods for accelerating these algorithms by leveraging techniques that underlie classical accelerated gradient methods.
arXiv Detail & Related papers (2020-07-01T18:43:32Z)
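For contrast with the covering view above, here is an equally minimal textbook BPE training loop, the "sequences of merges" formulation that the main abstract and the Theoretical Analysis of Byte-Pair Encoding entry refer to: repeatedly merge the most frequent adjacent symbol pair. Real tokenizers operate on bytes and update pair counts incrementally, so this is a didactic sketch rather than any specific paper's implementation.

```python
# Textbook BPE merge loop (didactic sketch, not a specific paper's code):
# repeatedly merge the most frequent adjacent symbol pair in the corpus.
from collections import Counter

def train_bpe(words: dict, num_merges: int) -> list:
    corpus = {tuple(w): c for w, c in words.items()}  # words as symbol tuples
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]  # most frequent adjacent pair
        merges.append((a, b))
        new_corpus = {}
        for symbols, count in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                    out.append(a + b)  # apply the merge
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            new_corpus[key] = new_corpus.get(key, 0) + count
        corpus = new_corpus
    return merges

print(train_bpe({"lower": 5, "lowest": 3, "newer": 4}, num_merges=3))
# -> [('w', 'e'), ('we', 'r'), ('l', 'o')] on this toy corpus
```

The contrast is the point: BPE builds its vocabulary bottom-up through local merge decisions, whereas the covering formulation sketched after the main abstract scores every candidate token globally before selecting it.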
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.