Tokenization and the Noiseless Channel
- URL: http://arxiv.org/abs/2306.16842v1
- Date: Thu, 29 Jun 2023 10:32:09 GMT
- Title: Tokenization and the Noiseless Channel
- Authors: Vilém Zouhar, Clara Meister, Juan Luis Gastaldi, Li Du, Mrinmaya Sachan, Ryan Cotterell
- Abstract summary: Good tokenizers lead to \emph{efficient} channel usage, where the channel is the means by which some input is conveyed to the model.
In machine translation, we find that across multiple tokenizers, the R\'enyi entropy with $\alpha = 2.5$ has a very strong correlation with \textsc{Bleu}: $0.78$ in comparison to just $-0.32$ for compressed length.
- Score: 71.25796813073399
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Subword tokenization is a key part of many NLP pipelines. However, little is
known about why some tokenizer and hyperparameter combinations lead to better
downstream model performance than others. We propose that good tokenizers lead
to \emph{efficient} channel usage, where the channel is the means by which some
input is conveyed to the model and efficiency can be quantified in
information-theoretic terms as the ratio of the Shannon entropy to the maximum
possible entropy of the token distribution. Yet, an optimal encoding according
to Shannon entropy assigns extremely long codes to low-frequency tokens and
very short codes to high-frequency tokens. Defining efficiency in terms of
R\'enyi entropy, on the other hand, penalizes distributions with either very
high or very low-frequency tokens. In machine translation, we find that across
multiple tokenizers, the R\'enyi entropy with $\alpha = 2.5$ has a very strong
correlation with \textsc{Bleu}: $0.78$ in comparison to just $-0.32$ for
compressed length.
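To make the efficiency measure above concrete, here is a minimal Python sketch (an illustration, not the authors' code; their exact estimator and normalization may differ) that computes the order-$\alpha$ Rényi efficiency of a tokenizer's unigram token distribution as its Rényi entropy divided by the maximum possible entropy, $\log |V|$. The Shannon-based definition corresponds to the $\alpha \to 1$ limit, which this sketch does not special-case.

```python
import numpy as np
from collections import Counter

def renyi_efficiency(token_counts, alpha=2.5):
    """Order-`alpha` Rényi efficiency of a unigram token distribution.

    Sketch of the quantity described in the abstract: the Rényi entropy
    H_alpha(p) = log(sum_i p_i**alpha) / (1 - alpha), normalized by the
    maximum possible entropy log|V| (uniform distribution over the
    observed token types). The paper's exact formulation may differ.
    """
    counts = np.asarray(list(token_counts.values()), dtype=float)
    p = counts / counts.sum()                              # unigram distribution
    h_alpha = np.log((p ** alpha).sum()) / (1.0 - alpha)   # Rényi entropy (nats)
    h_max = np.log(len(p))                                 # entropy of the uniform dist.
    return h_alpha / h_max

# Toy comparison of two hypothetical "tokenizations" of the same sentence:
# whitespace tokens vs. individual characters.
words = Counter("the cat sat on the mat".split())
chars = Counter("the cat sat on the mat")
print(renyi_efficiency(words), renyi_efficiency(chars))
```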
Related papers
- Sign Operator for Coping with Heavy-Tailed Noise: High Probability Convergence Bounds with Extensions to Distributed Optimization and Comparison Oracle [77.3806516979843]
We show that SignSGD achieves optimal sample complexity $\tilde{O}\big(\varepsilon^{-\frac{3\kappa - 2}{\kappa - 1}}\big)$ with high probability.
We also explore the application of the sign operator in zeroth-order optimization with an oracle that can only compare function values at two different points.
arXiv Detail & Related papers (2025-02-11T19:54:11Z) - ZETA: Leveraging Z-order Curves for Efficient Top-k Attention [22.90397380324185]
We propose ZETA to enable parallel querying of past tokens for entire sequences.
ZETA matches the performance of standard attention on the synthetic \textsc{Multi-Query Associative Recall} task.
arXiv Detail & Related papers (2025-01-24T15:33:05Z) - HashAttention: Semantic Sparsity for Faster Inference [91.54218318798603]
HashAttention is a principled approach casting pivotal token identification as a recommendation problem.
It efficiently identifies pivotal tokens for a given query in this Hamming space using bitwise operations.
It can reduce the number of tokens used by a factor of $1/32\times$ for the Llama-3.1-8B model with LongBench.
arXiv Detail & Related papers (2024-12-19T02:34:15Z) - ElasticTok: Adaptive Tokenization for Image and Video [109.75935878130582]
We introduce ElasticTok, a method that conditions on prior frames to adaptively encode a frame into a variable number of tokens.
During inference, ElasticTok can dynamically allocate tokens when needed.
Our evaluations on images and video demonstrate the effectiveness of our approach in efficient token usage.
arXiv Detail & Related papers (2024-10-10T20:54:15Z) - Some Notes on the Sample Complexity of Approximate Channel Simulation [2.4554686192257424]
Channel simulation algorithms can efficiently encode random samples from a prescribed target distribution $Q$ and find applications in machine learning-based lossy data compression.
This paper considers approximate schemes with a fixed runtime instead.
We exploit global-bound, depth-limited A* coding to ensure $\mathrm{TV}[Q \Vert P] \leq \epsilon$ and maintain optimal coding performance with a sample complexity of only $\exp\big((D_{\mathrm{KL}}[Q \Vert P] + o(1)) \big/ \epsilon\big)$.
arXiv Detail & Related papers (2024-05-07T14:44:41Z) - A lower bound on the space overhead of fault-tolerant quantum computation [51.723084600243716]
The threshold theorem is a fundamental result in the theory of fault-tolerant quantum computation.
We prove an exponential upper bound on the maximal length of fault-tolerant quantum computation with amplitude noise.
arXiv Detail & Related papers (2022-01-31T22:19:49Z) - Random quantum circuits transform local noise into global white noise [118.18170052022323]
We study the distribution over measurement outcomes of noisy random quantum circuits in the low-fidelity regime.
For local noise that is sufficiently weak and unital, correlations (measured by the linear cross-entropy benchmark) between the output distribution $p_{\text{noisy}}$ of a generic noisy circuit instance and the corresponding ideal output distribution shrink exponentially.
If the noise is incoherent, the output distribution approaches the uniform distribution $p_{\text{unif}}$ at precisely the same rate.
arXiv Detail & Related papers (2021-11-29T19:26:28Z) - CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language
Representation [12.005340904206697]
CANINE is a neural encoder that operates directly on character sequences without explicit tokenization or vocabulary.
CANINE outperforms a comparable mBERT model by >= 1 F1 on TyDi QA, a challenging multilingual benchmark.
arXiv Detail & Related papers (2021-03-11T18:57:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.