Tokenization and the Noiseless Channel
- URL: http://arxiv.org/abs/2306.16842v1
- Date: Thu, 29 Jun 2023 10:32:09 GMT
- Title: Tokenization and the Noiseless Channel
- Authors: Vil\'em Zouhar, Clara Meister, Juan Luis Gastaldi, Li Du, Mrinmaya
Sachan, Ryan Cotterell
- Abstract summary: Good tokenizers lead to \emph{efficient} channel usage, where the channel is the means by which some input is conveyed to the model.
In machine translation, we find that across multiple tokenizers, the R\'enyi entropy with $\alpha = 2.5$ has a very strong correlation with \textsc{Bleu}: $0.78$ in comparison to just $-0.32$ for compressed length.
- Score: 71.25796813073399
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Subword tokenization is a key part of many NLP pipelines. However, little is
known about why some tokenizer and hyperparameter combinations lead to better
downstream model performance than others. We propose that good tokenizers lead
to \emph{efficient} channel usage, where the channel is the means by which some
input is conveyed to the model and efficiency can be quantified in
information-theoretic terms as the ratio of the Shannon entropy to the maximum
possible entropy of the token distribution. Yet, an optimal encoding according
to Shannon entropy assigns extremely long codes to low-frequency tokens and
very short codes to high-frequency tokens. Defining efficiency in terms of
R\'enyi entropy, on the other hand, penalizes distributions with either very
high or very low-frequency tokens. In machine translation, we find that across
multiple tokenizers, the R\'enyi entropy with $\alpha = 2.5$ has a very strong
correlation with \textsc{Bleu}: $0.78$ in comparison to just $-0.32$ for
compressed length.
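The efficiency measure above can be made concrete. Below is a minimal Python sketch, assuming the token distribution is estimated from unigram counts over a tokenized corpus and taking $|V|$ to be the number of distinct tokens observed; the function name and these conventions are ours, not the paper's.

```python
import numpy as np
from collections import Counter

def renyi_efficiency(tokens, alpha=2.5):
    """Renyi efficiency of a unigram token distribution: H_alpha(p) / log|V|.

    `tokens` is any iterable of token strings; |V| is taken to be the
    number of distinct tokens observed (an assumption of this sketch).
    """
    counts = np.array(list(Counter(tokens).values()), dtype=float)
    p = counts / counts.sum()
    if len(p) < 2:
        return 0.0  # degenerate one-token vocabulary: treat efficiency as 0
    if np.isclose(alpha, 1.0):
        entropy = -np.sum(p * np.log(p))  # Shannon entropy (alpha -> 1 limit)
    else:
        entropy = np.log(np.sum(p ** alpha)) / (1.0 - alpha)
    return entropy / np.log(len(p))  # maximum entropy over |V| outcomes is log|V|
```

For example, `renyi_efficiency("the cat sat on the mat".split())` scores the toy distribution in which "the" appears twice and every other token once; `alpha=1.0` recovers the Shannon-entropy ratio discussed above.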
Related papers
- ElasticTok: Adaptive Tokenization for Image and Video [109.75935878130582]
We introduce ElasticTok, a method that conditions on prior frames to adaptively encode a frame into a variable number of tokens.
During inference, ElasticTok can dynamically allocate tokens when needed.
Our evaluations on images and video demonstrate the effectiveness of our approach in efficient token usage.
arXiv Detail & Related papers (2024-10-10T20:54:15Z)
- Fast Rates for Bandit PAC Multiclass Classification [73.17969992976501]
We study multiclass PAC learning with bandit feedback, where inputs are classified into one of $K$ possible labels and feedback is limited to whether or not the predicted labels are correct.
Our main contribution is a novel learning algorithm for the agnostic $(\varepsilon,\delta)$-PAC version of the problem.
arXiv Detail & Related papers (2024-06-18T08:54:04Z)
- Some Notes on the Sample Complexity of Approximate Channel Simulation [2.4554686192257424]
Channel simulation algorithms can efficiently encode random samples from a prescribed target distribution $Q$ and find applications in machine learning-based lossy data compression.
This paper considers approximate schemes with a fixed runtime instead.
We exploit global-bound, depth-limited A* coding to ensure $\mathrm{TV}[Q \Vert P] \leq \epsilon$ and maintain optimal coding performance with a sample complexity of only $\exp\big((D_{\mathrm{KL}}[Q \Vert P] + o(1))\big/\epsilon\big)$.
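As a rough numerical illustration of this bound (a sketch of ours, not code from the paper; the $o(1)$ term is dropped):

```python
import numpy as np

def sample_complexity_bound(q, p, eps):
    """Dominant factor of the bound quoted above, exp(D_KL[Q||P] / eps);
    dropping the o(1) term is a simplification of this sketch."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    mask = q > 0
    kl = np.sum(q[mask] * np.log(q[mask] / p[mask]))  # D_KL[Q||P] in nats
    return np.exp(kl / eps)
```

For instance, `sample_complexity_bound([0.5, 0.5], [0.9, 0.1], eps=0.1)` shows how quickly the bound grows as the total-variation tolerance $\epsilon$ shrinks.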
arXiv Detail & Related papers (2024-05-07T14:44:41Z)
- The closed-branch decoder for quantum LDPC codes [0.0]
Real-time decoding is a necessity for implementing arbitrary quantum computations on the logical level.
We present a new decoder for Quantum Low Density Parity Check (QLDPC) codes, named the closed-branch decoder.
arXiv Detail & Related papers (2024-02-02T16:22:32Z)
- A lower bound on the space overhead of fault-tolerant quantum computation [51.723084600243716]
The threshold theorem is a fundamental result in the theory of fault-tolerant quantum computation.
We prove an exponential upper bound on the maximal length of fault-tolerant quantum computation with amplitude noise.
arXiv Detail & Related papers (2022-01-31T22:19:49Z)
- Random quantum circuits transform local noise into global white noise [118.18170052022323]
We study the distribution over measurement outcomes of noisy random quantum circuits in the low-fidelity regime.
For local noise that is sufficiently weak and unital, correlations (measured by the linear cross-entropy benchmark) between the output distribution $p_\text{noisy}$ of a generic noisy circuit instance and the ideal output distribution shrink exponentially.
If the noise is incoherent, the output distribution approaches the uniform distribution $p_\text{unif}$ at precisely the same rate.
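A small sketch of the white-noise picture described here (function names are ours; the paper's result is the approximation itself, not this code):

```python
import numpy as np

def white_noise_output(p_ideal, fidelity):
    """Global white-noise approximation: the noisy output distribution
    is close to the convex mixture F * p_ideal + (1 - F) * p_unif."""
    p_ideal = np.asarray(p_ideal, dtype=float)
    return fidelity * p_ideal + (1.0 - fidelity) / len(p_ideal)

def linear_xeb(p_ideal, samples):
    """Linear cross-entropy benchmark d * E[p_ideal(x)] - 1, estimated
    from sampled outcome indices; it is approximately F under the
    white-noise model for typical random-circuit distributions."""
    p_ideal = np.asarray(p_ideal, dtype=float)
    return len(p_ideal) * np.mean(p_ideal[np.asarray(samples)]) - 1.0
```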
arXiv Detail & Related papers (2021-11-29T19:26:28Z)
- Gaussian Process Bandit Optimization with Few Batches [49.896920704012395]
We introduce a batch algorithm inspired by finite-arm bandit algorithms.
We show that it achieves the cumulative regret upper bound $O^\ast(\sqrt{T\gamma_T})$ using $O(\log\log T)$ batches within time horizon $T$.
In addition, we propose a modified version of our algorithm, and characterize how the regret is impacted by the number of batches.
arXiv Detail & Related papers (2021-10-15T00:54:04Z)
- CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation [12.005340904206697]
CANINE is a neural encoder that operates directly on character sequences without explicit tokenization or vocabulary.
CANINE outperforms a comparable mBERT model by $\geq 1$ F1 on TyDi QA, a challenging multilingual benchmark.
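As a loose illustration of what "no explicit tokenization" can look like, here is a hypothetical hash-based character embedding in the spirit of CANINE; the hash scheme, table sizes, and names below are ours, not the paper's implementation:

```python
import numpy as np

def hash_char_embeddings(text, num_hashes=4, num_buckets=16000,
                         dim_per_hash=8, seed=0):
    """Map each Unicode codepoint to a vector by hashing it into several
    bucket tables and concatenating the bucket vectors (toy sketch)."""
    rng = np.random.default_rng(seed)
    tables = rng.normal(size=(num_hashes, num_buckets, dim_per_hash))
    rows = []
    for ch in text:
        cp = ord(ch)
        pieces = [tables[k, (cp * (2 * k + 3)) % num_buckets]
                  for k in range(num_hashes)]  # toy per-table hash
        rows.append(np.concatenate(pieces))
    return np.stack(rows)  # shape: (len(text), num_hashes * dim_per_hash)
```

Because every codepoint hashes to some bucket, no vocabulary or subword segmentation is needed, which is the property the entry above highlights.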
arXiv Detail & Related papers (2021-03-11T18:57:44Z)
- From Information Theory Puzzles in Deletion Channels to Deniability in Quantum Cryptography [0.0]
We first conjecture on the basis of experimental data that the entropy of the posterior is minimized by the constant strings.
We then establish a connection between covert communication and deniability to propose DC-QKE.
We present an efficient coercion-resistant and quantum-secure voting scheme, based on fully homomorphic encryption.
arXiv Detail & Related papers (2020-03-25T22:20:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.