Tokenization and the Noiseless Channel
- URL: http://arxiv.org/abs/2306.16842v1
- Date: Thu, 29 Jun 2023 10:32:09 GMT
- Title: Tokenization and the Noiseless Channel
- Authors: Vilém Zouhar, Clara Meister, Juan Luis Gastaldi, Li Du, Mrinmaya Sachan, Ryan Cotterell
- Abstract summary: Good tokenizers lead to \emph{efficient} channel usage, where the channel is the means by which some input is conveyed to the model.
In machine translation, we find that across multiple tokenizers, the R\'enyi entropy with $\alpha = 2.5$ has a very strong correlation with \textsc{Bleu}: $0.78$ in comparison to just $-0.32$ for compressed length.
- Score: 71.25796813073399
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Subword tokenization is a key part of many NLP pipelines. However, little is
known about why some tokenizer and hyperparameter combinations lead to better
downstream model performance than others. We propose that good tokenizers lead
to \emph{efficient} channel usage, where the channel is the means by which some
input is conveyed to the model and efficiency can be quantified in
information-theoretic terms as the ratio of the Shannon entropy to the maximum
possible entropy of the token distribution. Yet, an optimal encoding according
to Shannon entropy assigns extremely long codes to low-frequency tokens and
very short codes to high-frequency tokens. Defining efficiency in terms of
R\'enyi entropy, on the other hand, penalizes distributions with either very
high or very low-frequency tokens. In machine translation, we find that across
multiple tokenizers, the R\'enyi entropy with $\alpha = 2.5$ has a very strong
correlation with \textsc{Bleu}: $0.78$ in comparison to just $-0.32$ for
compressed length.
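To make the efficiency measure above concrete, here is a minimal Python sketch (an illustration, not the authors' code; their exact estimator and normalization may differ) that computes the order-$\alpha$ Rényi efficiency of a tokenizer's unigram token distribution as its Rényi entropy divided by the maximum possible entropy, $\log |V|$. The Shannon-based definition corresponds to the $\alpha \to 1$ limit, which this sketch does not special-case.

```python
import numpy as np
from collections import Counter

def renyi_efficiency(token_counts, alpha=2.5):
    """Order-`alpha` Rényi efficiency of a unigram token distribution.

    Sketch of the quantity described in the abstract: the Rényi entropy
    H_alpha(p) = log(sum_i p_i**alpha) / (1 - alpha), normalized by the
    maximum possible entropy log|V| (uniform distribution over the
    observed token types). The paper's exact formulation may differ.
    """
    counts = np.asarray(list(token_counts.values()), dtype=float)
    p = counts / counts.sum()                              # unigram distribution
    h_alpha = np.log((p ** alpha).sum()) / (1.0 - alpha)   # Rényi entropy (nats)
    h_max = np.log(len(p))                                 # entropy of the uniform dist.
    return h_alpha / h_max

# Toy comparison of two hypothetical "tokenizations" of the same sentence:
# whitespace tokens vs. individual characters.
words = Counter("the cat sat on the mat".split())
chars = Counter("the cat sat on the mat")
print(renyi_efficiency(words), renyi_efficiency(chars))
```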
Related papers
- Sign Operator for Coping with Heavy-Tailed Noise: High Probability Convergence Bounds with Extensions to Distributed Optimization and Comparison Oracle [77.3806516979843]
We show that SignSGD achieves optimal sample complexity $\tilde{O}\big(\varepsilon^{-\frac{3\kappa - 2}{\kappa - 1}}\big)$ with high probability.
We also explore the application of the sign operator in zeroth-order optimization with an oracle that can only compare function values at two different points.
arXiv Detail & Related papers (2025-02-11T19:54:11Z) - ZETA: Leveraging Z-order Curves for Efficient Top-k Attention [22.90397380324185]
We propose ZETA to enable parallel querying of past tokens for entire sequences.
ZETA matches the performance of standard attention on the synthetic \textsc{Multi-Query Associative Recall} task.
arXiv Detail & Related papers (2025-01-24T15:33:05Z) - HashAttention: Semantic Sparsity for Faster Inference [91.54218318798603]
HashAttention is a principled approach casting pivotal token identification as a recommendation problem.
It efficiently identifies pivotal tokens for a given query in this Hamming space using bitwise operations.
It can reduce the number of tokens used by a factor of $1/32\times$ for the Llama-3.1-8B model with LongBench.
arXiv Detail & Related papers (2024-12-19T02:34:15Z) - ElasticTok: Adaptive Tokenization for Image and Video [109.75935878130582]
We introduce ElasticTok, a method that conditions on prior frames to adaptively encode a frame into a variable number of tokens.
During inference, ElasticTok can dynamically allocate tokens when needed.
Our evaluations on images and video demonstrate the effectiveness of our approach in efficient token usage.
arXiv Detail & Related papers (2024-10-10T20:54:15Z) - Some Notes on the Sample Complexity of Approximate Channel Simulation [2.4554686192257424]
Channel simulation algorithms can efficiently encode random samples from a prescribed target distribution $Q$ and find applications in machine learning-based lossy data compression.
This paper considers approximate schemes with a fixed runtime instead.
We exploit global-bound, depth-limited A* coding to ensure $\mathrm{TV}[Q \Vert P] \leq \epsilon$ and maintain optimal coding performance with a sample complexity of only $\exp\big((D_{\mathrm{KL}}[Q \Vert P] + o(1)) \big/ \epsilon\big)$.
arXiv Detail & Related papers (2024-05-07T14:44:41Z) - A lower bound on the space overhead of fault-tolerant quantum computation [51.723084600243716]
The threshold theorem is a fundamental result in the theory of fault-tolerant quantum computation.
We prove an exponential upper bound on the maximal length of fault-tolerant quantum computation with amplitude noise.
arXiv Detail & Related papers (2022-01-31T22:19:49Z) - Random quantum circuits transform local noise into global white noise [118.18170052022323]
We study the distribution over measurement outcomes of noisy random quantum circuits in the low-fidelity regime.
For local noise that is sufficiently weak and unital, correlations (measured by the linear cross-entropy benchmark) between the output distribution $p_{\text{noisy}}$ of a generic noisy circuit instance and the corresponding ideal output distribution shrink exponentially.
If the noise is incoherent, the output distribution approaches the uniform distribution $p_{\text{unif}}$ at precisely the same rate.
arXiv Detail & Related papers (2021-11-29T19:26:28Z) - CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language
Representation [12.005340904206697]
CANINE is a neural encoder that operates directly on character sequences without explicit tokenization or vocabulary.
CANINE outperforms a comparable mBERT model by >= 1 F1 on TyDi QA, a challenging multilingual benchmark.
arXiv Detail & Related papers (2021-03-11T18:57:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.