Where is the signal in tokenization space?
- URL: http://arxiv.org/abs/2408.08541v1
- Date: Fri, 16 Aug 2024 05:56:10 GMT
- Title: Where is the signal in tokenization space?
- Authors: Renato Lui Geh, Honghua Zhang, Kareem Ahmed, Benjie Wang, Guy Van den Broeck
- Abstract summary: Large Language Models (LLMs) are typically shipped with tokenizers that deterministically encode text into so-called canonical token sequences.
In this paper, we study non-canonical tokenizations.
- Score: 31.016041295876864
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) are typically shipped with tokenizers that deterministically encode text into so-called canonical token sequences, to which the LLMs assign probability values. One common assumption is that the probability of a piece of text is the probability of its canonical token sequence. However, the tokenization of a string is not unique: e.g., the Llama2 tokenizer encodes "Tokens" as [Tok, ens], but [Tok, en, s] also represents the same text. In this paper, we study non-canonical tokenizations. We prove that, given a string, it is computationally hard to find the most likely tokenization for an autoregressive LLM, as well as to compute the marginal probability over all possible tokenizations. We then show how the marginal is, in most cases, indistinguishable from the canonical probability. Surprisingly, we then empirically demonstrate the existence of a significant amount of signal hidden within tokenization space. Notably, by simply aggregating the probabilities of non-canonical tokenizations, we achieve improvements across a range of LLM evaluation benchmarks for a variety of architectures, including transformers and state space models.
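To make the idea concrete, here is a minimal, self-contained sketch (not the authors' code; the toy vocabulary and the `lm_logprob` stub are hypothetical). It enumerates every tokenization of a short string by brute force and sums their probabilities; since the paper proves both the argmax and the marginal are computationally hard in general, brute force only works on toy inputs like this one.

```python
import math

VOCAB = {"Tok", "en", "ens", "s", "T", "ok"}  # toy vocabulary

def segmentations(text):
    """Yield every way to split `text` into tokens drawn from VOCAB."""
    if not text:
        yield []
        return
    for i in range(1, len(text) + 1):
        if text[:i] in VOCAB:
            for rest in segmentations(text[i:]):
                yield [text[:i]] + rest

def lm_logprob(tokens):
    # Stub standing in for an autoregressive LLM's sum of log P(t_i | t_<i).
    return -2.0 * len(tokens)  # here, shorter tokenizations score higher

tokenizations = list(segmentations("Tokens"))
marginal = sum(math.exp(lm_logprob(t)) for t in tokenizations)
canonical = max(tokenizations, key=lm_logprob)  # brute-force argmax
print(f"{len(tokenizations)} tokenizations, e.g. {tokenizations[:2]}")
print(f"marginal P(text) = {marginal:.4g}; most likely: {canonical}")
```

The `marginal` sum corresponds to the aggregation over non-canonical tokenizations that the abstract reports as yielding benchmark improvements.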
Related papers
- Token embeddings violate the manifold hypothesis [1.5621144215664768]
We elucidate the structure of the token embeddings, the input domain for large language models.
We present a generalized and statistically testable model where the neighborhood of each token splits into well-defined signal and noise dimensions.
arXiv Detail & Related papers (2025-04-01T17:40:12Z)
- Language Model Uncertainty Quantification with Attention Chain [9.093726246465117]
A large language model's (LLM) predictive uncertainty is crucial for judging the reliability of its answers.
We propose UQAC, an efficient method that narrows the reasoning space to a tractable size for marginalization.
We validate UQAC on multiple reasoning benchmarks with advanced open-source LLMs.
arXiv Detail & Related papers (2025-03-24T21:43:47Z)
- Not all tokens are created equal: Perplexity Attention Weighted Networks for AI generated text detection [49.15148871877941]
Next-token distribution outputs offer a theoretically appealing approach for detecting text generated by large language models (LLMs).
We propose the Perplexity Attention Weighted Network (PAWN), which uses the LLM's last hidden states and token positions to weight a sum of features derived from next-token distribution metrics across the sequence.
PAWN shows competitive and even better in-distribution performance than the strongest baselines with a fraction of their trainable parameters; a loose sketch of the weighting idea follows.
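The feature set, shapes, and projection below are my assumptions for illustration, not PAWN's actual architecture:

```python
import numpy as np

# Loose sketch: per-position features from the next-token distribution are
# pooled with a data-dependent weighting computed from the hidden states.
rng = np.random.default_rng(0)
T, H = 12, 16                        # sequence length, hidden size
nll = rng.uniform(0.5, 6.0, T)       # -log p(observed token) per position
entropy = rng.uniform(0.1, 4.0, T)   # next-token distribution entropy
hidden = rng.normal(size=(T, H))     # last-layer hidden states

w = rng.normal(size=H)               # stand-in for a learned projection
scores = hidden @ w
weights = np.exp(scores - scores.max())
weights /= weights.sum()             # attention-style weights over positions

features = np.stack([nll, entropy], axis=1)  # (T, 2) per-position features
pooled = weights @ features                  # weighted sum across positions
print("pooled features:", pooled)            # would feed a small classifier
```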
arXiv Detail & Related papers (2025-01-07T17:00:49Z)
- Improbable Bigrams Expose Vulnerabilities of Incomplete Tokens in Byte-Level Tokenizers [32.274579719726546]
Tokenization is a crucial step that bridges human-readable text with model-readable discrete tokens.
Recent studies have revealed that tokenizers can be exploited to elicit unwanted model behaviors.
We investigate incomplete tokens, i.e., undecodable tokens with stray bytes resulting from byte-level byte-pair encoding (BPE) tokenization.
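To make "incomplete token" concrete, a small self-contained sketch (the vocabulary below is hypothetical): such tokens carry raw bytes that do not decode as UTF-8 on their own, e.g. a multi-byte character split across tokens.

```python
def is_valid_utf8(raw: bytes) -> bool:
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

vocab = {  # hypothetical token-id -> raw bytes from a byte-level BPE vocab
    0: b"hello",
    1: b" world",
    2: "\u20ac".encode()[:2],  # first two bytes of the euro sign
    3: "\u20ac".encode()[2:],  # its trailing continuation byte
}

incomplete = {tid: raw for tid, raw in vocab.items() if not is_valid_utf8(raw)}
print(incomplete)  # {2: b'\xe2\x82', 3: b'\xac'}
```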
arXiv Detail & Related papers (2024-10-31T07:19:44Z)
- FIRP: Faster LLM inference via future intermediate representation prediction [54.897493351694195]
FIRP generates multiple tokens instead of one at each decoding step.
We conduct extensive experiments, showing a speedup ratio of 1.9x-3x in several models and datasets.
arXiv Detail & Related papers (2024-10-27T15:53:49Z)
- Exact Byte-Level Probabilities from Tokenized Language Models for FIM-Tasks and Model Ensembles [23.134664392314264]
Tokenization is associated with many poorly understood shortcomings in language models (LMs).
This work studies how tokenization impacts model performance by analyzing and comparing models with their byte-level counterparts.
We introduce the Byte-Token Representation Lemma, a framework that establishes a mapping between the learned token distribution and its equivalent byte-level distribution.
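A minimal sketch of one direction of that mapping (a simplification for illustration, not the paper's full lemma, which also accounts for probability mass carried into later bytes of multi-byte tokens): marginalize a next-token distribution into a next-byte distribution by grouping tokens on their first byte.

```python
from collections import defaultdict

next_token_probs = {b"the": 0.5, b"th": 0.2, b"to": 0.2, b"a": 0.1}  # hypothetical

next_byte_probs = defaultdict(float)
for token, p in next_token_probs.items():
    next_byte_probs[token[:1]] += p  # all of a token's mass goes to its first byte

print(dict(next_byte_probs))  # {b't': 0.9, b'a': 0.1}
```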
arXiv Detail & Related papers (2024-10-11T23:30:42Z)
- Transformers are Universal In-context Learners [21.513210412394965]
We show that deep transformers can approximate continuous in-context mappings to arbitrary precision, uniformly over compact token domains.
A key aspect of our results, compared to existing findings, is that for a fixed precision, a single transformer can operate on an arbitrary (even infinite) number of tokens.
arXiv Detail & Related papers (2024-08-02T16:21:48Z)
- Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration [54.897493351694195]
We propose a novel parallel decoding approach, namely *hidden transfer*, which decodes multiple successive tokens simultaneously in a single forward pass.
In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
arXiv Detail & Related papers (2024-04-18T09:17:06Z)
- Token Fusion: Bridging the Gap between Token Pruning and Token Merging [71.84591084401458]
Vision Transformers (ViTs) have emerged as powerful backbones in computer vision, outperforming many traditional CNNs.
However, their computational overhead, largely attributed to the self-attention mechanism, makes deployment on resource-constrained edge devices challenging.
We introduce "Token Fusion" (ToFu), a method that amalgamates the benefits of both token pruning and token merging.
arXiv Detail & Related papers (2023-12-02T04:29:19Z)
- Text vectorization via transformer-based language models and n-gram perplexities [0.0]
Because perplexity is a single scalar value computed over the entire input, information about the probability distribution within the input is lost in the calculation.
This research proposes a simple algorithm that calculates a vector of values from the n-gram perplexities within the input; a sketch of the idea follows.
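This is my reconstruction from the summary, not the authors' code; `logprob` is a stub for a real LM's log P(token | preceding context). Each sliding n-gram window contributes one perplexity value, so the input maps to a vector rather than a single scalar.

```python
import math

def logprob(token, context):
    return -1.5 - 0.1 * len(token)  # placeholder scores from a stub LM

def ngram_perplexity_vector(tokens, n=3):
    vec = []
    for i in range(len(tokens) - n + 1):
        window = tokens[i : i + n]
        avg_nll = -sum(
            logprob(tok, tokens[: i + j]) for j, tok in enumerate(window)
        ) / n
        vec.append(math.exp(avg_nll))  # perplexity of this n-gram window
    return vec

print(ngram_perplexity_vector(["the", "cat", "sat", "on", "mat"], n=3))
```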
arXiv Detail & Related papers (2023-07-18T13:38:39Z)
- Should you marginalize over possible tokenizations? [13.07994518230055]
We show that the gap in log-likelihood between scoring only the canonical tokenization and marginalizing over all possible tokenizations is no larger than 0.5% in most cases.
arXiv Detail & Related papers (2023-06-30T16:09:01Z)
- Tokenization and the Noiseless Channel [71.25796813073399]
Good tokenizers lead to *efficient channel usage*, where the channel is the means by which some input is conveyed to the model.
In machine translation, we find that across multiple tokenizers, the Rényi entropy with $\alpha = 2.5$ has a very strong correlation with BLEU: $0.78$ in comparison to just $-0.32$ for compressed length.
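For reference, the Rényi entropy of a distribution $p$ is $H_\alpha(p) = \frac{1}{1-\alpha} \log \sum_i p_i^\alpha$. A small sketch computing it over a tokenizer's unigram token distribution (the corpus here is a stand-in):

```python
import math
from collections import Counter

def renyi_entropy(counts, alpha=2.5):
    total = sum(counts.values())
    return math.log(sum((c / total) ** alpha for c in counts.values())) / (1 - alpha)

tokens = "the cat sat on the mat the end".split()  # stand-in token stream
print(renyi_entropy(Counter(tokens), alpha=2.5))
```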
arXiv Detail & Related papers (2023-06-29T10:32:09Z)
- Lexinvariant Language Models [84.2829117441298]
Token embeddings, a mapping from discrete lexical symbols to continuous vectors, are at the heart of any language model (LM).
We study *lexinvariant* language models that are invariant to lexical symbols and therefore do not need fixed token embeddings in practice.
We show that a lexinvariant LM can attain perplexity comparable to that of a standard language model, given a sufficiently long context.
arXiv Detail & Related papers (2023-05-24T19:10:46Z)
- You should evaluate your language model on marginal likelihood over tokenisations [5.824498637088864]
We argue that language models should be evaluated on their marginal likelihood over tokenisations.
We evaluate pretrained English and German language models on both the one-best-tokenisation and marginal perplexities.
arXiv Detail & Related papers (2021-09-06T15:37:02Z)
- Fast End-to-End Speech Recognition via a Non-Autoregressive Model and Cross-Modal Knowledge Transferring from BERT [72.93855288283059]
We propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once).
The model consists of an encoder, a decoder, and a position-dependent summarizer (PDS).
arXiv Detail & Related papers (2021-02-15T15:18:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.