Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs
- URL: http://arxiv.org/abs/2406.20086v3
- Date: Fri, 11 Oct 2024 16:20:16 GMT
- Title: Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs
- Authors: Sheridan Feucht, David Atkinson, Byron Wallace, David Bau
- Abstract summary: Llama-2-7b's tokenizer splits the word "northeastern" into the tokens ['_n', 'ort', 'he', 'astern'], none of which correspond to semantically meaningful units like "north" or "east".
In this work, we find that last token representations of named entities and multi-token words exhibit a pronounced "erasure" effect, where information about previous and current tokens is rapidly forgotten in early layers.
- Score: 20.1025293763531
- Abstract: LLMs process text as sequences of tokens that roughly correspond to words, where less common words are represented by multiple tokens. However, individual tokens are often semantically unrelated to the meanings of the words/concepts they comprise. For example, Llama-2-7b's tokenizer splits the word "northeastern" into the tokens ['_n', 'ort', 'he', 'astern'], none of which correspond to semantically meaningful units like "north" or "east." Similarly, the overall meanings of named entities like "Neil Young" and multi-word expressions like "break a leg" cannot be directly inferred from their constituent tokens. Mechanistically, how do LLMs convert such arbitrary groups of tokens into useful higher-level representations? In this work, we find that last token representations of named entities and multi-token words exhibit a pronounced "erasure" effect, where information about previous and current tokens is rapidly forgotten in early layers. Using this observation, we propose a method to "read out" the implicit vocabulary of an autoregressive LLM by examining differences in token representations across layers, and present results of this method for Llama-2-7b and Llama-3-8B. To our knowledge, this is the first attempt to probe the implicit vocabulary of an LLM.
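The probing idea in the abstract, comparing a word's last-token hidden state across layers to detect early-layer "erasure", can be illustrated with a minimal sketch. It assumes the Hugging Face transformers API and the gated meta-llama/Llama-2-7b-hf checkpoint (any causal LM id would do for illustration), and it uses cosine similarity to the embedding-layer state as a stand-in drift measure, not the paper's exact scoring.

```python
# Minimal sketch (assumed setup, not the authors' exact procedure):
# tokenize a multi-token word, expose all hidden states, and track how far
# the last-token representation drifts from its embedding-layer value.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL = "meta-llama/Llama-2-7b-hf"  # gated on the Hub; swap in any causal LM for a quick test
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

word = "northeastern"
ids = tok(word, return_tensors="pt", add_special_tokens=False)
print(tok.convert_ids_to_tokens(ids["input_ids"][0].tolist()))  # e.g. ['_n', 'ort', 'he', 'astern']

with torch.no_grad():
    out = model(**ids, output_hidden_states=True)

# hidden_states: tuple of (num_layers + 1) tensors, each [batch, seq_len, hidden]
hs = torch.stack(out.hidden_states).squeeze(1)  # [layers+1, seq_len, hidden]
last = hs[:, -1, :]                             # last-token representation at every layer

# Cosine similarity of each layer's last-token state to the embedding-layer state:
# a sharp drop in the early layers is read here as the token's surface identity
# being "erased" in favor of a higher-level word representation.
sim_to_layer0 = F.cosine_similarity(last, last[0].unsqueeze(0), dim=-1)
for layer, s in enumerate(sim_to_layer0.tolist()):
    print(f"layer {layer:2d}: cos-sim to layer 0 = {s:.3f}")
```

In this sketch, an early, steep similarity drop for the final subword of "northeastern" would be the erasure footprint described above; the paper's actual readout examines such layer-wise representation differences to recover an implicit vocabulary for Llama-2-7b and Llama-3-8B.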
Related papers
- Tomato, Tomahto, Tomate: Measuring the Role of Shared Semantics among Subwords in Multilingual Language Models [88.07940818022468]
We take an initial step toward measuring the role of shared semantics among subwords in encoder-only multilingual language models (mLMs).
We form "semantic tokens" by merging semantically similar subwords and their embeddings.
Inspections of the grouped subwords show that they exhibit a wide range of semantic similarities.
arXiv Detail & Related papers (2024-11-07T08:38:32Z) - FIRP: Faster LLM inference via future intermediate representation prediction [54.897493351694195]
FIRP generates multiple tokens instead of one at each decoding step.
We conduct extensive experiments, showing a speedup ratio of 1.9x-3x in several models and datasets.
arXiv Detail & Related papers (2024-10-27T15:53:49Z) - Interchangeable Token Embeddings for Extendable Vocabulary and Alpha-Equivalence [6.991281327290525]
We propose a novel approach for learning interchangeable tokens in language models.
Our method is designed to address alpha-equivalence, the principle that renaming bound variables in a syntactic expression preserves semantics.
arXiv Detail & Related papers (2024-10-22T16:34:36Z) - Team Ryu's Submission to SIGMORPHON 2024 Shared Task on Subword Tokenization [3.0023392750520883]
My submission explores whether morphological segmentation methods can be used as part of subword tokenizers.
The prediction results show that morphological segmentation could be as effective as commonly used subword tokenizers.
A tokenizer with a balanced token frequency distribution tends to work better.
arXiv Detail & Related papers (2024-10-19T04:06:09Z) - From Tokens to Words: On the Inner Lexicon of LLMs [7.148628740938674]
Natural language is composed of words, but modern LLMs process sub-words as input.
We present evidence that LLMs engage in an intrinsic detokenization process, where sub-word sequences are combined into coherent word representations.
Our findings suggest that LLMs maintain a latent vocabulary beyond the tokenizer's scope.
arXiv Detail & Related papers (2024-10-08T09:53:35Z) - CUTE: Measuring LLMs' Understanding of Their Tokens [54.70665106141121]
Large Language Models (LLMs) show remarkable performance on a wide variety of tasks.
This raises the question: To what extent can LLMs learn orthographic information?
We propose a new benchmark, which features a collection of tasks designed to test the orthographic knowledge of LLMs.
arXiv Detail & Related papers (2024-09-23T18:27:03Z) - A Text is Worth Several Tokens: Text Embedding from LLMs Secretly Aligns Well with The Key Tokens [20.37803751979975]
When a text is fed into an embedding model, the resulting text embedding aligns with the key tokens in the input text.
We show that this phenomenon is universal and is not affected by model architecture, training strategy, or embedding method.
By adjusting the first principal component, we can align text embedding with the key tokens.
arXiv Detail & Related papers (2024-06-25T08:55:12Z) - Implicit Multimodal Alignment: On the Generalization of Frozen LLMs to Multimodal Inputs [63.29737699997859]
Large Language Models (LLMs) have demonstrated impressive performance on multimodal tasks, without any multimodal finetuning.
In this work, we expose frozen LLMs to image, video, audio and text inputs and analyse their internal representation.
arXiv Detail & Related papers (2024-05-26T21:31:59Z) - Fantastic Semantics and Where to Find Them: Investigating Which Layers of Generative LLMs Reflect Lexical Semantics [50.982315553104975]
We investigate the bottom-up evolution of lexical semantics for a popular large language model, namely Llama2.
Our experiments show that the representations in lower layers encode lexical semantics, while the higher layers, with weaker semantic induction, are responsible for prediction.
This is in contrast to models with discriminative objectives, such as masked language modeling, where the higher layers obtain better lexical semantics.
arXiv Detail & Related papers (2024-03-03T13:14:47Z) - More Than Words: Collocation Tokenization for Latent Dirichlet Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the models differ.
We show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of unmerged models.
arXiv Detail & Related papers (2021-08-24T14:08:19Z)