Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs
- URL: http://arxiv.org/abs/2406.20086v3
- Date: Fri, 11 Oct 2024 16:20:16 GMT
- Title: Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs
- Authors: Sheridan Feucht, David Atkinson, Byron Wallace, David Bau
- Abstract summary: Llama-2-7b's tokenizer splits the word "northeastern" into the tokens ['_n', 'ort', 'he', 'astern'], none of which correspond to semantically meaningful units like "north" or "east".
In this work, we find that last token representations of named entities and multi-token words exhibit a pronounced "erasure" effect, where information about previous and current tokens is rapidly forgotten in early layers.
- Score: 20.1025293763531
- Abstract: LLMs process text as sequences of tokens that roughly correspond to words, where less common words are represented by multiple tokens. However, individual tokens are often semantically unrelated to the meanings of the words/concepts they comprise. For example, Llama-2-7b's tokenizer splits the word "northeastern" into the tokens ['_n', 'ort', 'he', 'astern'], none of which correspond to semantically meaningful units like "north" or "east." Similarly, the overall meanings of named entities like "Neil Young" and multi-word expressions like "break a leg" cannot be directly inferred from their constituent tokens. Mechanistically, how do LLMs convert such arbitrary groups of tokens into useful higher-level representations? In this work, we find that last token representations of named entities and multi-token words exhibit a pronounced "erasure" effect, where information about previous and current tokens is rapidly forgotten in early layers. Using this observation, we propose a method to "read out" the implicit vocabulary of an autoregressive LLM by examining differences in token representations across layers, and present results of this method for Llama-2-7b and Llama-3-8B. To our knowledge, this is the first attempt to probe the implicit vocabulary of an LLM.
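The probing idea in the abstract, comparing a word's last-token hidden state across layers to detect early-layer "erasure", can be illustrated with a minimal sketch. It assumes the Hugging Face transformers API and the gated meta-llama/Llama-2-7b-hf checkpoint (any causal LM id would do for illustration), and it uses cosine similarity to the embedding-layer state as a stand-in drift measure, not the paper's exact scoring.

```python
# Minimal sketch (assumed setup, not the authors' exact procedure):
# tokenize a multi-token word, expose all hidden states, and track how far
# the last-token representation drifts from its embedding-layer value.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL = "meta-llama/Llama-2-7b-hf"  # gated on the Hub; swap in any causal LM for a quick test
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

word = "northeastern"
ids = tok(word, return_tensors="pt", add_special_tokens=False)
print(tok.convert_ids_to_tokens(ids["input_ids"][0].tolist()))  # e.g. ['_n', 'ort', 'he', 'astern']

with torch.no_grad():
    out = model(**ids, output_hidden_states=True)

# hidden_states: tuple of (num_layers + 1) tensors, each [batch, seq_len, hidden]
hs = torch.stack(out.hidden_states).squeeze(1)  # [layers+1, seq_len, hidden]
last = hs[:, -1, :]                             # last-token representation at every layer

# Cosine similarity of each layer's last-token state to the embedding-layer state:
# a sharp drop in the early layers is read here as the token's surface identity
# being "erased" in favor of a higher-level word representation.
sim_to_layer0 = F.cosine_similarity(last, last[0].unsqueeze(0), dim=-1)
for layer, s in enumerate(sim_to_layer0.tolist()):
    print(f"layer {layer:2d}: cos-sim to layer 0 = {s:.3f}")
```

In this sketch, an early, steep similarity drop for the final subword of "northeastern" would be the erasure footprint described above; the paper's actual readout examines such layer-wise representation differences to recover an implicit vocabulary for Llama-2-7b and Llama-3-8B.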
Related papers
- Tomato, Tomahto, Tomate: Measuring the Role of Shared Semantics among Subwords in Multilingual Language Models [88.07940818022468]
We take an initial step toward measuring the role of shared semantics among subwords in encoder-only multilingual language models (mLMs).
We form "semantic tokens" by merging semantically similar subwords and their embeddings.
Inspections of the grouped subwords show that they exhibit a wide range of semantic similarities.
arXiv Detail & Related papers (2024-11-07T08:38:32Z) - FIRP: Faster LLM inference via future intermediate representation prediction [54.897493351694195]
FIRP generates multiple tokens instead of one at each decoding step.
We conduct extensive experiments, showing a speedup ratio of 1.9x-3x in several models and datasets.
arXiv Detail & Related papers (2024-10-27T15:53:49Z) - Interchangeable Token Embeddings for Extendable Vocabulary and Alpha-Equivalence [6.991281327290525]
We propose a novel approach for learning interchangeable tokens in language models.
Our method is designed to address alpha-equivalence, the principle that renaming bound variables in a syntactic expression preserves semantics.
arXiv Detail & Related papers (2024-10-22T16:34:36Z) - Team Ryu's Submission to SIGMORPHON 2024 Shared Task on Subword Tokenization [3.0023392750520883]
My submission explores whether morphological segmentation methods can be used as part of subword tokenizers.
The prediction results show that morphological segmentation could be as effective as commonly used subword tokenizers.
A tokenizer with a balanced token frequency distribution tends to work better.
arXiv Detail & Related papers (2024-10-19T04:06:09Z) - From Tokens to Words: On the Inner Lexicon of LLMs [7.148628740938674]
Natural language is composed of words, but modern LLMs process sub-words as input.
We present evidence that LLMs engage in an intrinsic detokenization process, where sub-word sequences are combined into coherent word representations.
Our findings suggest that LLMs maintain a latent vocabulary beyond the tokenizer's scope.
arXiv Detail & Related papers (2024-10-08T09:53:35Z) - CUTE: Measuring LLMs' Understanding of Their Tokens [54.70665106141121]
Large Language Models (LLMs) show remarkable performance on a wide variety of tasks.
This raises the question: To what extent can LLMs learn orthographic information?
We propose a new benchmark, which features a collection of tasks designed to test the orthographic knowledge of LLMs.
arXiv Detail & Related papers (2024-09-23T18:27:03Z) - A Text is Worth Several Tokens: Text Embedding from LLMs Secretly Aligns Well with The Key Tokens [20.37803751979975]
When a text is fed into an embedding model, the resulting text embedding aligns with the key tokens in the input text.
We show that this phenomenon is universal and is not affected by model architecture, training strategy, or embedding method.
By adjusting the first principal component, we can align text embedding with the key tokens.
arXiv Detail & Related papers (2024-06-25T08:55:12Z) - Implicit Multimodal Alignment: On the Generalization of Frozen LLMs to Multimodal Inputs [63.29737699997859]
Large Language Models (LLMs) have demonstrated impressive performance on multimodal tasks, without any multimodal finetuning.
In this work, we expose frozen LLMs to image, video, audio and text inputs and analyse their internal representation.
arXiv Detail & Related papers (2024-05-26T21:31:59Z) - Fantastic Semantics and Where to Find Them: Investigating Which Layers of Generative LLMs Reflect Lexical Semantics [50.982315553104975]
We investigate the bottom-up evolution of lexical semantics for a popular large language model, namely Llama2.
Our experiments show that the representations in lower layers encode lexical semantics, while the higher layers, with weaker semantic induction, are responsible for prediction.
This is in contrast to models with discriminative objectives, such as masked language modeling, where the higher layers obtain better lexical semantics.
arXiv Detail & Related papers (2024-03-03T13:14:47Z) - More Than Words: Collocation Tokenization for Latent Dirichlet Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the models differ.
We show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of unmerged models.
arXiv Detail & Related papers (2021-08-24T14:08:19Z)