Token embeddings violate the manifold hypothesis
- URL: http://arxiv.org/abs/2504.01002v1
- Date: Tue, 01 Apr 2025 17:40:12 GMT
- Title: Token embeddings violate the manifold hypothesis
- Authors: Michael Robinson, Sourya Dey, Tony Chiang
- Abstract summary: We elucidate the structure of the token embeddings, the input domain for large language models. We present a generalized and statistically testable model where the neighborhood of each token splits into well-defined signal and noise dimensions.
- Score: 1.5621144215664768
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To fully understand the behavior of a large language model (LLM) requires understanding its input space. If this input space differs from our assumptions, our understanding of and conclusions about the LLM are likely flawed, regardless of its architecture. Here, we elucidate the structure of the token embeddings, the input domain for LLMs, both empirically and theoretically. We present a generalized and statistically testable model where the neighborhood of each token splits into well-defined signal and noise dimensions. This model is based on a generalization of a manifold called a fiber bundle, so we denote our hypothesis test as the "fiber bundle null." Failing to reject the null is uninformative, but rejecting it at a specific token indicates that the token has a statistically significant local structure, and so is of interest to us. By running our test over several open-source LLMs, each with unique token embeddings, we find that the null is frequently rejected, and so the token subspace is provably not a fiber bundle and hence also not a manifold. As a consequence of our findings, when an LLM is presented with two semantically equivalent prompts, and if one prompt contains a token implicated by our test, that prompt will likely exhibit more output variability, proportional to the local signal dimension of the token.
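The test itself is statistical, but the intuition is concrete: around each token, count how many other embeddings fall within radius r and fit the slope of log N(r) against log r; on a manifold (or fiber bundle) that slope is a stable local dimension, and tokens where it deviates sharply are the "implicated" ones. Below is a minimal sketch of that intuition, with a synthetic embedding matrix and a plain log-log slope standing in for the paper's actual statistic; the function names, planted anomalies, and thresholds are illustrative assumptions, not the authors' code.

```python
import numpy as np

def local_dimension(emb, idx, k=50):
    """Estimate the local dimension around one embedding from the slope of
    log(neighbor count) vs. log(radius): for points filling d local
    dimensions, N(r) ~ r^d, so the fitted slope approximates d."""
    dists = np.linalg.norm(emb - emb[idx], axis=1)
    r = np.sort(dists)[1:k + 1]          # distances to the k nearest neighbors
    counts = np.arange(1, k + 1)
    slope, _ = np.polyfit(np.log(r), np.log(counts), 1)
    return slope

rng = np.random.default_rng(0)
# Toy "token embeddings": most points lie near a 2-D sheet in R^16
# (2 signal dimensions + small noise), but a few tokens are planted with
# anomalous neighborhoods, mimicking tokens such a test would flag.
n, D = 2000, 16
emb = np.zeros((n, D))
emb[:, :2] = rng.normal(size=(n, 2))             # signal dimensions
emb[:, 2:] = 0.01 * rng.normal(size=(n, D - 2))  # noise dimensions
emb[:5] = rng.normal(size=(5, D))                # anomalous tokens

dims = np.array([local_dimension(emb, i) for i in range(50)])
bulk = np.median(dims)
print("median local dimension:", round(bulk, 2))          # ~2 for the sheet
print("flagged tokens:", np.where(np.abs(dims - bulk) > 2)[0])
```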
Related papers
- Simple Convergence Proof of Adam From a Sign-like Descent Perspective [58.89890024903816]
We show that Adam achieves the optimal rate of $\mathcal{O}(\frac{1}{T^{1/4}})$ rather than the previous $\mathcal{O}(\frac{\ln T}{T^{1/4}})$. Our theoretical analysis provides new insights into the role of momentum as a key factor ensuring convergence.
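The "sign-like" view is easy to see numerically: with bias correction, Adam's update direction m̂/√v̂ approaches g/|g|, i.e., sign(g), coordinate-wise once the moment estimates settle. A small sketch under toy assumptions (a fixed true gradient plus Gaussian noise; the hyperparameters are the common defaults, not values from the paper):

```python
import numpy as np

def adam_step(m, v, g, t, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update direction with bias correction."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return m, v, m_hat / (np.sqrt(v_hat) + eps)

rng = np.random.default_rng(0)
g_mean = rng.normal(size=10)            # "true" gradient direction
m = np.zeros(10); v = np.zeros(10)
for t in range(1, 201):                 # noisy stochastic gradients
    g = g_mean + 0.1 * rng.normal(size=10)
    m, v, step = adam_step(m, v, g, t)

# With small gradient noise, m/sqrt(v) ~ g/|g| elementwise, i.e. sign(g).
print("Adam direction:", np.round(step, 2))
print("sign(g_mean):  ", np.sign(g_mean))
```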
arXiv Detail & Related papers (2025-07-08T13:19:26Z) - Demystifying Singular Defects in Large Language Models [61.98878352956125]
In large language models (LLMs), the underlying causes of high-norm tokens remain largely unexplored. We provide both theoretical insights and empirical validation across a range of recent models. We showcase two practical applications of these findings: the improvement of quantization schemes and the design of LLM signatures.
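A first empirical probe of high-norm tokens needs nothing more than the embedding table: rank rows by norm and z-score them. The sketch below uses a random matrix with a few planted high-norm rows as a stand-in for a real model's table (which, with Hugging Face transformers, could come from model.get_input_embeddings().weight); the ids and threshold are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in embedding table; a real one would be pulled from the model.
W = rng.normal(size=(10_000, 512))
W[[13, 250, 4999]] *= 8.0                     # plant a few high-norm rows

norms = np.linalg.norm(W, axis=1)
z = (norms - norms.mean()) / norms.std()      # z-score each token's norm
top = np.argsort(z)[::-1][:5]
print("highest-norm token ids:", top)
print("their z-scores:        ", np.round(z[top], 1))
```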
arXiv Detail & Related papers (2025-02-10T20:09:16Z) - Forking Paths in Neural Text Generation [14.75166317633176]
We develop a novel approach to representing uncertainty dynamics across individual tokens of text generation. We use our method to analyze LLM responses on 7 different tasks across 4 domains. We find many examples of forking tokens, including surprising ones such as punctuation marks.
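The "forking token" idea can be illustrated by resampling: at each prefix, draw the next token many times and measure the entropy of the outcomes; a position with high entropy is a fork where generations diverge. A toy sampler stands in for an LLM here (sample_next and its fork after token 7 are invented for illustration):

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

def sample_next(prefix):
    """Toy stand-in for an LLM's sampler: most positions are deterministic,
    but the continuation genuinely forks after token 7."""
    if prefix and prefix[-1] == 7:
        return rng.choice([100, 200, 300])        # a forking point
    return (prefix[-1] + 1) % 50 if prefix else 0

def fork_score(prefix, n=200):
    """Resample the next token many times; entropy of the outcomes measures
    how strongly generation forks at this position."""
    counts = Counter(sample_next(prefix) for _ in range(n))
    p = np.array(list(counts.values())) / n
    return -np.sum(p * np.log2(p))

seq = [5, 6, 7, 100]
for i in range(1, len(seq)):
    print(f"after {seq[:i]}: fork entropy = {fork_score(seq[:i]):.2f} bits")
```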
arXiv Detail & Related papers (2024-12-10T22:57:57Z) - Statistical-Computational Trade-offs for Density Estimation [60.81548752871115]
We show that for a broad class of data structures, these bounds cannot be significantly improved.
This is a novel statistical-computational trade-off for density estimation.
arXiv Detail & Related papers (2024-10-30T15:03:33Z) - Non-Halting Queries: Exploiting Fixed Points in LLMs [4.091772241106195]
We introduce a new vulnerability that exploits fixed points in autoregressive models and use it to craft queries that never halt. We rigorously analyze the conditions under which the non-halting anomaly presents itself. We demonstrate non-halting queries in many experiments performed in base unaligned models.
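The attack exploits a fixed point of greedy decoding: once the recent context repeats, the deterministic continuation repeats forever and EOS is never emitted. Below is a minimal cycle detector over a toy deterministic "model" (toy_model, the EOS id 0, and the window size k are illustrative assumptions, not the paper's construction):

```python
def greedy_decode(next_token, prompt, eos, max_steps=1000):
    """Greedily decode until EOS, a step budget, or a detected cycle.
    A repeated last-k-token state under greedy decoding is a fixed point:
    from there, the continuation can never halt."""
    seq, seen, k = list(prompt), set(), 4
    for _ in range(max_steps):
        t = next_token(seq)
        seq.append(t)
        if t == eos:
            return seq, "halted"
        state = tuple(seq[-k:])
        if state in seen:
            return seq, "non-halting cycle detected"
        seen.add(state)
    return seq, "budget exhausted"

# Toy deterministic "model" that falls into a 2-cycle and never emits EOS=0.
def toy_model(seq):
    return 2 if seq[-1] == 1 else 1

out, status = greedy_decode(toy_model, prompt=[5, 1], eos=0)
print(status, "after", len(out), "tokens:", out[:10])
```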
arXiv Detail & Related papers (2024-10-08T18:38:32Z) - Where is the signal in tokenization space? [31.016041295876864]
Large Language Models (LLMs) are typically shipped with tokenizers that deterministically encode text into so-called canonical token sequences.
In this paper, we study non-canonical tokenizations.
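Non-canonical tokenizations are easy to enumerate over a toy vocabulary: every segmentation of the string into vocabulary pieces is a valid token sequence, even though the tokenizer would deterministically emit just one. The vocabulary below is invented for illustration:

```python
VOCAB = {"un", "believ", "able", "unbeliev", "believable", "a", "ble"}

def tokenizations(s):
    """Enumerate every way to segment s into vocabulary tokens.
    A real tokenizer picks one canonical sequence; the rest are
    non-canonical tokenizations of the same string."""
    if not s:
        return [[]]
    out = []
    for i in range(1, len(s) + 1):
        if s[:i] in VOCAB:
            out += [[s[:i]] + rest for rest in tokenizations(s[i:])]
    return out

for toks in tokenizations("unbelievable"):
    print(toks)
```

Summing a model's probabilities over all such sequences, rather than only the canonical one, is where this line of work looks for signal.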
arXiv Detail & Related papers (2024-08-16T05:56:10Z) - VeriFlow: Modeling Distributions for Neural Network Verification [4.3012765978447565]
Formal verification has emerged as a promising method to ensure the safety and reliability of neural networks.
We propose the VeriFlow architecture as a flow-based density model tailored to allow any verification approach to restrict its search to some data distribution of interest.
arXiv Detail & Related papers (2024-06-20T12:41:39Z) - A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners [58.15511660018742]
This study introduces a hypothesis-testing framework to assess whether large language models (LLMs) possess genuine reasoning abilities.
We develop carefully controlled synthetic datasets, featuring conjunction fallacy and syllogistic problems.
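One hedged way to turn such paired items into a test: score each problem under the original and a token-perturbed phrasing, then run an exact McNemar test on the discordant pairs; under a no-token-bias null, flips in either direction are equally likely. The outcome counts below are fabricated solely to show the mechanics:

```python
from math import comb

def mcnemar_exact(b, c):
    """Exact McNemar test on discordant pairs: b = correct only on the
    original phrasing, c = correct only on the perturbed phrasing.
    Under the no-token-bias null, flips are symmetric (p = 1/2)."""
    n = b + c
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical paired outcomes on 40 conjunction-fallacy problems:
# (correct_on_original, correct_on_perturbed) for each item.
pairs = [(1, 1)] * 22 + [(1, 0)] * 13 + [(0, 1)] * 2 + [(0, 0)] * 3
b = sum(1 for a, p in pairs if a and not p)
c = sum(1 for a, p in pairs if not a and p)
print(f"discordant pairs: b={b}, c={c}, p-value={mcnemar_exact(b, c):.4f}")
```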
arXiv Detail & Related papers (2024-06-16T19:22:53Z) - To Believe or Not to Believe Your LLM [51.2579827761899]
We explore uncertainty quantification in large language models (LLMs).
We derive an information-theoretic metric that allows one to reliably detect when only epistemic uncertainty is large.
We conduct a series of experiments which demonstrate the advantage of our formulation.
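A toy rendering of the iterative-prompting intuition behind this (not the paper's metric): if repeatedly feeding a candidate answer back into the prompt drags the response distribution toward it, the uncertainty is epistemic; if the distribution is insensitive, it is aleatoric. Both "models" below are invented stand-ins:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

def entropy(samples):
    p = np.array(list(Counter(samples).values()), dtype=float)
    p /= p.sum()
    return -np.sum(p * np.log2(p))

# Toy "models": each returns an answer, optionally conditioned on a hint
# (a previous answer inserted back into the prompt, i.e. iterative prompting).
def aleatoric_model(hint=None):
    # Genuinely multi-answer question: the distribution ignores the hint.
    return rng.choice(["Paris", "Lyon"], p=[0.5, 0.5])

def epistemic_model(hint=None):
    # Ignorant model: a repeated hint drags its answer distribution.
    if hint is not None and rng.random() < 0.8:
        return hint
    return rng.choice(["Paris", "Lyon"], p=[0.5, 0.5])

for name, model in [("aleatoric", aleatoric_model), ("epistemic", epistemic_model)]:
    base = [model() for _ in range(500)]
    pinned = [model(hint="Lyon") for _ in range(500)]
    shift = entropy(base) - entropy(pinned)
    print(f"{name}: entropy drop under iterative prompting = {shift:.2f} bits")
```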
arXiv Detail & Related papers (2024-06-04T17:58:18Z) - The Sample Complexity of Simple Binary Hypothesis Testing [7.127829790714167]
The sample complexity of simple binary hypothesis testing is the smallest number of i.i.d. samples required to distinguish between two distributions $p$ and $q$ in either setting. We derive a formula that characterizes the sample complexity (up to multiplicative constants that are independent of $p$, $q$, and all error parameters) for: (i) all $0 \le \alpha, \beta \le 1/8$ in the prior-free setting; and (ii) all $\delta \le \pi/4$ in the Bayesian setting.
arXiv Detail & Related papers (2024-03-25T17:42:32Z) - Fact-Checking the Output of Large Language Models via Token-Level Uncertainty Quantification [116.77055746066375]
Large language models (LLMs) are notorious for hallucinating, i.e., producing erroneous claims in their output.
We propose a novel fact-checking and hallucination detection pipeline based on token-level uncertainty quantification.
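A simplified per-token confidence pass illustrates the first step of such a pipeline (this is plain token entropy, not the paper's claim-conditioned measure): score each emitted token by its probability and entropy, and flag low-confidence spans for fact-checking. The logits and the example claim are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Toy per-step logits for a 6-token generated claim (vocab of 1000).
# In practice these come from the LLM's forward pass over its own output.
logits = rng.normal(size=(6, 1000))
logits[[0, 1, 2, 3], rng.integers(0, 1000, 4)] += 9.0  # confident steps
tokens = ["Einstein", "was", "born", "in", "1878", "."]

probs = softmax(logits)
chosen = probs.max(axis=-1)                  # probability of the emitted token
entropy = -(probs * np.log(probs)).sum(axis=-1)

for tok, p, h in zip(tokens, chosen, entropy):
    flag = "  <-- low confidence, check this span" if p < 0.5 else ""
    print(f"{tok:10s} p={p:.2f} H={h:5.2f}{flag}")
```

In this contrived run the uncertain factual detail (the year) is exactly what gets flagged for verification.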
arXiv Detail & Related papers (2024-03-07T17:44:17Z) - Improved Hardness Results for Learning Intersections of Halfspaces [2.1393480341554736]
We show strong (and surprisingly simple) lower bounds for weakly learning intersections of halfspaces in the improper setting.
We significantly narrow this gap by showing that even learning $\omega(\log \log N)$ halfspaces in dimension $N$ takes super-polynomial time.
Specifically, we show that learning intersections of $k$ halfspaces in dimension $N$ requires accuracy $N^{-\Omega(k)}$ or exponentially many queries.
arXiv Detail & Related papers (2024-02-25T05:26:35Z) - Sharp Rates in Dependent Learning Theory: Avoiding Sample Size Deflation for the Square Loss [33.18537822803389]
We show that whenever the topologies of $L^2$ and $\Psi_p$ are comparable on our hypothesis class $\mathscr{F}$, $\mathscr{F}$ is a weakly sub-Gaussian class. Our result holds whether the problem is realizable or not, and we refer to this as a near mixing-free rate, since direct dependence on mixing is relegated to an additive higher-order term.
arXiv Detail & Related papers (2024-02-08T18:57:42Z) - Sample-Optimal Locally Private Hypothesis Selection and the Provable
Benefits of Interactivity [8.100854060749212]
We study the problem of hypothesis selection under the constraint of local differential privacy.
We devise an $\varepsilon$-locally-differentially-private ($\varepsilon$-LDP) algorithm that uses $\Theta\left(\frac{k \log k}{\alpha^2 \min\{\varepsilon^2, 1\}}\right)$ samples to guarantee that $d_{\mathrm{TV}}(h, \hat{f}) \leq \alpha + 9 \min_{f \in \mathcal{F}} d_{\mathrm{TV}}(h, f)$ with high probability.
arXiv Detail & Related papers (2023-12-09T19:22:10Z) - Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval [139.21955930418815]
Cross-modal Retrieval methods build similarity relations between vision and language modalities by jointly learning a common representation space.
However, the predictions are often unreliable due to aleatoric uncertainty, which is induced by low-quality data, e.g., corrupt images, fast-paced videos, and non-detailed texts.
We propose a novel Prototype-based Aleatoric Uncertainty Quantification (PAU) framework to provide trustworthy predictions by quantifying the uncertainty arising from inherent data ambiguity.
arXiv Detail & Related papers (2023-09-29T09:41:19Z) - Effective Minkowski Dimension of Deep Nonparametric Regression: Function Approximation and Statistical Theories [70.90012822736988]
Existing theories on deep nonparametric regression have shown that when the input data lie on a low-dimensional manifold, deep neural networks can adapt to intrinsic data structures.
This paper introduces a relaxed assumption that input data are concentrated around a subset of $\mathbb{R}^d$ denoted by $\mathcal{S}$, and that the intrinsic dimension of $\mathcal{S}$ can be characterized by a new complexity notion: the effective Minkowski dimension.
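The (Minkowski) box-counting dimension is directly computable for a point cloud: count occupied grid cells N(ε) across scales and fit the slope of log N(ε) against log(1/ε). A sketch on data concentrated near a one-dimensional curve in R^3 (the helix and the scale grid are illustrative choices; the paper's "effective" variant additionally discounts regions of negligible measure):

```python
import numpy as np

def box_counting_dimension(X, scales):
    """Estimate the Minkowski (box-counting) dimension of a point cloud:
    count occupied grid cells N(eps) at each scale and fit the slope of
    log N(eps) against log(1/eps)."""
    counts = []
    for eps in scales:
        cells = np.unique(np.floor(X / eps), axis=0)
        counts.append(len(cells))
    slope, _ = np.polyfit(np.log(1.0 / scales), np.log(counts), 1)
    return slope

rng = np.random.default_rng(0)
t = rng.uniform(0, 4 * np.pi, 20_000)
# Points concentrated around a 1-D curve (a helix) embedded in R^3.
X = np.c_[np.cos(t), np.sin(t), 0.1 * t] + 0.001 * rng.normal(size=(20_000, 3))

scales = np.array([0.4, 0.2, 0.1, 0.05])
print(f"estimated dimension: {box_counting_dimension(X, scales):.2f}")  # ~1
```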
arXiv Detail & Related papers (2023-06-26T17:13:31Z) - On Learning Latent Models with Multi-Instance Weak Supervision [57.18649648182171]
We consider a weakly supervised learning scenario where the supervision signal is generated by applying a transition function $\sigma$ to the labels associated with multiple input instances.
Our problem is met in different fields, including latent structural learning and neuro-symbolic integration.
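The setting is concrete in the classic multi-instance case, where the transition function $\sigma$ is logical OR over hidden per-instance labels and the learner sees only $\sigma$'s output per bag. A minimal data generator under invented assumptions (Gaussian features, threshold 1.5):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigma(labels):
    """Transition function: the learner only sees sigma(hidden labels).
    Here sigma is logical OR, the classic multi-instance setting; other
    choices (parity, count thresholds) give other weak-supervision problems."""
    return int(any(labels))

# Hidden per-instance labels: an instance is positive iff its feature > 1.5.
def make_bag(n_instances=5):
    x = rng.normal(size=n_instances)
    hidden = (x > 1.5).astype(int)
    return x, sigma(hidden)            # features + weak bag-level signal only

bags = [make_bag() for _ in range(1000)]
pos = sum(y for _, y in bags)
print(f"{pos}/1000 bags weakly labeled positive; instance labels stay latent")
```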
arXiv Detail & Related papers (2023-06-23T22:05:08Z) - The Sample Complexity of Robust Covariance Testing [56.98280399449707]
We are given i.i.d. samples from a distribution of the form $Z = (1-\epsilon) X + \epsilon B$, where $X$ is a zero-mean Gaussian $\mathcal{N}(0, \Sigma)$ with unknown covariance $\Sigma$.
In the absence of contamination, prior work gave a simple tester for this hypothesis testing task that uses $O(d)$ samples.
We prove a sample complexity lower bound of $\Omega(d^2)$ for $\epsilon$ an arbitrarily small constant and constant $\gamma$.
arXiv Detail & Related papers (2020-12-31T18:24:41Z) - Hardness of Learning Halfspaces with Massart Noise [56.98280399449707]
We study the complexity of PAC learning halfspaces in the presence of Massart (bounded) noise.
We show that there is an exponential gap between the information-theoretically optimal error and the best error that can be achieved by an SQ algorithm.
arXiv Detail & Related papers (2020-12-17T16:43:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.