All or None: Identifiable Linear Properties of Next-token Predictors in Language Modeling
- URL: http://arxiv.org/abs/2410.23501v2
- Date: Sat, 15 Mar 2025 10:30:45 GMT
- Title: All or None: Identifiable Linear Properties of Next-token Predictors in Language Modeling
- Authors: Emanuele Marconato, Sébastien Lachapelle, Sebastian Weichwald, Luigi Gresele
- Abstract summary: We analyze identifiability as a possible explanation for the ubiquity of linear properties across language models. We show that under suitable conditions, these linear properties either hold in all or in none of the distribution-equivalent next-token predictors.
- Score: 7.334847424898197
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We analyze identifiability as a possible explanation for the ubiquity of linear properties across language models, such as the vector difference between the representations of "easy" and "easiest" being parallel to that between "lucky" and "luckiest". For this, we ask whether finding a linear property in one model implies that any model that induces the same distribution has that property, too. To answer that, we first prove an identifiability result to characterize distribution-equivalent next-token predictors, lifting a diversity requirement of previous results. Second, based on a refinement of relational linearity [Paccanaro and Hinton, 2001; Hernandez et al., 2024], we show how many notions of linearity are amenable to our analysis. Finally, we show that under suitable conditions, these linear properties either hold in all or in none of the distribution-equivalent next-token predictors.
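To make the parallelism claim concrete, here is a minimal sketch, assuming hypothetical embedding vectors (NumPy only), of checking whether the "easy"→"easiest" offset is parallel to the "lucky"→"luckiest" offset via cosine similarity. In a real model the four vectors would come from token or hidden-state representations; a cosine near 1.0 is the linear property in question.

```python
import numpy as np

def parallelism(e_a: np.ndarray, e_b: np.ndarray,
                e_c: np.ndarray, e_d: np.ndarray) -> float:
    """Cosine similarity between the difference vectors (e_b - e_a)
    and (e_d - e_c); values near 1.0 indicate near-parallel offsets."""
    u, v = e_b - e_a, e_d - e_c
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 3-d embeddings standing in for real model representations.
easy, easiest = np.array([1.0, 0.2, 0.0]), np.array([1.1, 0.2, 0.9])
lucky, luckiest = np.array([0.3, 1.0, 0.1]), np.array([0.4, 1.0, 1.0])
print(parallelism(easy, easiest, lucky, luckiest))  # close to 1.0 here
```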
Related papers
- Logit Distance Bounds Representational Similarity [18.79873056204737]
We study a distributional distance based on logit differences and show that closeness in this distance does yield linear similarity guarantees. We further show that, when model probabilities are bounded away from zero, KL divergence upper-bounds logit distance; yet the resulting bound fails to provide nontrivial control in practice.
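A toy numeric sketch of the two quantities being compared; the paper's exact definition of logit distance may differ, so the mean-centered l2 form below (which equals the l2 distance between mean-centered logits) is an assumption.

```python
import numpy as np

def logit_distance(p: np.ndarray, q: np.ndarray) -> float:
    # Assumed form: l2 norm of mean-centered log-prob differences,
    # which equals the distance between mean-centered logit vectors.
    d = np.log(p) - np.log(q)
    return float(np.linalg.norm(d - d.mean()))

def kl(p: np.ndarray, q: np.ndarray) -> float:
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.5, 0.3, 0.2])   # probabilities bounded away from zero
q = np.array([0.4, 0.35, 0.25])
print(logit_distance(p, q), kl(p, q))
```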
arXiv Detail & Related papers (2026-02-17T09:00:56Z)
- I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data? [79.01538178959726]
Large language models (LLMs) have led many to conclude that they exhibit a form of intelligence.
We introduce a novel generative model that generates tokens on the basis of human interpretable concepts represented as latent discrete variables.
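A minimal stand-in for the kind of generative model described, with tokens drawn from per-concept distributions over a vocabulary; the concept dynamics, dimensions, and parameterization below are hypothetical, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
n_concepts, vocab = 4, 10

# Hypothetical stand-in: each discrete latent concept induces its own
# next-token distribution; a sequence mixes concepts over time.
concept_token_probs = rng.dirichlet(np.ones(vocab), size=n_concepts)

def sample_tokens(length: int) -> list[int]:
    tokens = []
    for _ in range(length):
        z = rng.integers(n_concepts)  # latent discrete concept
        tokens.append(int(rng.choice(vocab, p=concept_token_probs[z])))
    return tokens

print(sample_tokens(8))
```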
arXiv Detail & Related papers (2025-03-12T01:21:17Z)
- On the Origins of Linear Representations in Large Language Models [51.88404605700344]
We introduce a simple latent variable model to formalize the concept dynamics of the next token prediction.
Experiments show that linear representations emerge when learning from data matching the latent variable model.
We additionally confirm some predictions of the theory using the LLaMA-2 large language model.
arXiv Detail & Related papers (2024-03-06T17:17:36Z)
- Curvature-informed multi-task learning for graph networks [56.155331323304]
State-of-the-art graph neural networks attempt to predict multiple properties simultaneously, yet joint training often underperforms.
We investigate a potential explanation: the curvature of each property's loss surface varies significantly, leading to inefficient learning.
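A rough sketch of the intuition: a second-difference estimate of directional curvature, applied to two toy task losses whose curvature scales differ. The finite-difference probe is an assumption for illustration, not the paper's method.

```python
import numpy as np

def directional_curvature(loss, params: np.ndarray, eps: float = 1e-3) -> float:
    """Second-difference estimate of loss curvature along a random direction;
    comparing this across tasks is one rough proxy for per-task curvature."""
    d = np.random.default_rng(0).standard_normal(params.shape)
    d /= np.linalg.norm(d)
    return (loss(params + eps * d) - 2 * loss(params)
            + loss(params - eps * d)) / eps**2

# Toy per-task quadratic losses with very different curvature scales.
sharp = lambda w: 100.0 * np.sum(w**2)   # high-curvature task
flat  = lambda w: 0.01 * np.sum(w**2)    # low-curvature task
w0 = np.ones(5)
print(directional_curvature(sharp, w0), directional_curvature(flat, w0))
```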
arXiv Detail & Related papers (2022-08-02T18:18:41Z)
- Predicting Out-of-Domain Generalization with Neighborhood Invariance [59.05399533508682]
We propose a measure of a classifier's output invariance in a local transformation neighborhood.
Our measure is simple to calculate, does not depend on the test point's true label, and can be applied even in out-of-domain (OOD) settings.
In experiments on benchmarks in image classification, sentiment analysis, and natural language inference, we demonstrate a strong and robust correlation between our measure and actual OOD generalization.
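A minimal sketch of such a label-free invariance score, assuming a generic `predict` function and a sampled transformation neighborhood (both hypothetical stand-ins for the paper's setup).

```python
import numpy as np

def neighborhood_invariance(predict, x: np.ndarray, transform, n: int = 32) -> float:
    """Fraction of transformed copies of x that keep the original predicted
    label -- a label-free invariance score in the spirit of the paper."""
    base = predict(x)
    rng = np.random.default_rng(0)
    hits = sum(predict(transform(x, rng)) == base for _ in range(n))
    return hits / n

# Toy classifier and a small additive-noise "transformation neighborhood".
predict = lambda x: int(x.sum() > 0)
noise = lambda x, rng: x + 0.1 * rng.standard_normal(x.shape)
print(neighborhood_invariance(predict, np.array([0.5, -0.2]), noise))
```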
arXiv Detail & Related papers (2022-07-05T14:55:16Z)
- Rationales for Sequential Predictions [117.93025782838123]
Sequence models are a critical component of modern NLP systems, but their predictions are difficult to explain.
We consider model explanations through rationales, subsets of context that can explain individual model predictions.
We propose an efficient greedy algorithm to approximate the resulting combinatorial objective.
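A sketch of a greedy approximation in this spirit; the `score` callback stands in for the paper's objective and is an assumption, not their exact formulation.

```python
def greedy_rationale(score, context: list[str], budget: int) -> list[int]:
    """Greedily add the context position that most increases `score` on the
    current subset; `score` might be the model probability of the predicted
    token given only the selected context (an assumed surrogate objective)."""
    selected: list[int] = []
    for _ in range(budget):
        remaining = [i for i in range(len(context)) if i not in selected]
        best = max(remaining, key=lambda i: score(selected + [i]))
        selected.append(best)
    return selected

# Toy score that prefers positions 1 and 3 of the context.
toy_score = lambda idxs: sum(1.0 if i in (1, 3) else 0.1 for i in idxs)
print(greedy_rationale(toy_score, ["the", "cat", "sat", "down"], budget=2))
```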
arXiv Detail & Related papers (2021-09-14T01:25:15Z)
- Performance of Bayesian linear regression in a model with mismatch [8.60118148262922]
We analyze the performance of an estimator given by the mean of a log-concave Bayesian posterior distribution with a Gaussian prior.
This inference model can be rephrased as a version of the Gardner model in spin glasses.
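For a Gaussian likelihood and zero-mean Gaussian prior, this posterior mean reduces to the familiar ridge estimator; a minimal well-specified sketch (the paper's mismatched setting is not reproduced here).

```python
import numpy as np

def posterior_mean(X: np.ndarray, y: np.ndarray,
                   prior_var: float, noise_var: float) -> np.ndarray:
    """Posterior mean of Bayesian linear regression with a zero-mean Gaussian
    prior and Gaussian noise: (X^T X + (noise_var/prior_var) I)^{-1} X^T y."""
    lam = noise_var / prior_var
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.standard_normal(50)
print(posterior_mean(X, y, prior_var=1.0, noise_var=0.01))
```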
arXiv Detail & Related papers (2021-07-14T18:50:13Z)
- Obstructing Classification via Projection [2.456909016197174]
We study a geometric problem which models a possible approach for bias removal.
A priori we assume that it is "easy" to classify the data according to each property.
Our goal is to obstruct the classification according to one property by a suitable projection to a lower-dimensional Euclidean space R^m.
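One toy instance of such an obstruction: project onto the hyperplane orthogonal to the property's separating direction, which makes a linear classifier along that direction useless. The construction below is illustrative, not the paper's optimal projection.

```python
import numpy as np

def obstructing_projection(w: np.ndarray) -> np.ndarray:
    """Projector onto the hyperplane orthogonal to w: a simple way to
    obstruct a linear classifier whose decision normal is w."""
    w = w / np.linalg.norm(w)
    return np.eye(len(w)) - np.outer(w, w)

w_property = np.array([1.0, 0.0, 0.0])  # direction separating the property
P = obstructing_projection(w_property)
x = np.array([2.0, 1.0, -1.0])
print(P @ x)  # the discriminative first coordinate is zeroed out
```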
arXiv Detail & Related papers (2021-05-19T10:28:15Z)
- Why do classifier accuracies show linear trends under distribution shift? [58.40438263312526]
Accuracies of models on one data distribution are approximately linear functions of their accuracies on another distribution.
We assume the probability that two models agree in their predictions is higher than what we can infer from their accuracy levels alone.
We show that a linear trend must occur when evaluating models on two distributions unless the size of the distribution shift is large.
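A toy simulation of the agreement intuition: if all models effectively rank examples by a shared difficulty, their accuracies on the original and shifted distributions fall near a single line. This construction is illustrative only, not the paper's analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
skills = rng.uniform(0.2, 0.8, size=20)   # one skill level per model
d1 = rng.uniform(0, 1, size=5000)         # difficulties, distribution 1
d2 = d1 + 0.1                             # shifted difficulties

# A model answers correctly iff its skill exceeds the example difficulty,
# so all models agree on which examples are "hard" -- high agreement.
acc1 = np.array([(s > d1).mean() for s in skills])
acc2 = np.array([(s > d2).mean() for s in skills])
slope, intercept = np.polyfit(acc1, acc2, 1)
print(slope, intercept)  # accuracies fall near a single line
```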
arXiv Detail & Related papers (2020-12-31T07:24:30Z)
- Learning Probabilistic Sentence Representations from Paraphrases [47.528336088976744]
We define probabilistic models that produce distributions for sentences.
We train our models on paraphrases and demonstrate that they naturally capture sentence specificity.
Our model captures sentential entailment and provides ways to analyze the specificity and preciseness of individual words.
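A hypothetical sketch of distributions-as-representations: each sentence as a diagonal Gaussian, with specificity read off the entropy (a more specific sentence gets lower variance). The parameterization is assumed for illustration and is not the paper's model.

```python
import numpy as np

def gaussian_entropy(log_var: np.ndarray) -> float:
    """Differential entropy of a diagonal Gaussian with given log-variances."""
    d = len(log_var)
    return 0.5 * (d * np.log(2 * np.pi * np.e) + log_var.sum())

vague    = np.full(4, 0.0)   # log-variances for a vague sentence
specific = np.full(4, -2.0)  # log-variances for a specific paraphrase
print(gaussian_entropy(vague) > gaussian_entropy(specific))  # True
```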
arXiv Detail & Related papers (2020-05-16T21:10:28Z)
- Linear predictor on linearly-generated data with missing values: non consistency and solutions [0.0]
We study the seemingly simple case where the target to predict is a linear function of the fully-observed data.
We show that, in the presence of missing values, the optimal predictor may not be linear.
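A worked toy example of why: with correlated features, the Bayes predictor under a missing feature is a different linear map than the complete-data one, so no single linear predictor over zero-imputed inputs is optimal across missingness patterns. The numbers below are illustrative, not the paper's general analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x1 = rng.standard_normal(n)
x2 = 0.8 * x1 + 0.6 * rng.standard_normal(n)  # corr(x1, x2) = 0.8
y = x1 + x2                                    # linear in the full data

# If x2 is missing, the Bayes predictor from x1 alone is
# E[y | x1] = x1 + E[x2 | x1] = 1.8 * x1, not the full-data coefficient 1.0,
# so one linear map cannot be optimal for both missingness patterns at once.
slope_when_x2_missing = np.polyfit(x1, y, 1)[0]
print(slope_when_x2_missing)  # approximately 1.8
```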
arXiv Detail & Related papers (2020-02-03T11:49:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.