Implicit Optimization Bias of Next-Token Prediction in Linear Models
- URL: http://arxiv.org/abs/2402.18551v2
- Date: Thu, 31 Oct 2024 17:01:45 GMT
- Title: Implicit Optimization Bias of Next-Token Prediction in Linear Models
- Authors: Christos Thrampoulidis
- Abstract summary: Next-token prediction (NTP) is the dominant training paradigm for modern language models.
We study the structural properties of the solutions selected by gradient-based optimizers among the many minimizers of the NTP objective.
- Score: 32.2896512612788
- Abstract: We initiate an investigation into the optimization properties of next-token prediction (NTP), the dominant training paradigm for modern language models. Specifically, we study the structural properties of the solutions selected by gradient-based optimizers among the many possible minimizers of the NTP objective. By framing NTP as cross-entropy minimization across distinct contexts, each tied with a sparse conditional probability distribution across a finite vocabulary of tokens, we introduce "NTP-separability conditions" that enable reaching the data-entropy lower bound. With this setup, and focusing on linear models with fixed context embeddings, we characterize the optimization bias of gradient descent (GD): Within the data subspace defined by the sparsity patterns of distinct contexts, GD selects parameters that equate the logits' differences of in-support tokens to their log-odds. In the orthogonal subspace, the GD parameters diverge in norm and select the direction that maximizes a margin specific to NTP. These findings extend previous research on implicit bias in one-hot classification to the NTP setting, highlighting key differences and prompting further research into the optimization and generalization properties of NTP, irrespective of the specific architecture used to generate the context embeddings.
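To make the setup described in the abstract concrete, below is a minimal NumPy sketch (not the paper's code): a linear decoder over fixed, randomly drawn context embeddings is trained by plain gradient descent on the NTP cross-entropy, where each context carries a sparse next-token distribution. All sizes and names (V, d, E, P, W) are illustrative assumptions. On this toy problem one can check the in-support part of the characterization: the differences of logits between in-support tokens approach the log-odds of their conditional probabilities.
```python
# Minimal sketch (illustrative, not the paper's code) of the NTP setup:
# a linear decoder with fixed context embeddings trained by gradient descent
# on cross-entropy over contexts with sparse next-token distributions.
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 8          # vocabulary size, embedding dimension (illustrative)
m = 3                # number of distinct contexts

# Fixed context embeddings (one row per context) and sparse conditional
# next-token distributions (rows of P sum to 1, with small supports).
E = rng.normal(size=(m, d))
P = np.zeros((m, V))
P[0, [0, 1]] = [0.7, 0.3]          # context 0 supports tokens {0, 1}
P[1, [2, 3, 4]] = [0.5, 0.3, 0.2]  # context 1 supports tokens {2, 3, 4}
P[2, 5] = 1.0                      # context 2 is deterministic (one-hot)

W = np.zeros((V, d))  # linear decoder; logits for context k are W @ E[k]
lr = 0.1
for _ in range(100_000):
    logits = E @ W.T                                   # shape (m, V)
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)          # softmax per context
    W -= lr * ((probs - P).T @ E) / m                  # GD on average cross-entropy

# In-support check: logit differences should approach the log-odds.
logits = E @ W.T
for k in range(m):
    support = np.flatnonzero(P[k])
    ref = support[0]
    for t in support[1:]:
        print(f"context {k}, tokens ({t},{ref}): "
              f"logit diff = {logits[k, t] - logits[k, ref]:.3f}, "
              f"log-odds = {np.log(P[k, t] / P[k, ref]):.3f}")
```
As the number of steps grows, the parameter norm keeps increasing and the logits of out-of-support tokens fall further below those of in-support tokens, the toy-scale counterpart of the divergent, margin-maximizing component described in the abstract.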
Related papers
- Preference Alignment Improves Language Model-Based TTS [76.70693823683091]
Preference alignment algorithms adjust LMs to align with the preferences of reward models, enhancing the desirability of the generated content.
With a 1.15B parameter LM-based TTS model, we demonstrate that preference alignment consistently improves intelligibility, speaker similarity, and proxy subjective evaluation scores.
arXiv Detail & Related papers (2024-09-19T01:58:19Z) - Implicit Geometry of Next-token Prediction: From Language Sparsity Patterns to Model Representations [24.211603400355756]
Next-token prediction (NTP) over large text corpora has become the go-to paradigm to train large language models.
We look at how NTP influences the mapping of linguistic patterns to geometric properties of the resulting model representations.
We validate our findings on synthetic and small-scale real language datasets.
arXiv Detail & Related papers (2024-08-27T21:46:47Z) - RoPINN: Region Optimized Physics-Informed Neural Networks [66.38369833561039]
Physics-informed neural networks (PINNs) have been widely applied to solve partial differential equations (PDEs).
This paper proposes and theoretically studies a new training paradigm as region optimization.
A practical training algorithm, Region Optimized PINN (RoPINN), is seamlessly derived from this new paradigm.
arXiv Detail & Related papers (2024-05-23T09:45:57Z) - Functional Graphical Models: Structure Enables Offline Data-Driven Optimization [111.28605744661638]
We show how structure can enable sample-efficient data-driven optimization.
We also present a data-driven optimization algorithm that infers the FGM structure itself.
arXiv Detail & Related papers (2024-01-08T22:33:14Z) - Exploiting Inferential Structure in Neural Processes [15.058161307401864]
Neural Processes (NPs) are appealing due to their ability to perform fast adaptation based on a context set.
We provide a framework that allows NPs' latent variable to be given a rich prior defined by a graphical model.
arXiv Detail & Related papers (2023-06-27T03:01:43Z) - Neural Processes with Stochastic Attention: Paying more attention to the context dataset [11.301294319986477]
Neural processes (NPs) aim to complete unseen data points based on a given context dataset.
We propose a stochastic attention mechanism for NPs to capture appropriate context information.
We empirically show that our approach substantially outperforms conventional NPs in various domains.
arXiv Detail & Related papers (2022-04-11T23:57:19Z) - Probabilistic Circuits for Variational Inference in Discrete Graphical Models [101.28528515775842]
Inference in discrete graphical models with variational methods is difficult.
Many sampling-based methods have been proposed for estimating the evidence lower bound (ELBO).
We propose a new approach that leverages the tractability of probabilistic circuit models, such as Sum-Product Networks (SPNs).
We show that selective-SPNs are suitable as an expressive variational distribution, and prove that when the log-density of the target model is a polynomial, the corresponding ELBO can be computed analytically.
arXiv Detail & Related papers (2020-10-22T05:04:38Z) - Learning Reasoning Strategies in End-to-End Differentiable Proving [50.9791149533921]
Conditional Theorem Provers learn an optimal rule-selection strategy via gradient-based optimisation.
We show that Conditional Theorem Provers are scalable and yield state-of-the-art results on the CLUTRR dataset.
arXiv Detail & Related papers (2020-07-13T16:22:14Z) - Deep connections between learning from limited labels & physical parameter estimation -- inspiration for regularization [0.0]
We show that explicit regularization of model parameters in PDE constrained optimization translates to regularization of the network output.
A hyperspectral imaging example shows that minimum prior information together with cross-validation for optimal regularization parameters boosts the segmentation accuracy.
arXiv Detail & Related papers (2020-03-17T19:33:50Z) - Supervised Learning for Non-Sequential Data: A Canonical Polyadic Decomposition Approach [85.12934750565971]
Efficient modelling of feature interactions underpins supervised learning for non-sequential tasks, but explicitly parameterizing every interaction of every order quickly becomes intractable.
To alleviate this issue, it has been proposed to implicitly represent the model parameters as a tensor.
For enhanced expressiveness, we generalize the framework to allow feature mapping to arbitrarily high-dimensional feature vectors.
arXiv Detail & Related papers (2020-01-27T22:38:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.