Implicit Bias of Next-Token Prediction
- URL: http://arxiv.org/abs/2402.18551v1
- Date: Wed, 28 Feb 2024 18:34:53 GMT
- Title: Implicit Bias of Next-Token Prediction
- Authors: Christos Thrampoulidis
- Abstract summary: Next-token prediction (NTP) involves predicting the next token in a sequence.
This work frames NTP training as cross-entropy minimization over distinct empirical contexts.
- Score: 32.2896512612788
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Next-token prediction (NTP), the go-to training paradigm in training large
language models, involves predicting the next token in a sequence. Departing
from traditional one-hot classification, in NTP, multiple tokens with varying
frequencies follow each given context. This work frames NTP training as
cross-entropy minimization over distinct contexts, each associated with a
sparse empirical probability vector across a finite vocabulary. It then
addresses the following question: do gradient-based optimizers exhibit a bias
towards solutions with specific structure as the NTP training loss reaches its
lower bound (entropy)? Specifically, for linear NTP models trained using
gradient descent (GD), we make the following contributions: Firstly, we
determine NTP-separability conditions on the data, under which GD can attain
its lower bound. We also demonstrate that these conditions hold under
overparameterization. Secondly, we establish that the parameters of GD
projected onto an appropriate data subspace converge to the unique solution of
a system of linear equations, which requires the logits' difference of
in-support tokens to be equal to the log-ratio of their respective
probabilities. Meanwhile, on the orthogonal subspace, the parameters diverge
and converge in the direction of the solution of a max-margin quadratic
program, minimizing the Euclidean norm of parameters satisfying the
NTP-separability conditions. Akin to prior research on implicit bias of
one-hot classification, our work opens exciting avenues for future research
that can lead to better understanding optimization, generalization and
robustness principles of models trained with NTP.
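A minimal worked formulation consistent with the abstract (the notation here is illustrative, not necessarily the paper's own): index the distinct contexts by $j$, with empirical frequency $\hat\pi_j$, context embedding $\bar h_j$, sparse empirical next-token distribution $\hat p_j$ supported on $\mathcal{S}_j \subseteq V$, and a linear decoder $W$. The NTP training loss is then the weighted cross-entropy
$$\mathcal{L}(W) \;=\; -\sum_{j} \hat\pi_j \sum_{z \in \mathcal{S}_j} \hat p_{j,z}\, \log \frac{\exp\!\big(e_z^\top W \bar h_j\big)}{\sum_{v \in V} \exp\!\big(e_v^\top W \bar h_j\big)},$$
whose lower bound is the empirical conditional entropy $\sum_j \hat\pi_j H(\hat p_j)$. Under the NTP-separability conditions, GD drives $\mathcal{L}(W)$ to this bound; on the appropriate data subspace its projection converges to the solution of the linear system requiring, for every context $j$ and in-support tokens $z, z' \in \mathcal{S}_j$,
$$(e_z - e_{z'})^\top W \bar h_j \;=\; \log\frac{\hat p_{j,z}}{\hat p_{j,z'}},$$
while on the orthogonal subspace the iterates diverge along the direction of the minimum-Euclidean-norm parameters satisfying the NTP-separability conditions (the max-margin quadratic program).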
Related papers
- Preference Alignment Improves Language Model-Based TTS [76.70693823683091]
Preference alignment algorithms adjust LMs to align with the preferences of reward models, enhancing the desirability of the generated content.
With a 1.15B parameter LM-based TTS model, we demonstrate that preference alignment consistently improves intelligibility, speaker similarity, and proxy subjective evaluation scores.
arXiv Detail & Related papers (2024-09-19T01:58:19Z) - Implicit Geometry of Next-token Prediction: From Language Sparsity Patterns to Model Representations [24.211603400355756]
Next-token prediction (NTP) over large text corpora has become the go-to paradigm to train large language models.
We look at how NTP influences the mapping of linguistic patterns to geometric properties of the resulting model representations.
We validate our findings on synthetic and small-scale real language datasets.
arXiv Detail & Related papers (2024-08-27T21:46:47Z) - RoPINN: Region Optimized Physics-Informed Neural Networks [66.38369833561039]
Physics-informed neural networks (PINNs) have been widely applied to solve partial differential equations (PDEs).
This paper proposes and theoretically studies a new training paradigm as region optimization.
A practical training algorithm, Region Optimized PINN (RoPINN), is seamlessly derived from this new paradigm.
arXiv Detail & Related papers (2024-05-23T09:45:57Z) - Functional Graphical Models: Structure Enables Offline Data-Driven Optimization [111.28605744661638]
We show how structure can enable sample-efficient data-driven optimization.
We also present a data-driven optimization algorithm that infers the FGM structure itself.
arXiv Detail & Related papers (2024-01-08T22:33:14Z) - Exploiting Inferential Structure in Neural Processes [15.058161307401864]
Neural Processes (NPs) are appealing due to their ability to perform fast adaptation based on a context set.
We provide a framework that allows NPs' latent variable to be given a rich prior defined by a graphical model.
arXiv Detail & Related papers (2023-06-27T03:01:43Z) - Neural Processes with Stochastic Attention: Paying more attention to the
context dataset [11.301294319986477]
Neural processes (NPs) aim to complete unseen data points based on a given context dataset.
We propose a stochastic attention mechanism for NPs to capture appropriate context information.
We empirically show that our approach substantially outperforms conventional NPs in various domains.
arXiv Detail & Related papers (2022-04-11T23:57:19Z) - Probabilistic Circuits for Variational Inference in Discrete Graphical
Models [101.28528515775842]
Inference in discrete graphical models with variational methods is difficult.
Many sampling-based methods have been proposed for estimating the Evidence Lower Bound (ELBO).
We propose a new approach that leverages the tractability of probabilistic circuit models, such as Sum Product Networks (SPNs).
We show that selective-SPNs are suitable as an expressive variational distribution, and prove that when the log-density of the target model is a polynomial, the corresponding ELBO can be computed analytically.
arXiv Detail & Related papers (2020-10-22T05:04:38Z) - Learning Reasoning Strategies in End-to-End Differentiable Proving [50.9791149533921]
Conditional Theorem Provers learn optimal rule selection strategy via gradient-based optimisation.
We show that Conditional Theorem Provers are scalable and yield state-of-the-art results on the CLUTRR dataset.
arXiv Detail & Related papers (2020-07-13T16:22:14Z) - Deep connections between learning from limited labels & physical parameter estimation -- inspiration for regularization [0.0]
We show that explicit regularization of model parameters in PDE constrained optimization translates to regularization of the network output.
A hyperspectral imaging example shows that minimum prior information together with cross-validation for optimal regularization parameters boosts the segmentation accuracy.
arXiv Detail & Related papers (2020-03-17T19:33:50Z) - Supervised Learning for Non-Sequential Data: A Canonical Polyadic
Decomposition Approach [85.12934750565971]
Efficient modelling of feature interactions underpins supervised learning for non-sequential tasks.
To alleviate this issue, it has been proposed to implicitly represent the model parameters as a tensor.
For enhanced expressiveness, we generalize the framework to allow feature mapping to arbitrarily high-dimensional feature vectors.
arXiv Detail & Related papers (2020-01-27T22:38:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.