Related papers: Implicit Bias of Next-Token Prediction

URL: http://arxiv.org/abs/2402.18551v1
Date: Wed, 28 Feb 2024 18:34:53 GMT
Title: Implicit Bias of Next-Token Prediction
Authors: Christos Thrampoulidis
Abstract summary: Next-token prediction (NTP) involves predicting the next token in a sequence. This work frames NTP training as cross-entropy minimization over distinct empirical contexts.
Score: 32.2896512612788
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: Next-token prediction (NTP), the go-to training paradigm in training large language models, involves predicting the next token in a sequence. Departing from traditional one-hot classification, in NTP, multiple tokens with varying frequencies follow each given context. This work frames NTP training as cross-entropy minimization over distinct contexts, each associated with a sparse empirical probability vector across a finite vocabulary. It then addresses the following question: do gradient-based optimizers exhibit a bias towards solutions with specific structure as the NTP training loss reaches its lower bound (entropy)? Specifically, for linear NTP models trained using gradient descent (GD), we make the following contributions: Firstly, we determine NTP-separability conditions on the data, under which GD can attain its lower bound. We also demonstrate that these conditions hold under overparameterization. Secondly, we establish that the parameters of GD projected onto an appropriate data subspace converge to the unique solution of a system of linear equations, which requires the logits' difference of in-support tokens to be equal to the log-ratio of their respective probabilities. Meanwhile, on the orthogonal subspace, the parameters diverge and converge in the direction of the solution of a max-margin quadratic program, minimizing the Euclidean norm of parameters satisfying the NTP-separability conditions. Akin to prior research on implicit bias of one-hot classification, our work opens exciting avenues for future research that can lead to better understanding optimization, generalization and robustness principles of models trained with NTP.
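
The framing in the abstract can be illustrated with a small numerical sketch. It is not the paper's code: the toy probabilities, the context weights, and the helper names (ntp_loss, entropy_lower_bound, logits_for) are all assumptions made for illustration, and it works directly on logits rather than on the parameters of a linear model. It shows that when in-support logit differences equal the log-ratios of the token probabilities and out-of-support logits are pushed towards minus infinity, the cross-entropy over distinct contexts approaches the entropy lower bound.

```python
import numpy as np

def ntp_loss(logits, probs, weights):
    """Cross-entropy over distinct contexts.

    logits:  (m, V) model logits, one row per distinct context
    probs:   (m, V) sparse empirical next-token probabilities (rows sum to 1)
    weights: (m,)   empirical frequency of each context (sums to 1)
    """
    # Numerically stable log-softmax along the vocabulary axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_softmax = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-(weights[:, None] * probs * log_softmax).sum())

def entropy_lower_bound(probs, weights):
    """Weighted conditional entropy: the lower bound of ntp_loss."""
    safe = np.where(probs > 0, probs, 1.0)  # log(1) = 0 for out-of-support tokens
    return float(-(weights[:, None] * probs * np.log(safe)).sum())

# Toy data (assumed): 2 distinct contexts, vocabulary of 4 tokens.
probs = np.array([[0.5, 0.5, 0.0, 0.0],
                  [0.0, 0.25, 0.0, 0.75]])
weights = np.array([0.6, 0.4])

def logits_for(R):
    """In-support logits equal log-probabilities, so their differences are
    the log-ratios; out-of-support logits are set to -R."""
    safe = np.where(probs > 0, probs, 1.0)
    return np.where(probs > 0, np.log(safe), -float(R))

bound = entropy_lower_bound(probs, weights)
gaps = [ntp_loss(logits_for(R), probs, weights) - bound for R in (1, 5, 10, 30)]
print("entropy lower bound:", bound)
print("loss minus bound as R grows:", gaps)
```

In this sketch the gap stays positive for every finite R and shrinks as R grows, which mirrors the abstract's picture of a finite component fixed by the log-ratio equations and a second component that must diverge for the loss to reach its lower bound.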

Related papers

Preference Alignment Improves Language Model-Based TTS [76.70693823683091]
Preference alignment algorithms adjust LMs to align with the preferences of reward models, enhancing the desirability of the generated content. With a 1.15B parameter LM-based TTS model, we demonstrate that preference alignment consistently improves intelligibility, speaker similarity, and proxy subjective evaluation scores.
arXiv Detail & Related papers (2024-09-19T01:58:19Z)
Implicit Geometry of Next-token Prediction: From Language Sparsity Patterns to Model Representations [24.211603400355756]
Next-token prediction (NTP) over large text corpora has become the go-to paradigm to train large language models. We look at how NTP influences the mapping of linguistic patterns to geometric properties of the resulting model representations. We validate our findings on synthetic and small-scale real language datasets.
arXiv Detail & Related papers (2024-08-27T21:46:47Z)
RoPINN: Region Optimized Physics-Informed Neural Networks [66.38369833561039]
Physics-informed neural networks (PINNs) have been widely applied to solve partial differential equations (PDEs). This paper proposes and theoretically studies a new training paradigm as region optimization. A practical training algorithm, Region Optimized PINN (RoPINN), is seamlessly derived from this new paradigm.
arXiv Detail & Related papers (2024-05-23T09:45:57Z)
Functional Graphical Models: Structure Enables Offline Data-Driven Optimization [111.28605744661638]
We show how structure can enable sample-efficient data-driven optimization. We also present a data-driven optimization algorithm that infers the FGM structure itself.
arXiv Detail & Related papers (2024-01-08T22:33:14Z)
Exploiting Inferential Structure in Neural Processes [15.058161307401864]
Neural Processes (NPs) are appealing due to their ability to perform fast adaptation based on a context set. We provide a framework that allows NPs' latent variable to be given a rich prior defined by a graphical model.
arXiv Detail & Related papers (2023-06-27T03:01:43Z)
Neural Processes with Stochastic Attention: Paying more attention to the context dataset [11.301294319986477]
Neural processes (NPs) aim to complete unseen data points based on a given context dataset. We propose an attention mechanism for NPs to capture appropriate context information. We empirically show that our approach substantially outperforms conventional NPs in various domains.
arXiv Detail & Related papers (2022-04-11T23:57:19Z)
Probabilistic Circuits for Variational Inference in Discrete Graphical Models [101.28528515775842]
Inference in discrete graphical models with variational methods is difficult. Many sampling-based methods have been proposed for estimating the Evidence Lower Bound (ELBO). We propose a new approach that leverages the tractability of probabilistic circuit models, such as Sum Product Networks (SPN). We show that selective-SPNs are suitable as an expressive variational distribution, and prove that when the log-density of the target model is a polynomial, the corresponding ELBO can be computed analytically.
arXiv Detail & Related papers (2020-10-22T05:04:38Z)
Learning Reasoning Strategies in End-to-End Differentiable Proving [50.9791149533921]
Conditional Theorem Provers learn an optimal rule selection strategy via gradient-based optimisation. We show that Conditional Theorem Provers are scalable and yield state-of-the-art results on the CLUTRR dataset.
arXiv Detail & Related papers (2020-07-13T16:22:14Z)
Deep connections between learning from limited labels & physical parameter estimation -- inspiration for regularization [0.0]
We show that explicit regularization of model parameters in PDE constrained optimization translates to regularization of the network output. A hyperspectral imaging example shows that minimum prior information together with cross-validation for optimal regularization parameters boosts the segmentation accuracy.
arXiv Detail & Related papers (2020-03-17T19:33:50Z)
Supervised Learning for Non-Sequential Data: A Canonical Polyadic Decomposition Approach [85.12934750565971]
Efficient modelling of feature interactions underpins supervised learning for non-sequential tasks. To alleviate this issue, it has been proposed to implicitly represent the model parameters as a tensor. For enhanced expressiveness, we generalize the framework to allow feature mapping to arbitrarily high-dimensional feature vectors.
arXiv Detail & Related papers (2020-01-27T22:38:40Z)