Related papers: CausalLM is not optimal for in-context learning

CausalLM is not optimal for in-context learning

URL: http://arxiv.org/abs/2308.06912v3
Date: Tue, 20 Feb 2024 22:48:06 GMT
Title: CausalLM is not optimal for in-context learning
Authors: Nan Ding, Tomer Levinboim, Jialin Wu, Sebastian Goodman, Radu Soricut
Abstract summary: Recent empirical evidence indicates that transformer based in-context learning performs better when using a prefix language model (LM) While this result is intuitive, it is not understood from a theoretical perspective. We take a theoretical approach and analyze the convergence behavior of prefixLM and causalLM under a certain parameter construction.
Score: 21.591451511589693
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent empirical evidence indicates that transformer based in-context learning performs better when using a prefix language model (prefixLM), in which in-context samples can all attend to each other, compared to causal language models (causalLM), which use auto-regressive attention that prohibits in-context samples to attend to future samples. While this result is intuitive, it is not understood from a theoretical perspective. In this paper we take a theoretical approach and analyze the convergence behavior of prefixLM and causalLM under a certain parameter construction. Our analysis shows that both LM types converge to their stationary points at a linear rate, but that while prefixLM converges to the optimal solution of linear regression, causalLM convergence dynamics follows that of an online gradient descent algorithm, which is not guaranteed to be optimal even as the number of samples grows infinitely. We supplement our theoretical claims with empirical experiments over synthetic and real tasks and using various types of transformers. Our experiments verify that causalLM consistently underperforms prefixLM in all settings.

Related papers

Entropy Law: The Story Behind Data Compression and LLM Performance [115.70395740286422]
We find that model performance is negatively correlated to the compression ratio of training data, which usually yields a lower training loss. Based on the findings of the entropy law, we propose a quite efficient and universal data selection method. We also present an interesting application of entropy law that can detect potential performance risks at the beginning of model training.
arXiv Detail & Related papers (2024-07-09T08:14:29Z)
On Mesa-Optimization in Autoregressively Trained Transformers: Emergence and Capability [34.43255978863601]
Several suggest that transformers learn a mesa-optimizer during autorere training. We show that a stronger assumption related to the moments of data is the sufficient necessary condition that the learned mesa-optimizer can perform.
arXiv Detail & Related papers (2024-05-27T05:41:06Z)
Surgical Feature-Space Decomposition of LLMs: Why, When and How? [8.826164604720738]
We empirically study the efficacy of weight and feature space decomposition in transformer-based language models. We show that surgical decomposition provides critical insights into the trade-off between compression and language modelling performance. We extend our investigation to the implications of low-rank approximations on model bias.
arXiv Detail & Related papers (2024-05-17T07:34:03Z)
Amortizing intractable inference in large language models [56.92471123778389]
We use amortized Bayesian inference to sample from intractable posterior distributions. We empirically demonstrate that this distribution-matching paradigm of LLM fine-tuning can serve as an effective alternative to maximum-likelihood training. As an important application, we interpret chain-of-thought reasoning as a latent variable modeling problem.
arXiv Detail & Related papers (2023-10-06T16:36:08Z)
On the Relation between Internal Language Model and Sequence Discriminative Training for Neural Transducers [52.88268942796418]
Internal language model (ILM) subtraction has been widely applied to improve the performance of the RNN-Transducer. We show that sequence discriminative training has a strong correlation with ILM subtraction from both theoretical and empirical points of view.
arXiv Detail & Related papers (2023-09-25T13:35:28Z)
Explaining Emergent In-Context Learning as Kernel Regression [61.57151500616111]
Large language models (LLMs) have initiated a paradigm shift in transfer learning. In this paper, we investigate the reason why a transformer-based language model can accomplish in-context learning after pre-training. We find that during ICL, the attention and hidden features in LLMs match the behaviors of a kernel regression.
arXiv Detail & Related papers (2023-05-22T06:45:02Z)
Rethinking Neural vs. Matrix-Factorization Collaborative Filtering: the Theoretical Perspectives [18.204325860752768]
Recent work argues that matrix-factorization collaborative filtering (MCF) compares favorably to neural collaborative filtering (NCF) In this paper, we address the comparison rigorously by answering the following questions.
arXiv Detail & Related papers (2021-10-23T04:55:21Z)
On Language Model Integration for RNN Transducer based Speech Recognition [49.84285563767935]
We study various ILM correction-based LM integration methods formulated in a common RNN-T framework. We provide a decoding interpretation on two major reasons for performance improvement with ILM correction. We also propose an exact-ILM training framework by extending the proof given in the hybrid autoregressive transducer.
arXiv Detail & Related papers (2021-10-13T16:30:46Z)
On the Convergence Rate of Projected Gradient Descent for a Back-Projection based Objective [58.33065918353532]
We consider a back-projection based fidelity term as an alternative to the common least squares (LS) We show that using the BP term, rather than the LS term, requires fewer iterations of optimization algorithms.
arXiv Detail & Related papers (2020-05-03T00:58:23Z)

This list is automatically generated from the titles and abstracts of the papers in this site.