Better & Faster Large Language Models via Multi-token Prediction
- URL: http://arxiv.org/abs/2404.19737v1
- Date: Tue, 30 Apr 2024 17:33:57 GMT
- Title: Better & Faster Large Language Models via Multi-token Prediction
- Authors: Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, Gabriel Synnaeve
- Abstract summary: Large language models such as GPT and Llama are trained with a next-token prediction loss.
We suggest that training language models to predict multiple future tokens at once results in higher sample efficiency.
- Score: 29.067271500844928
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models such as GPT and Llama are trained with a next-token prediction loss. In this work, we suggest that training language models to predict multiple future tokens at once results in higher sample efficiency. More specifically, at each position in the training corpus, we ask the model to predict the following n tokens using n independent output heads, operating on top of a shared model trunk. Considering multi-token prediction as an auxiliary training task, we measure improved downstream capabilities with no overhead in training time for both code and natural language models. The method is increasingly useful for larger model sizes, and keeps its appeal when training for multiple epochs. Gains are especially pronounced on generative benchmarks like coding, where our models consistently outperform strong baselines by several percentage points. Our 13B parameter models solve 12% more problems on HumanEval and 17% more on MBPP than comparable next-token models. Experiments on small algorithmic tasks demonstrate that multi-token prediction is favorable for the development of induction heads and algorithmic reasoning capabilities. As an additional benefit, models trained with 4-token prediction are up to 3 times faster at inference, even with large batch sizes.
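The mechanism described in the abstract lends itself to a short sketch: a shared transformer trunk feeds n independent output heads, and the per-head cross-entropy losses (head i predicting the token i+1 positions ahead) are summed as an auxiliary training objective. The minimal PyTorch code below is an illustration under assumed sizes and a stub trunk, not the authors' implementation; causal masking and other training details are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTokenPredictor(nn.Module):
    """Shared trunk with n independent output heads, head i predicting the
    token i+1 positions ahead. Minimal sketch: the trunk is a stub for a full
    causal transformer and all sizes are illustrative assumptions."""

    def __init__(self, vocab_size: int, d_model: int = 512, n_future: int = 4):
        super().__init__()
        self.n_future = n_future
        self.embed = nn.Embedding(vocab_size, d_model)
        # Stand-in for the shared transformer trunk (causal masking omitted).
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=2)
        # One independent linear head per future-token offset.
        self.heads = nn.ModuleList(
            nn.Linear(d_model, vocab_size) for _ in range(n_future)
        )

    def forward(self, tokens: torch.Tensor) -> list[torch.Tensor]:
        h = self.trunk(self.embed(tokens))       # (batch, seq, d_model)
        return [head(h) for head in self.heads]  # n_future logit tensors

def multi_token_loss(all_logits: list[torch.Tensor], tokens: torch.Tensor) -> torch.Tensor:
    """Sum of per-head cross-entropy losses; head i targets tokens shifted by i+1."""
    total = torch.zeros((), device=tokens.device)
    for i, logits in enumerate(all_logits):
        offset = i + 1
        pred = logits[:, :-offset]                # positions that have a target
        target = tokens[:, offset:]
        total = total + F.cross_entropy(
            pred.reshape(-1, pred.size(-1)), target.reshape(-1)
        )
    return total
```

At inference time the model can decode with the next-token head alone, or use the extra heads to draft several tokens per step in a speculative-decoding style, which is consistent with the up-to-3x speedups mentioned in the abstract.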
Related papers
- Improving Next Tokens via Second-Last Predictions with Generate and Refine [1.8592384822257952]
We train a decoder-only architecture to predict the second-last token of a sequence of tokens.
Our approach yields higher computational training efficiency than BERT-style models.
arXiv Detail & Related papers (2024-11-23T22:09:58Z)
- Faster Language Models with Better Multi-Token Prediction Using Tensor Decomposition [5.575078692353885]
We propose a new model for multi-token prediction in transformers, aiming to enhance sampling efficiency without compromising accuracy.
By generalizing it to a rank-$r$ canonical probability decomposition, we develop an improved model that predicts multiple tokens simultaneously.
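A hedged reading of the rank-$r$ canonical decomposition: the joint distribution over the next n tokens is modeled as a mixture of r rank-one terms, each a product of per-token distributions. The sketch below (layer names, sizes, and the mixture parameterization are assumptions, not the paper's implementation) shows one way such a head could be written.

```python
import torch
import torch.nn as nn

class RankRJointHead(nn.Module):
    """Joint distribution over the next n tokens as a rank-r mixture:
        p(t_1..t_n | h) = sum_k w_k(h) * prod_i p_{k,i}(t_i | h)
    Illustrative sketch only."""

    def __init__(self, d_model: int, vocab_size: int, n_future: int = 2, rank: int = 4):
        super().__init__()
        self.n_future, self.rank = n_future, rank
        self.mixture = nn.Linear(d_model, rank)  # mixture weights w_k(h)
        self.factors = nn.ModuleList(
            nn.Linear(d_model, vocab_size) for _ in range(rank * n_future)
        )

    def log_prob(self, h: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        """h: (batch, d_model) trunk state; targets: (batch, n_future) token ids."""
        log_w = torch.log_softmax(self.mixture(h), dim=-1)  # (batch, rank)
        comps = []
        for k in range(self.rank):
            term = torch.zeros(h.size(0), device=h.device)
            for i in range(self.n_future):
                logits = self.factors[k * self.n_future + i](h)
                logp = torch.log_softmax(logits, dim=-1)
                term = term + logp.gather(1, targets[:, i : i + 1]).squeeze(1)
            comps.append(term)
        comp = torch.stack(comps, dim=-1)                    # (batch, rank)
        return torch.logsumexp(log_w + comp, dim=-1)         # (batch,)
```

Training would minimize the negative of this log-probability; rank r = 1 recovers the independent-heads factorization.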
arXiv Detail & Related papers (2024-10-23T11:06:36Z)
- Rho-1: Not All Tokens Are What You Need [132.31428897792114]
Previous language model pre-training methods uniformly applied a next-token prediction loss to all training tokens.
Our initial analysis examines the token-level training dynamics of language models, revealing distinct loss patterns for different tokens.
Unlike traditional LMs that learn to predict every next token in a corpus, Rho-1 employs Selective Language Modeling (SLM), which trains only on a selected subset of useful tokens.
Rho-1 yields an absolute improvement in few-shot accuracy of up to 30% on 9 math tasks.
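A minimal sketch of selective language modeling, under the assumption that token selection compares per-token losses against a frozen reference model and keeps the tokens with the largest excess loss; the selection criterion and keep ratio below are illustrative, not Rho-1's exact recipe.

```python
import torch
import torch.nn.functional as F

def selective_lm_loss(logits: torch.Tensor, ref_logits: torch.Tensor,
                      targets: torch.Tensor, keep_ratio: float = 0.6) -> torch.Tensor:
    """Selective language modeling sketch: back-propagate only through the
    tokens with the largest excess loss relative to a frozen reference model.
    logits, ref_logits: (batch, seq, vocab); targets: (batch, seq).
    The excess-loss criterion and keep_ratio are illustrative assumptions."""
    vocab = logits.size(-1)
    loss = F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1),
                           reduction="none")
    with torch.no_grad():
        ref_loss = F.cross_entropy(ref_logits.reshape(-1, vocab),
                                   targets.reshape(-1), reduction="none")
        k = max(1, int(keep_ratio * loss.numel()))
        keep_idx = torch.topk(loss - ref_loss, k).indices  # hardest-for-us tokens
        mask = torch.zeros_like(loss)
        mask[keep_idx] = 1.0
    return (loss * mask).sum() / mask.sum()
```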
arXiv Detail & Related papers (2024-04-11T17:52:01Z)
- Language models scale reliably with over-training and on downstream tasks [121.69867718185125]
Scaling laws are useful guides for derisking expensive training runs.
However, there remain gaps between current studies and how language models are trained.
For example, scaling laws mostly predict loss, whereas models are usually compared on downstream task performance.
arXiv Detail & Related papers (2024-03-13T13:54:00Z)
- MiLe Loss: a New Loss for Mitigating the Bias of Learning Difficulties in Generative Language Models [40.992566245706996]
We propose the MiLe Loss function to mitigate the bias introduced by differing learning difficulties across tokens.
We train generative language models at different scales of 468M, 1.2B, and 6.7B parameters.
Experiments reveal that models incorporating the proposed MiLe Loss can gain consistent performance improvement on downstream benchmarks.
arXiv Detail & Related papers (2023-10-30T13:33:21Z)
- Language models are better than humans at next-token prediction [3.092847651108554]
It is not clear whether language models are better or worse than humans at next token prediction.
We find humans to be consistently worse than even relatively small language models like GPT3-Ada at next-token prediction.
arXiv Detail & Related papers (2022-12-21T17:58:01Z)
- Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z)
- From Good to Best: Two-Stage Training for Cross-lingual Machine Reading Comprehension [51.953428342923885]
We develop a two-stage approach to enhance model performance.
The first stage targets recall: we design a hard-learning (HL) algorithm to maximize the likelihood that the top-k predictions contain the accurate answer.
The second stage focuses on precision: an answer-aware contrastive learning mechanism is developed to learn the fine differences between the accurate answer and the other candidates.
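The second stage can be pictured as a standard contrastive objective over answer candidates: push the gold answer's score above those of the other top-k candidates. The InfoNCE-style form and temperature below are illustrative assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def answer_contrastive_loss(candidate_scores: torch.Tensor,
                            gold_index: torch.Tensor,
                            temperature: float = 0.1) -> torch.Tensor:
    """Contrastive sketch over top-k answer candidates.
    candidate_scores: (batch, k) similarity between the question context and
    each candidate answer; gold_index: (batch,) position of the gold answer.
    The softmax/InfoNCE form and temperature are illustrative assumptions."""
    return F.cross_entropy(candidate_scores / temperature, gold_index)
```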
arXiv Detail & Related papers (2021-12-09T07:31:15Z)
- ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators [108.3381301768299]
Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens.
We propose a more sample-efficient pre-training task called replaced token detection.
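Replaced token detection can be summarized in a few lines: every position of the corrupted input gets a binary label (original vs. replaced by a small generator), and the discriminator is trained with per-token binary cross-entropy. The sketch below is a simplified illustration that omits the generator, masking, and loss weighting.

```python
import torch
import torch.nn.functional as F

def replaced_token_detection_loss(disc_logits: torch.Tensor,
                                  original_ids: torch.Tensor,
                                  corrupted_ids: torch.Tensor) -> torch.Tensor:
    """Per-token binary classification: was this position replaced?
    disc_logits: (batch, seq) discriminator scores on the corrupted sequence;
    original_ids / corrupted_ids: (batch, seq) token ids before and after the
    generator fills in the masked positions."""
    is_replaced = (corrupted_ids != original_ids).float()
    return F.binary_cross_entropy_with_logits(disc_logits, is_replaced)
```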
arXiv Detail & Related papers (2020-03-23T21:17:42Z)
- AvgOut: A Simple Output-Probability Measure to Eliminate Dull Responses [97.50616524350123]
We build dialogue models that are dynamically aware of what utterances or tokens are dull without any feature-engineering.
The first model, MinAvgOut, directly maximizes the diversity score through the output distributions of each batch.
The second model, Label Fine-Tuning (LFT), prepends to the source sequence a label continuously scaled by the diversity score to control the diversity level.
The third model, RL, adopts Reinforcement Learning and treats the diversity score as a reward signal.
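As a rough illustration of the kind of batch-level diversity signal these models optimize, the sketch below averages the decoder's output distributions over a batch and scores how little probability mass concentrates on a small set of (typically dull) tokens; the exact AvgOut measure in the paper may differ.

```python
import torch

def batch_diversity_score(output_probs: torch.Tensor, top_m: int = 100) -> torch.Tensor:
    """Rough batch-level diversity proxy. output_probs: (batch, seq, vocab)
    softmax outputs of the decoder. Average the distributions over the batch
    and time steps, then score how little mass sits on the top-m most probable
    (typically dull) tokens. This scoring rule is an illustrative assumption,
    not the paper's exact AvgOut measure."""
    avg_dist = output_probs.reshape(-1, output_probs.size(-1)).mean(dim=0)  # (vocab,)
    dull_mass = torch.topk(avg_dist, top_m).values.sum()
    return 1.0 - dull_mass  # closer to 1 when probability mass is spread out
```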
arXiv Detail & Related papers (2020-01-15T18:32:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.