Think before you speak: Training Language Models With Pause Tokens
- URL: http://arxiv.org/abs/2310.02226v3
- Date: Sun, 21 Apr 2024 03:39:21 GMT
- Title: Think before you speak: Training Language Models With Pause Tokens
- Authors: Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, Vaishnavh Nagarajan,
- Abstract summary: Language models generate responses by producing a series of tokens in immediate succession.
What if instead we were to let the model manipulate say, $K+10$ hidden vectors, before it outputs the $(K+1)th$ token?
We operationalize this idea by performing training and inference on language models with a (learnable) $textitpause$ token.
- Score: 73.61375226378712
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language models generate responses by producing a series of tokens in immediate succession: the $(K+1)^{th}$ token is an outcome of manipulating $K$ hidden vectors per layer, one vector per preceding token. What if instead we were to let the model manipulate say, $K+10$ hidden vectors, before it outputs the $(K+1)^{th}$ token? We operationalize this idea by performing training and inference on language models with a (learnable) $\textit{pause}$ token, a sequence of which is appended to the input prefix. We then delay extracting the model's outputs until the last pause token is seen, thereby allowing the model to process extra computation before committing to an answer. We empirically evaluate $\textit{pause-training}$ on decoder-only models of 1B and 130M parameters with causal pretraining on C4, and on downstream tasks covering reasoning, question-answering, general understanding and fact recall. Our main finding is that inference-time delays show gains when the model is both pre-trained and finetuned with delays. For the 1B model, we witness gains on 8 of 9 tasks, most prominently, a gain of $18\%$ EM score on the QA task of SQuAD, $8\%$ on CommonSenseQA and $1\%$ accuracy on the reasoning task of GSM8k. Our work raises a range of conceptual and practical future research questions on making delayed next-token prediction a widely applicable new paradigm.
Related papers
- Language Models Can Predict Their Own Behavior [28.80639362933004]
We show that internal representation of input tokens alone can often precisely predict, not just the next token, but eventual behavior over the entire output sequence.
We leverage this capacity and learn probes on internal states to create early warning (and exit) systems.
Specifically, if the probes can confidently estimate the way the LM is going to behave, then the system will avoid generating tokens altogether and return the estimated behavior instead.
arXiv Detail & Related papers (2025-02-18T23:13:16Z) - Computational-Statistical Tradeoffs at the Next-Token Prediction Barrier: Autoregressive and Imitation Learning under Misspecification [50.717692060500696]
Next-token prediction with the logarithmic loss is a cornerstone of autoregressive sequence modeling.
Next-token prediction can be made robust so as to achieve $C=tilde O(H)$, representing moderate error amplification.
No computationally efficient algorithm can achieve sub-polynomial approximation factor $C=e(log H)1-Omega(1)$.
arXiv Detail & Related papers (2025-02-18T02:52:00Z) - Better & Faster Large Language Models via Multi-token Prediction [29.067271500844928]
Large language models such as GPT and Llama are trained with a next-token prediction loss.
We suggest that training language models to predict multiple future tokens at once results in higher sample efficiency.
arXiv Detail & Related papers (2024-04-30T17:33:57Z) - Rho-1: Not All Tokens Are What You Need [132.31428897792114]
Previous language model pre-training methods uniformly applied a next-token prediction loss to all training tokens.
Rho-1 employs Selective Language Modeling (SLM), which selectively trains on useful tokens that align with desired distribution.
When continual pretraining on 15B OpenWebMath corpus, Rho-1 yields an absolute improvement in few-shot accuracy of up to 30% in 9 math tasks.
arXiv Detail & Related papers (2024-04-11T17:52:01Z) - Language models scale reliably with over-training and on downstream tasks [121.69867718185125]
Scaling laws are useful guides for derisking expensive training runs.
However, there remain gaps between current studies and how language models are trained.
In contrast, scaling laws mostly predict loss on inference, but models are usually compared on downstream task performance.
arXiv Detail & Related papers (2024-03-13T13:54:00Z) - Mechanics of Next Token Prediction with Self-Attention [41.82477691012942]
Transformer-based language models are trained on large datasets to predict the next token given an input sequence.
We show that training self-attention with gradient descent learns an automaton which generates the next token in two distinct steps.
We hope that these findings shed light on how self-attention processes sequential data and pave the path toward demystifying more complex architectures.
arXiv Detail & Related papers (2024-03-12T21:15:38Z) - Language Model Inversion [77.22715643068284]
We show that next-token probabilities contain a surprising amount of information about the preceding text.
Our inversion method reconstructs prompts with a BLEU of $59$ and token-level F1 of $78$ and recovers $27%$ of prompts exactly.
arXiv Detail & Related papers (2023-11-22T19:04:04Z) - Token Dropping for Efficient BERT Pretraining [33.63507016806947]
We develop a simple but effective "token dropping" method to accelerate the pretraining of transformer models.
We leverage the already built-in masked language modeling (MLM) loss to identify unimportant tokens with practically no computational overhead.
This simple approach reduces the pretraining cost of BERT by 25% while achieving similar overall fine-tuning performance on standard downstream tasks.
arXiv Detail & Related papers (2022-03-24T17:50:46Z) - Fast End-to-End Speech Recognition via a Non-Autoregressive Model and
Cross-Modal Knowledge Transferring from BERT [72.93855288283059]
We propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once)
The model consists of an encoder, a decoder, and a position dependent summarizer (PDS)
arXiv Detail & Related papers (2021-02-15T15:18:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.