Catch Your Breath: Adaptive Computation for Self-Paced Sequence Production
- URL: http://arxiv.org/abs/2510.13879v1
- Date: Mon, 13 Oct 2025 21:07:05 GMT
- Title: Catch Your Breath: Adaptive Computation for Self-Paced Sequence Production
- Authors: Alexandre Galashov, Matt Jones, Rosemary Ke, Yuan Cao, Vaishnavh Nagarajan, Michael C. Mozer
- Abstract summary: We explore a class of supervised training objectives that allow a language model to dynamically and autonomously scale the number of compute steps used for each input token. For any token, the model can request additional compute steps by emitting a <don't know> output. We find that the CYB model requests additional steps when doing so improves accuracy, and the model adapts its processing time to token-level complexity and context.
- Score: 55.76222360698305
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We explore a class of supervised training objectives that allow a language model to dynamically and autonomously scale the number of compute steps used for each input token. For any token, the model can request additional compute steps by emitting a <don't know> output. If the model is granted a delay, a specialized <pause> token is inserted at the next input step, providing the model with additional compute resources to generate an output. The model can request multiple pauses. To train the model to use <don't know> outputs judiciously and to calibrate its uncertainty, we frame the selection of each output token as a sequential-decision problem with a time cost. We refer to the class of methods as $\textit{Catch Your Breath}$ losses and we study three methods in this class: CYB-AP frames the model's task as anytime prediction, where an output may be required at any step and accuracy is discounted over time; CYB-VA is a variational approach that aims to maximize prediction accuracy subject to a specified distribution over stopping times; and CYB-DP imposes a penalty based on a computational budget. Through fine-tuning experiments, we identify the best performing loss variant. The CYB model needs only one third as much training data as the baseline (no pause) model needs to achieve the same performance, and half as much data as a model with pauses and a cross-entropy loss. We find that the CYB model requests additional steps when doing so improves accuracy, and the model adapts its processing time to token-level complexity and context. For example, it often pauses after plural nouns like $\textit{patients}$ and $\textit{challenges}$ but never pauses after the first token of contracted words like $\textit{wasn}$ and $\textit{didn}$, and it shows high variability for ambiguous tokens like $\textit{won}$, which could function as either a verb or part of a contraction.
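The pause mechanism in the abstract can be sketched as a simple decoding loop: the model either commits to a token or emits <don't know>, in which case a <pause> token is appended to the input and the model is run again, up to a cap. This is a minimal illustrative sketch; the token ids, `model.predict` interface, and `MAX_PAUSES` cap are assumptions, not the paper's actual implementation.

```python
# Sketch of self-paced sequence production via <don't know> / <pause> tokens.
# DONT_KNOW_ID, PAUSE_ID, and the model interface are hypothetical.

DONT_KNOW_ID = 0   # hypothetical id of the <don't know> output
PAUSE_ID = 1       # hypothetical id of the <pause> input token
MAX_PAUSES = 4     # cap on extra compute steps per position

def self_paced_step(model, context, max_pauses=MAX_PAUSES):
    """Return the extended context with the next committed token,
    inserting <pause> tokens while the model keeps emitting
    <don't know> (up to max_pauses)."""
    for _ in range(max_pauses):
        token = model.predict(context)       # one forward pass
        if token != DONT_KNOW_ID:
            return context + [token]
        # Model requested more compute: feed a <pause> token and retry.
        context = context + [PAUSE_ID]
    # Budget exhausted: force a committed prediction.
    token = model.predict(context, allow_dont_know=False)
    return context + [token]
```

The CYB losses described above would then shape how often the model chooses the <don't know> branch, by trading prediction accuracy against the time cost of each extra pause.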
Related papers
- Multi-Token Prediction via Self-Distillation [73.81494481537636]
We consider a new approach for converting a pretrained autoregressive language model from a slow single next-token prediction model into a fast standalone multi-token prediction model. On GSM8K, our method produces models that can decode more than $3\times$ faster on average with a $5\%$ drop in accuracy relative to single-token decoding performance.
arXiv Detail & Related papers (2026-02-05T18:54:48Z) - Learning Shrinks the Hard Tail: Training-Dependent Inference Scaling in a Solvable Linear Model [2.7074235008521246]
We analyze neural scaling laws in a solvable model of last-layer fine-tuning where targets have intrinsic, instance-heterogeneous difficulty. We show that learning shrinks the "hard tail" of the error distribution.
arXiv Detail & Related papers (2026-01-07T10:00:17Z) - Linear-Time Demonstration Selection for In-Context Learning via Gradient Estimation [19.158395403281734]
Given a set of $n$ examples, how can we quickly select $k$ out of $n$ to best serve as the conditioning for downstream inference? This problem has broad applications in prompt tuning and chain-of-thought reasoning. We show that the gradient estimation procedure yields approximations of full inference with less than $1\%$ error across six datasets.
arXiv Detail & Related papers (2025-08-27T15:59:47Z) - Intention-Conditioned Flow Occupancy Models [80.42634994902858]
Large-scale pre-training has fundamentally changed how machine learning research is done today. Applying this same framework to reinforcement learning is appealing because it offers compelling avenues for addressing core challenges in RL. Recent advances in generative AI have provided new tools for modeling highly complex distributions.
arXiv Detail & Related papers (2025-06-10T15:27:46Z) - Language Models Can Predict Their Own Behavior [29.566208688211876]
Language models (LMs) can exhibit specific behaviors, such as a failure to follow alignment training, that we hope to detect and react to during deployment. We show that probes trained on the internal representations of input tokens alone can predict a wide range of eventual behaviors over the entire output sequence. An early warning system built on the probes reduces jailbreaking by 91%.
arXiv Detail & Related papers (2025-02-18T23:13:16Z) - s1: Simple test-time scaling [148.4204982041058]
Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. We seek the simplest approach to achieve test-time scaling and strong reasoning performance.
arXiv Detail & Related papers (2025-01-31T18:48:08Z) - Language models scale reliably with over-training and on downstream tasks [121.69867718185125]
Scaling laws are useful guides for derisking expensive training runs.
However, there remain gaps between current studies and how language models are trained.
For example, scaling laws mostly predict next-token prediction loss, but models are usually compared on downstream task performance.
arXiv Detail & Related papers (2024-03-13T13:54:00Z) - Think before you speak: Training Language Models With Pause Tokens [73.61375226378712]
Language models generate responses by producing a series of tokens in immediate succession.
What if instead we were to let the model manipulate, say, $K+10$ hidden vectors before it outputs the $(K+1)$th token?
We operationalize this idea by performing training and inference on language models with a (learnable) $\textit{pause}$ token.
arXiv Detail & Related papers (2023-10-03T17:32:41Z) - The Right Tool for the Job: Matching Model and Instance Complexities [62.95183777679024]
As NLP models become larger, executing a trained model requires significant computational resources incurring monetary and environmental costs.
We propose a modification to contextual representation fine-tuning which, during inference, allows for an early (and fast) "exit"
We test our proposed modification on five different datasets in two tasks: three text classification datasets and two natural language inference benchmarks.
arXiv Detail & Related papers (2020-04-16T04:28:08Z)
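The early-exit idea in the last entry above can be sketched as a confidence-gated loop over per-layer classifiers: inference stops at the first intermediate classifier whose confidence clears a threshold. The per-layer classifier setup and threshold value are illustrative assumptions, not that paper's exact configuration.

```python
# Sketch of confidence-based early exit for classification.
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def early_exit_classify(layer_logits, threshold=0.9):
    """Return (prediction, exit_layer): stop at the first layer whose
    top softmax probability exceeds `threshold`; otherwise fall through
    to the final layer. `layer_logits` is a list of per-layer logit
    vectors (assumes a classifier head after each layer)."""
    for depth, logits in enumerate(layer_logits, start=1):
        probs = softmax(logits)
        conf = max(probs)
        if conf >= threshold or depth == len(layer_logits):
            return probs.index(conf), depth
```

Easy inputs exit early and cheaply, while hard inputs use the full network, which is the same compute-matched-to-difficulty theme the Catch Your Breath losses pursue at the token level.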
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.