Expanding Computation Spaces of LLMs at Inference Time
- URL: http://arxiv.org/abs/2509.24884v1
- Date: Mon, 29 Sep 2025 14:59:44 GMT
- Title: Expanding Computation Spaces of LLMs at Inference Time
- Authors: Yoonna Jang, Kisu Yang, Isabelle Augenstein
- Abstract summary: Chain-of-thought (CoT) rationale enables language models to use additional task-related text for problem-solving. We investigate whether language models can leverage artificially inserted sequences of filler tokens solely at inference.
- Score: 33.17624792878245
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Chain-of-thought (CoT) rationale enables language models to use additional task-related text for problem-solving, benefiting not only from detailed reasoning steps but also from the expanded computational space of longer inputs. Prior work has trained filler or special tokens to serve as additional computation spaces. In this study, we investigate whether language models can leverage artificially inserted sequences of filler tokens solely at inference. We first identify effective token types, numbers, and insertion locations, then examine at what stage of training models begin to exploit the expanded computation space, and finally analyze dynamics within these spaces via attention maps. Experiments on models ranging from 1.7B to 32B across open-domain QA and math tasks show that appropriate token types and counts vary, but placing filler tokens directly before the final 'Answer:' token is most effective. Smaller models benefit most, up to 12.372 percentage points in SmolLM2-1.7B-Instruct, indicating that these spaces act as additional computational capacity rather than redundant input. Attention maps reveal that expanded spaces often continue the original attention mechanism and sometimes focus on questions or answer options, suggesting meaningful computation for problem-solving.
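The insertion step described in the abstract, placing filler tokens directly before the final 'Answer:' token, can be sketched as plain string manipulation on the prompt. This is only an illustrative sketch, not the authors' implementation: the function name `insert_filler_before_answer`, the default filler `"."`, the default count of 8, and the space-separated layout are all assumptions, and the abstract notes that the appropriate token type and count vary by model.

```python
def insert_filler_before_answer(prompt: str, filler: str = ".", count: int = 8,
                                marker: str = "Answer:") -> str:
    """Insert `count` copies of `filler` directly before the last `marker`.

    Hypothetical helper: the filler string, count, and spacing are
    illustrative assumptions, not values taken from the paper.
    """
    if count < 0:
        raise ValueError("count must be non-negative")
    idx = prompt.rfind(marker)  # position of the final occurrence of the marker
    if idx == -1:
        raise ValueError(f"marker {marker!r} not found in prompt")
    if count == 0:
        return prompt  # nothing to insert
    filler_seq = " ".join([filler] * count)
    # Keep everything before the marker, then the filler run, then the marker
    # and whatever followed it in the original prompt.
    return f"{prompt[:idx]}{filler_seq} {prompt[idx:]}"


if __name__ == "__main__":
    example = "Question: What is 2 + 3?\nAnswer:"
    print(insert_filler_before_answer(example, filler=".", count=3))
```

The resulting string would then be passed to the model as usual; since only the input changes, no training or fine-tuning is involved, which matches the inference-only setting the abstract describes.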
Related papers
- All for One: LLMs Solve Mental Math at the Last Token With Information Transferred From Other Tokens [14.890542559477906]
In theory, the combination of causal self-attention and multilayer perceptron layers allows every token to access and compute information based on all preceding tokens. We investigate the question in three steps: inhibiting input-specific token computations in the initial layers, restricting the routes of information transfer across token positions in the next few layers, and forcing all computation to happen at the last token in the remaining layers.
arXiv Detail & Related papers (2025-09-11T17:41:29Z)
- Multipole Attention for Efficient Long Context Reasoning [64.94673641704289]
Large Reasoning Models (LRMs) have shown promising accuracy improvements on complex problem-solving tasks. LRMs need to generate long chain-of-thought reasoning in order to think before answering. We introduce Multipole Attention, which accelerates autoregressive reasoning by only computing exact attention for the most important tokens.
arXiv Detail & Related papers (2025-06-16T03:00:40Z)
- Understanding In-context Learning of Addition via Activation Subspaces [73.8295576941241]
We study a structured family of few-shot learning tasks for which the true prediction rule is to add an integer $k$ to the input. We then perform an in-depth analysis of individual heads, via dimensionality reduction and decomposition. Our results demonstrate how tracking low-dimensional subspaces of localized heads across a forward pass can provide insight into fine-grained computational structures in language models.
arXiv Detail & Related papers (2025-05-08T11:32:46Z)
- Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning [53.57895922042783]
Large Language Models (LLMs) excel at reasoning and planning when trained on chain-of-thought (CoT) data. We propose a hybrid representation of the reasoning process, where we partially abstract away the initial reasoning steps using latent discrete tokens.
arXiv Detail & Related papers (2025-02-05T15:33:00Z)
- SAISA: Towards Multimodal Large Language Models with Both Training and Inference Efficiency [47.03718208259308]
We introduce SAISA, a novel architecture that enhances both training and inference efficiency. Using the same configuration as LLaVA-1.5, SAISA reduces inference FLOPs by 66% and training budget by 26%, while achieving superior performance in terms of accuracy.
arXiv Detail & Related papers (2025-02-04T16:28:53Z)
- Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data [76.90128359866462]
We introduce an extended concept of memorization, distributional memorization, which measures the correlation between the output probabilities and the pretraining data frequency. This study demonstrates that memorization plays a larger role in simpler, knowledge-intensive tasks, while generalization is the key for harder, reasoning-based tasks.
arXiv Detail & Related papers (2024-07-20T21:24:40Z)
- Let's Think Dot by Dot: Hidden Computation in Transformer Language Models [30.972412126012884]
Chain-of-thought responses from language models improve performance across most benchmarks.
We show that transformers can use meaningless filler tokens in place of a chain of thought to solve two hard algorithmic tasks.
We find that learning to use filler tokens is difficult and requires specific, dense supervision to converge.
arXiv Detail & Related papers (2024-04-24T09:30:00Z)
- Identifying and Analyzing Performance-Critical Tokens in Large Language Models [52.404072802235234]
We study how large language models learn to perform tasks from demonstrations. Our work sheds light on how large language models learn to perform tasks from demonstrations and deepens our understanding of the roles different types of tokens play in large language models.
arXiv Detail & Related papers (2024-01-20T20:55:21Z)
- Arithmetic with Language Models: from Memorization to Computation [3.077668143048211]
This work investigates how a language model, trained to predict the next token, can perform arithmetic computations generalizing beyond training data.
We successfully trained a light language model to learn these tasks and ran a number of experiments to investigate the extrapolation capabilities and internal information processing.
arXiv Detail & Related papers (2023-08-02T13:58:37Z)
- Exploring the Space of Key-Value-Query Models with Intention [8.585795909956726]
Two key components of Attention are the structure of its input (which consists of keys, values and queries) and the computations by which these three are combined.
We refer to this space as Keys-Values-Queries (KVQ) Space.
Our goal is to determine whether there are any other stackable models in KVQ Space that Attention cannot efficiently approximate.
arXiv Detail & Related papers (2023-05-17T13:25:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.