How Focused Are LLMs? A Quantitative Study via Repetitive Deterministic Prediction Tasks
- URL: http://arxiv.org/abs/2511.00763v1
- Date: Sun, 02 Nov 2025 01:42:08 GMT
- Title: How Focused Are LLMs? A Quantitative Study via Repetitive Deterministic Prediction Tasks
- Authors: Wanda Hou, Leon Zhou, Hong-Ye Hu, Yi-Zhuang You, Xiao-Liang Qi
- Abstract summary: We investigate the performance of large language models on repetitive deterministic prediction tasks. Our experiments reveal a sharp double exponential drop beyond a characteristic length scale. This indicates that the models fail to execute each operation independently.
- Score: 0.9338697277815541
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We investigate the performance of large language models on repetitive deterministic prediction tasks and study how the sequence accuracy rate scales with output length. Each such task involves repeating the same operation n times. Examples include letter replacement in strings following a given rule, integer addition, and multiplication of string operators in many body quantum mechanics. If the model performs the task through a simple repetition algorithm, the success rate should decay exponentially with sequence length. In contrast, our experiments on leading large language models reveal a sharp double exponential drop beyond a characteristic length scale, forming an accuracy cliff that marks the transition from reliable to unstable generation. This indicates that the models fail to execute each operation independently. To explain this phenomenon, we propose a statistical physics inspired model that captures the competition between external conditioning from the prompt and internal interference among generated tokens. The model quantitatively reproduces the observed crossover and provides an interpretable link between attention induced interference and sequence level failure. Fitting the model to empirical results across multiple models and tasks yields effective parameters that characterize the intrinsic error rate and error accumulation factor for each model task pair, offering a principled framework for understanding the limits of deterministic accuracy in large language models.
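The contrast the abstract draws can be illustrated with a toy calculation (parameter values `eps` and `alpha` are hypothetical, chosen only for illustration): if each of the n operations fails independently with rate eps, sequence accuracy decays exponentially as (1 - eps)^n; if interference makes the per-step error rate grow with position, accuracy collapses much faster past a characteristic length, producing the "accuracy cliff" described above.

```python
import math

def seq_accuracy_independent(n, eps=0.001):
    """Independent per-step errors: accuracy decays exponentially as (1 - eps)^n."""
    return (1.0 - eps) ** n

def seq_accuracy_interfering(n, eps=0.001, alpha=0.05):
    """Toy interference model (illustrative, not the paper's fitted model):
    the per-step error rate grows as eps * exp(alpha * t), so sequence
    accuracy falls off roughly double-exponentially past a characteristic
    length scale of order (1/alpha) * log(1/eps)."""
    p = 1.0
    for t in range(n):
        # Clamp so the per-step success probability never goes negative.
        p *= max(0.0, 1.0 - eps * math.exp(alpha * t))
    return p

for n in (10, 50, 100, 200):
    print(n, seq_accuracy_independent(n), seq_accuracy_interfering(n))
```

With these toy parameters the two curves are nearly indistinguishable at short lengths and diverge sharply at longer ones, which is the qualitative crossover the paper models; the actual statistical-physics model and fitted parameters are in the paper itself.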
Related papers
- TRACE: Scalable Amortized Causal Discovery from Single Sequences via Autoregressive Density Estimation [14.409508347156397]
We study causal discovery from a single observed sequence of discrete events generated by a process. We introduce TRACE, a scalable framework that repurposes autoregressive models as pretrained density estimators for conditional mutual information estimation.
arXiv Detail & Related papers (2026-02-01T10:18:27Z) - NumPert: Numerical Perturbations to Probe Language Models for Veracity Prediction [7.856998585396422]
We present a systematic evaluation of state-of-the-art models for veracity prediction on numerical claims and evidence pairs. Results indicate that even leading proprietary systems experience accuracy drops of up to 62% under certain perturbations. These findings highlight critical limitations in numerical fact-checking and suggest that robustness remains an open challenge for current language models.
arXiv Detail & Related papers (2025-11-13T05:09:52Z) - Code-driven Number Sequence Calculation: Enhancing the inductive Reasoning Abilities of Large Language Models [44.17697803306198]
We introduce CodeSeq, a synthetic post-training dataset built from number sequences. Our pipeline generates supervised fine-tuning data by reflecting on failed test cases and incorporating iterative corrections. Experimental results show that models trained with CodeSeq improve on various reasoning tasks and preserve the models' OOD performance.
arXiv Detail & Related papers (2025-10-16T12:29:40Z) - Why Can't Transformers Learn Multiplication? Reverse-Engineering Reveals Long-Range Dependency Pitfalls [54.57326125204404]
Language models are increasingly capable, yet still fail at a seemingly simple task of multi-digit multiplication. We study why, by reverse-engineering a model that successfully learns multiplication via 'implicit chain-of-thought'.
arXiv Detail & Related papers (2025-09-30T19:03:26Z) - Beyond Memorization: Extending Reasoning Depth with Recurrence, Memory and Test-Time Compute Scaling [60.63703438729223]
We show how different architectures and training methods affect model multi-step reasoning capabilities. We confirm that increasing model depth plays a crucial role for sequential computations.
arXiv Detail & Related papers (2025-08-22T18:57:08Z) - Spatial Reasoning with Denoising Models [49.83744014336816]
We introduce a framework to perform reasoning over sets of continuous variables via denoising generative models. We show, for the first time, that the order of generation can successfully be predicted by the denoising network itself. Using these findings, we can increase the accuracy of specific reasoning tasks from 1% to >50%.
arXiv Detail & Related papers (2025-02-28T14:08:30Z) - Uncertainty Determines the Adequacy of the Mode and the Tractability of Decoding in Sequence-to-Sequence Models [11.258630552727432]
We analyze how ambiguity (also known as intrinsic uncertainty) shapes the distribution learned by neural sequence models.
We show that well-known pathologies such as a high number of beam search errors, the inadequacy of the mode, and the drop in system performance with large beam sizes apply to tasks with a high level of ambiguity.
arXiv Detail & Related papers (2022-04-01T14:30:19Z) - Precise High-Dimensional Asymptotics for Quantifying Heterogeneous Transfers [66.66228496844191]
We ask when combining the samples from two related tasks performs better than learning with the target task alone. This question is motivated by an empirical phenomenon known as negative transfer, which has been observed in practice. We illustrate these results in a random-effects model to mathematically prove a phase transition from positive to negative transfer as the number of source task samples increases.
arXiv Detail & Related papers (2020-10-22T14:14:20Z) - Understanding Neural Abstractive Summarization Models via Uncertainty [54.37665950633147]
Seq2seq abstractive summarization models generate text in a free-form manner.
We study the entropy, or uncertainty, of the model's token-level predictions.
We show that uncertainty is a useful perspective for analyzing summarization and text generation models more broadly.
arXiv Detail & Related papers (2020-10-15T16:57:27Z) - On the Discrepancy between Density Estimation and Sequence Generation [92.70116082182076]
Log-likelihood is highly correlated with BLEU when we consider models within the same family.
We observe no correlation between rankings of models across different families.
arXiv Detail & Related papers (2020-02-17T20:13:35Z) - Consistency of a Recurrent Language Model With Respect to Incomplete Decoding [67.54760086239514]
We study the issue of receiving infinite-length sequences from a recurrent language model.
We propose two remedies which address inconsistency: consistent variants of top-k and nucleus sampling, and a self-terminating recurrent language model.
arXiv Detail & Related papers (2020-02-06T19:56:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.