Capability Ceilings in Autoregressive Language Models: Empirical Evidence from Knowledge-Intensive Tasks
- URL: http://arxiv.org/abs/2510.21866v1
- Date: Thu, 23 Oct 2025 11:09:31 GMT
- Title: Capability Ceilings in Autoregressive Language Models: Empirical Evidence from Knowledge-Intensive Tasks
- Authors: Javier MarĂn,
- Abstract summary: We document capability ceilings in decoder-only autoregressive language models across knowledge-intensive tasks.<n>We quantify capability-specific scaling failures in OPT and Pythia model families to inform resource allocation decisions.
- Score: 0.2538209532048866
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We document empirical capability ceilings in decoder-only autoregressive language models across knowledge-intensive tasks. Systematic evaluation of OPT and Pythia model families (70M-30B parameters, spanning 240 times scaling) reveals that knowledge retrieval tasks show negligible accuracy improvement despite smooth loss reduction. On MMLU mathematics benchmarks, accuracy remains flat at 19-20% (below 25% random chance) across all scales while cross-entropy loss decreases by 31%. In contrast, procedural tasks like arithmetic show conventional scaling where both metrics improve together. Attention intervention experiments reveal high sensitivity to perturbation: swapping attention patterns between models causes catastrophic performance collapse (complete accuracy loss) rather than graceful degradation. These measurements have immediate engineering implications: for knowledge-intensive applications using OPT and Pythia architectures, parameter scaling beyond 1-2B offers minimal accuracy gains despite continued loss improvement. Our findings quantify capability-specific scaling failures in these model families to inform resource allocation decisions. Whether these patterns reflect fundamental constraints of decoder-only architectures or implementation-specific limitations remains an open question requiring investigation across diverse architectural approaches.
Related papers
- Analyzing the Effects of Supervised Fine-Tuning on Model Knowledge from Token and Parameter Levels [72.3670919950349]
Large language models (LLMs) acquire substantial world knowledge during pre-training.<n>Post-training techniques such as supervised fine-tuning (SFT) shape this knowledge change behavior.<n>We evaluate closed-book question answering (CBQA) performance across five LLMs from the LLaMA-2 and LLaMA-3 families.
arXiv Detail & Related papers (2025-09-20T09:40:32Z) - Model Hemorrhage and the Robustness Limits of Large Language Models [119.46442117681147]
Large language models (LLMs) demonstrate strong performance across natural language processing tasks, yet undergo significant performance degradation when modified for deployment.<n>We define this phenomenon as model hemorrhage - performance decline caused by parameter alterations and architectural changes.
arXiv Detail & Related papers (2025-03-31T10:16:03Z) - Uncertainty Weighted Gradients for Model Calibration [22.39558434131574]
Deep networks often produce over-confident or under-confident predictions, leading to miscalibration.<n>We propose a unified loss framework for focal loss and its variants, where we mainly attribute their superiority in model calibration to the loss weighting factor.<n>Our method achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2025-03-26T04:16:05Z) - Beyond Scaling: Measuring and Predicting the Upper Bound of Knowledge Retention in Language Model Pre-Training [68.94373533768501]
We model knowledge retention, the capacity of a pre-trained language model to memorize factual information from its corpus, and introduce a principled method to estimate it prior to training.<n>We propose Size-dependent Mutual Information (SMI), an information-theoretic predictor that integrates knowledge frequency, knowledge specificity, and model size to forecast closed-book question answering (QA) accuracy.
arXiv Detail & Related papers (2025-02-06T13:23:53Z) - R-Tuning: Instructing Large Language Models to Say `I Don't Know' [66.11375475253007]
Large language models (LLMs) have revolutionized numerous domains with their impressive performance but still face their challenges.
Previous instruction tuning methods force the model to complete a sentence no matter whether the model knows the knowledge or not.
We present a new approach called Refusal-Aware Instruction Tuning (R-Tuning)
Experimental results demonstrate R-Tuning effectively improves a model's ability to answer known questions and refrain from answering unknown questions.
arXiv Detail & Related papers (2023-11-16T08:45:44Z) - Predicting Emergent Abilities with Infinite Resolution Evaluation [85.89911520190711]
We introduce PassUntil, an evaluation strategy with theoretically infinite resolution, through massive sampling in the decoding phase.
We predict the performance of the 2.4B model on code generation with merely 0.05% deviation before training starts.
We identify a kind of accelerated emergence whose scaling curve cannot be fitted by standard scaling law function.
arXiv Detail & Related papers (2023-10-05T02:35:00Z) - Lightweight, Uncertainty-Aware Conformalized Visual Odometry [2.429910016019183]
Data-driven visual odometry (VO) is a critical subroutine for autonomous edge robotics.
Emerging edge robotics devices like insect-scale drones and surgical robots lack a computationally efficient framework to estimate VO's predictive uncertainties.
This paper presents a novel, lightweight, and statistically robust framework that leverages conformal inference (CI) to extract VO's uncertainty bands.
arXiv Detail & Related papers (2023-03-03T20:37:55Z) - DeepFT: Fault-Tolerant Edge Computing using a Self-Supervised Deep
Surrogate Model [12.335763358698564]
We propose DeepFT to proactively avoid system overloads and their adverse effects.
DeepFT uses a deep surrogate model to accurately predict and diagnose faults in the system.
It offers a highly scalable solution as the model size scales by only 3 and 1 percent per unit increase in the number of active tasks and hosts.
arXiv Detail & Related papers (2022-12-02T16:51:58Z) - Do Not Forget to Attend to Uncertainty while Mitigating Catastrophic
Forgetting [29.196246255389664]
One of the major limitations of deep learning models is that they face catastrophic forgetting in an incremental learning scenario.
We consider a Bayesian formulation to obtain the data and model uncertainties.
We also incorporate self-attention framework to address the incremental learning problem.
arXiv Detail & Related papers (2021-02-03T06:54:52Z) - A comparison of Monte Carlo dropout and bootstrap aggregation on the
performance and uncertainty estimation in radiation therapy dose prediction
with deep learning neural networks [0.46180371154032895]
We propose to use Monte Carlo dropout (MCDO) and the bootstrap aggregation (bagging) technique on deep learning models to produce uncertainty estimations for radiation therapy dose prediction.
Performance-wise, bagging provides statistically significant reduced loss value and errors in most of the metrics investigated.
arXiv Detail & Related papers (2020-11-01T00:24:43Z) - Accurate and Robust Feature Importance Estimation under Distribution
Shifts [49.58991359544005]
PRoFILE is a novel feature importance estimation method.
We show significant improvements over state-of-the-art approaches, both in terms of fidelity and robustness.
arXiv Detail & Related papers (2020-09-30T05:29:01Z) - Influence Functions in Deep Learning Are Fragile [52.31375893260445]
influence functions approximate the effect of samples in test-time predictions.
influence estimates are fairly accurate for shallow networks.
Hessian regularization is important to get highquality influence estimates.
arXiv Detail & Related papers (2020-06-25T18:25:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.