Related papers: Capability Ceilings in Autoregressive Language Models: Empirical Evidence from Knowledge-Intensive Tasks

URL: http://arxiv.org/abs/2510.21866v1
Date: Thu, 23 Oct 2025 11:09:31 GMT
Title: Capability Ceilings in Autoregressive Language Models: Empirical Evidence from Knowledge-Intensive Tasks
Authors: Javier Marín
Abstract summary: We document capability ceilings in decoder-only autoregressive language models across knowledge-intensive tasks. We quantify capability-specific scaling failures in OPT and Pythia model families to inform resource allocation decisions.
Score: 0.2538209532048866
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We document empirical capability ceilings in decoder-only autoregressive language models across knowledge-intensive tasks. Systematic evaluation of OPT and Pythia model families (70M-30B parameters, spanning 240 times scaling) reveals that knowledge retrieval tasks show negligible accuracy improvement despite smooth loss reduction. On MMLU mathematics benchmarks, accuracy remains flat at 19-20% (below 25% random chance) across all scales while cross-entropy loss decreases by 31%. In contrast, procedural tasks like arithmetic show conventional scaling where both metrics improve together. Attention intervention experiments reveal high sensitivity to perturbation: swapping attention patterns between models causes catastrophic performance collapse (complete accuracy loss) rather than graceful degradation. These measurements have immediate engineering implications: for knowledge-intensive applications using OPT and Pythia architectures, parameter scaling beyond 1-2B offers minimal accuracy gains despite continued loss improvement. Our findings quantify capability-specific scaling failures in these model families to inform resource allocation decisions. Whether these patterns reflect fundamental constraints of decoder-only architectures or implementation-specific limitations remains an open question requiring investigation across diverse architectural approaches.
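
The divergence described in the abstract, where accuracy stays flat while cross-entropy loss falls, can be illustrated with a minimal Python sketch. The probabilities below are made-up toy values for two hypothetical models on five four-choice questions, not data from the paper; the function names are also illustrative. The larger model puts more probability on the correct choice for every question it gets wrong, which lowers the loss, but its top-ranked choice is still wrong, so accuracy does not move.

```python
import math


def accuracy(probs, labels):
    """Fraction of questions whose highest-probability choice is the correct one."""
    hits = 0
    for p, y in zip(probs, labels):
        predicted = max(range(len(p)), key=p.__getitem__)  # argmax over choices
        hits += predicted == y
    return hits / len(labels)


def mean_cross_entropy(probs, labels):
    """Average negative log-probability (in nats) assigned to the correct choice."""
    return sum(-math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)


# Hypothetical toy data (NOT from the paper): index of the correct choice per question.
labels = [0, 1, 2, 3, 0]

# Hypothetical "small" model: confident and wrong on four of five questions.
small_model = [
    [0.70, 0.10, 0.10, 0.10],
    [0.85, 0.05, 0.05, 0.05],
    [0.85, 0.05, 0.05, 0.05],
    [0.85, 0.05, 0.05, 0.05],
    [0.05, 0.85, 0.05, 0.05],
]

# Hypothetical "large" model: more mass on the correct choice, same top-ranked choice.
large_model = [
    [0.70, 0.10, 0.10, 0.10],
    [0.40, 0.20, 0.20, 0.20],
    [0.40, 0.20, 0.20, 0.20],
    [0.40, 0.20, 0.20, 0.20],
    [0.20, 0.40, 0.20, 0.20],
]

random_chance = 1 / 4  # uniform guessing over four choices

for name, probs in [("small", small_model), ("large", large_model)]:
    print(name, accuracy(probs, labels), round(mean_cross_entropy(probs, labels), 3))
```

In this toy setup both models score below the 25% random-chance baseline with identical accuracy, while the mean cross-entropy of the larger model is clearly lower. This only shows that the two metrics can decouple; it says nothing about why they do so in OPT or Pythia.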

Related papers

Analyzing the Effects of Supervised Fine-Tuning on Model Knowledge from Token and Parameter Levels [72.3670919950349]
Large language models (LLMs) acquire substantial world knowledge during pre-training. Post-training techniques such as supervised fine-tuning (SFT) further shape this knowledge. We evaluate closed-book question answering (CBQA) performance across five LLMs from the LLaMA-2 and LLaMA-3 families.
arXiv Detail & Related papers (2025-09-20T09:40:32Z)
Model Hemorrhage and the Robustness Limits of Large Language Models [119.46442117681147]
Large language models (LLMs) demonstrate strong performance across natural language processing tasks, yet undergo significant performance degradation when modified for deployment. We define this phenomenon as model hemorrhage - performance decline caused by parameter alterations and architectural changes.
arXiv Detail & Related papers (2025-03-31T10:16:03Z)
Uncertainty Weighted Gradients for Model Calibration [22.39558434131574]
Deep networks often produce over-confident or under-confident predictions, leading to miscalibration. We propose a unified loss framework for focal loss and its variants, where we mainly attribute their superiority in model calibration to the loss weighting factor. Our method achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2025-03-26T04:16:05Z)
Beyond Scaling: Measuring and Predicting the Upper Bound of Knowledge Retention in Language Model Pre-Training [68.94373533768501]
We model knowledge retention, the capacity of a pre-trained language model to memorize factual information from its corpus, and introduce a principled method to estimate it prior to training. We propose Size-dependent Mutual Information (SMI), an information-theoretic predictor that integrates knowledge frequency, knowledge specificity, and model size to forecast closed-book question answering (QA) accuracy.
arXiv Detail & Related papers (2025-02-06T13:23:53Z)
R-Tuning: Instructing Large Language Models to Say `I Don't Know' [66.11375475253007]
Large language models (LLMs) have revolutionized numerous domains with their impressive performance but still face challenges. Previous instruction tuning methods force the model to complete a sentence no matter whether the model knows the knowledge or not. We present a new approach called Refusal-Aware Instruction Tuning (R-Tuning). Experimental results demonstrate R-Tuning effectively improves a model's ability to answer known questions and refrain from answering unknown questions.
arXiv Detail & Related papers (2023-11-16T08:45:44Z)
Predicting Emergent Abilities with Infinite Resolution Evaluation [85.89911520190711]
We introduce PassUntil, an evaluation strategy with theoretically infinite resolution, through massive sampling in the decoding phase. We predict the performance of the 2.4B model on code generation with merely 0.05% deviation before training starts. We identify a kind of accelerated emergence whose scaling curve cannot be fitted by standard scaling law function.
arXiv Detail & Related papers (2023-10-05T02:35:00Z)
Lightweight, Uncertainty-Aware Conformalized Visual Odometry [2.429910016019183]
Data-driven visual odometry (VO) is a critical subroutine for autonomous edge robotics. Emerging edge robotics devices like insect-scale drones and surgical robots lack a computationally efficient framework to estimate VO's predictive uncertainties. This paper presents a novel, lightweight, and statistically robust framework that leverages conformal inference (CI) to extract VO's uncertainty bands.
arXiv Detail & Related papers (2023-03-03T20:37:55Z)
DeepFT: Fault-Tolerant Edge Computing using a Self-Supervised Deep Surrogate Model [12.335763358698564]
We propose DeepFT to proactively avoid system overloads and their adverse effects. DeepFT uses a deep surrogate model to accurately predict and diagnose faults in the system. It offers a highly scalable solution as the model size scales by only 3 and 1 percent per unit increase in the number of active tasks and hosts.
arXiv Detail & Related papers (2022-12-02T16:51:58Z)
Do Not Forget to Attend to Uncertainty while Mitigating Catastrophic Forgetting [29.196246255389664]
One of the major limitations of deep learning models is that they face catastrophic forgetting in an incremental learning scenario. We consider a Bayesian formulation to obtain the data and model uncertainties. We also incorporate a self-attention framework to address the incremental learning problem.
arXiv Detail & Related papers (2021-02-03T06:54:52Z)
A comparison of Monte Carlo dropout and bootstrap aggregation on the performance and uncertainty estimation in radiation therapy dose prediction with deep learning neural networks [0.46180371154032895]
We propose to use Monte Carlo dropout (MCDO) and the bootstrap aggregation (bagging) technique on deep learning models to produce uncertainty estimations for radiation therapy dose prediction. Performance-wise, bagging provides statistically significant reduced loss value and errors in most of the metrics investigated.
arXiv Detail & Related papers (2020-11-01T00:24:43Z)
Accurate and Robust Feature Importance Estimation under Distribution Shifts [49.58991359544005]
PRoFILE is a novel feature importance estimation method. We show significant improvements over state-of-the-art approaches, both in terms of fidelity and robustness.
arXiv Detail & Related papers (2020-09-30T05:29:01Z)
Influence Functions in Deep Learning Are Fragile [52.31375893260445]
Influence functions approximate the effect of samples in test-time predictions. Influence estimates are fairly accurate for shallow networks. Hessian regularization is important to get high-quality influence estimates.
arXiv Detail & Related papers (2020-06-25T18:25:59Z)

This list is automatically generated from the titles and abstracts of the papers on this site.