Beyond Scaling: Measuring and Predicting the Upper Bound of Knowledge Retention in Language Model Pre-Training
- URL: http://arxiv.org/abs/2502.04066v5
- Date: Sat, 11 Oct 2025 12:54:16 GMT
- Title: Beyond Scaling: Measuring and Predicting the Upper Bound of Knowledge Retention in Language Model Pre-Training
- Authors: Changhao Jiang, Ming Zhang, Yifei Cao, Junjie Ye, Xiaoran Fan, Shihan Dou, Zhiheng Xi, Jiajun Sun, Yi Dong, Yujiong Shen, Jingqi Tong, Baoyu Fan, Qi Zhang, Tao Gui, Xuanjing Huang
- Abstract summary: We model knowledge retention, the capacity of a pre-trained language model to memorize factual information from its corpus, and introduce a principled method to estimate it prior to training. We propose Size-dependent Mutual Information (SMI), an information-theoretic predictor that integrates knowledge frequency, knowledge specificity, and model size to forecast closed-book question answering (QA) accuracy.
- Score: 68.94373533768501
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The GPT-4 technical report suggests that downstream performance can be predicted from pre-training signals, but offers little methodological detail on how to quantify this. This work addresses the gap by modeling knowledge retention, the capacity of a pre-trained language model to memorize factual information from its corpus, and introduces a principled method to estimate it prior to training. We propose Size-dependent Mutual Information (SMI), an information-theoretic predictor that integrates knowledge frequency, knowledge specificity, and model size to forecast closed-book question answering (QA) accuracy. SMI is validated through large-scale document retrieval over the disclosed pre-training corpora of 21 public and 3 custom models, combined with a robust multi-template QA evaluation. Experiments show that SMI significantly outperforms repetition-based baselines and achieves $R^2 > 0.7$ in predicting QA accuracy for models above 1B parameters, without additional training. The analysis further reveals diminishing returns from scaling data and model size and provides evidence for an intrinsic upper bound on knowledge retention achievable by pre-training alone, motivating retrieval and other augmentation strategies.
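The abstract does not spell out SMI's exact functional form, so the sketch below is only a hypothetical illustration of the ingredients it names: a pointwise-mutual-information-style signal built from corpus co-occurrence counts (knowledge frequency and specificity), scaled by model size, and mapped to closed-book QA accuracy with a fitted curve. The function names, the log-size scaling, and the logistic mapping are assumptions, not the paper's definitions.

```python
import math

def smi_score(co_occurrence: int, entity_count: int, answer_count: int,
              num_documents: int, model_params: float) -> float:
    """Hypothetical SMI-style signal: PMI between a question entity and its answer,
    estimated from document-level counts over the pre-training corpus, scaled by
    (log) model size.

    co_occurrence  -- documents mentioning both entity and answer (knowledge frequency)
    entity_count   -- documents mentioning the entity
    answer_count   -- documents mentioning the answer (rarer answers are more specific)
    num_documents  -- total documents in the corpus
    model_params   -- model size in parameters
    """
    eps = 1e-12
    p_joint = co_occurrence / num_documents
    p_entity = entity_count / num_documents
    p_answer = answer_count / num_documents
    pmi = math.log((p_joint + eps) / (p_entity * p_answer + eps))
    return pmi * math.log(model_params)

def predicted_qa_accuracy(smi: float, a: float = 1.0, b: float = 0.0) -> float:
    """Map the SMI signal to a [0, 1] accuracy with a logistic curve whose
    coefficients (a, b) would be fit against observed closed-book QA accuracy."""
    return 1.0 / (1.0 + math.exp(-(a * smi + b)))
```

Under this toy reading, diminishing returns arise naturally: once the fitted curve saturates, further increases in frequency or model size add little accuracy, which is consistent with the upper bound the paper argues for.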
Related papers
- RADAR: Revealing Asymmetric Development of Abilities in MLLM Pre-training [59.493415006017635]
Pre-trained Multi-modal Large Language Models (MLLMs) provide a knowledge-rich foundation for post-training. Current evaluation relies on testing after supervised fine-tuning, which introduces laborious additional training and autoregressive decoding costs. We propose RADAR, an efficient ability-centric evaluation framework for Revealing Asymmetric Development of Abilities in MLLM pRe-training.
arXiv Detail & Related papers (2026-02-13T12:56:31Z) - Revisiting the Scaling Properties of Downstream Metrics in Large Language Model Training [11.179110411255708]
We propose a direct framework to model the scaling of benchmark performance from the training budget. Our results show that the direct approach extrapolates better than the previously proposed two-stage procedure. We release the complete set of pretraining losses and downstream evaluation results.
arXiv Detail & Related papers (2025-12-09T18:33:48Z) - Value-Based Deep RL Scales Predictably [100.21834069400023]
We show that value-based off-policy RL methods are predictable despite community lore regarding their pathological behavior. We validate our approach using three algorithms: SAC, BRO, and PQL on DeepMind Control, OpenAI gym, and IsaacGym.
arXiv Detail & Related papers (2025-02-06T18:59:47Z) - What Do Learning Dynamics Reveal About Generalization in LLM Reasoning? [83.83230167222852]
We find that a model's generalization behavior can be effectively characterized by a training metric we call pre-memorization train accuracy.
By connecting a model's learning behavior to its generalization, pre-memorization train accuracy can guide targeted improvements to training strategies.
arXiv Detail & Related papers (2024-11-12T09:52:40Z) - Context is Key: A Benchmark for Forecasting with Essential Textual Information [87.3175915185287]
"Context is Key" (CiK) is a forecasting benchmark that pairs numerical data with diverse types of carefully crafted textual context.
We evaluate a range of approaches, including statistical models, time series foundation models, and LLM-based forecasters.
We propose a simple yet effective LLM prompting method that outperforms all other tested methods on our benchmark.
arXiv Detail & Related papers (2024-10-24T17:56:08Z) - A Statistical Framework for Data-dependent Retrieval-Augmented Models [46.781026675083254]
Modern ML systems increasingly augment input instances with additional relevant information to enhance final prediction.
We study such models with two components: 1) a retriever to identify the relevant information out of a large corpus via a data-dependent metric; and 2) a predictor that consumes the input instances along with the retrieved information to make the final predictions.
arXiv Detail & Related papers (2024-08-27T20:51:06Z) - Reasoning and Tools for Human-Level Forecasting [0.4261908132550109]
We present Reasoning and Tools for Forecasting (RTF), a framework of reasoning-and-acting (ReAct) agents that can retrieve updated information and run numerical simulations with equipped tools.
We evaluate our model with questions from competitive forecasting platforms and demonstrate that our method is competitive with and can outperform human predictions.
arXiv Detail & Related papers (2024-08-21T23:42:06Z) - In2Core: Leveraging Influence Functions for Coreset Selection in Instruction Finetuning of Large Language Models [37.45103473809928]
We propose the In2Core algorithm, which selects a coreset by analyzing the correlation between training and evaluation samples with a trained model.
By applying our algorithm to instruction fine-tuning data of LLMs, we can achieve similar performance with just 50% of the training data.
arXiv Detail & Related papers (2024-08-07T05:48:05Z) - Few-Shot Load Forecasting Under Data Scarcity in Smart Grids: A Meta-Learning Approach [0.18641315013048293]
This paper proposes adapting an established model-agnostic meta-learning algorithm for short-term load forecasting.
The proposed method can rapidly adapt and generalize within any unknown load time series of arbitrary length.
The proposed model is evaluated using a dataset of historical load consumption data from real-world consumers.
arXiv Detail & Related papers (2024-06-09T18:59:08Z) - Causal Estimation of Memorisation Profiles [58.20086589761273]
Understanding memorisation in language models has practical and societal implications.
Memorisation is the causal effect of training with an instance on the model's ability to predict that instance.
This paper proposes a new, principled, and efficient method to estimate memorisation based on the difference-in-differences design from econometrics.
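For orientation, the difference-in-differences design mentioned here is the standard econometric estimator sketched below. The treated/control, pre/post framing is generic, and the choice of per-instance metric (e.g., log-likelihood measured at different training checkpoints) is an assumption for illustration rather than the paper's exact estimand.

```python
from statistics import mean

def diff_in_differences(treated_pre, treated_post, control_pre, control_post):
    """Canonical difference-in-differences estimate.

    Each argument is a list of per-instance scores (e.g., log-likelihoods).
    'Treated' instances are those the model trains on between the two measurements;
    'control' instances are comparable ones it has not yet seen. The estimate is the
    treated group's change minus the control group's change, which removes trends
    that affect both groups (such as general training progress)."""
    return (mean(treated_post) - mean(treated_pre)) - (mean(control_post) - mean(control_pre))
```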
arXiv Detail & Related papers (2024-06-06T17:59:09Z) - Embedding-Informed Adaptive Retrieval-Augmented Generation of Large Language Models [37.02290559379761]
Retrieval-augmented large language models (LLMs) have been remarkably competent in various NLP tasks. Motivated by this, Adaptive Retrieval-Augmented Generation (ARAG) studies retrieving only when the knowledge asked by the query is absent in the LLM.
arXiv Detail & Related papers (2024-04-04T15:21:22Z) - The Fine Line: Navigating Large Language Model Pretraining with Down-streaming Capability Analysis [27.310894780313618]
This paper undertakes a comprehensive comparison of model capabilities at various pretraining intermediate checkpoints.
We confirm that specific downstream metrics exhibit similar training dynamics across models of different sizes.
In addition to our core findings, we've reproduced Amber and OpenLLaMA, releasing their intermediate checkpoints.
arXiv Detail & Related papers (2024-04-01T16:00:01Z) - An Emulator for Fine-Tuning Large Language Models using Small Language Models [91.02498576056057]
We introduce emulated fine-tuning (EFT), a principled and practical method for sampling from a distribution that approximates the result of pre-training and fine-tuning at different scales.
We show that EFT enables test-time adjustment of competing behavioral traits like helpfulness and harmlessness without additional training.
Finally, a special case of emulated fine-tuning, which we call LM up-scaling, avoids resource-intensive fine-tuning of large pre-trained models by ensembling them with small fine-tuned models.
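As a rough sketch of the log-probability arithmetic this kind of emulation typically relies on (details may differ from the paper), LM up-scaling combines a large pre-trained model's next-token distribution with the behavioural delta between a small fine-tuned model and its small base model:

```python
import numpy as np

def upscaled_logprobs(logp_base_large: np.ndarray,
                      logp_ft_small: np.ndarray,
                      logp_base_small: np.ndarray) -> np.ndarray:
    """Combine vocabulary-sized log-probability vectors from three models:
    the large base model supplies pre-trained knowledge, while the difference
    between the small fine-tuned and small base models supplies the fine-tuning
    behaviour. The result is renormalized into a valid distribution for sampling."""
    combined = logp_base_large + (logp_ft_small - logp_base_small)
    return combined - np.logaddexp.reduce(combined)
```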
arXiv Detail & Related papers (2023-10-19T17:57:16Z) - How Many Pretraining Tasks Are Needed for In-Context Learning of Linear Regression? [92.90857135952231]
Transformers pretrained on diverse tasks exhibit remarkable in-context learning (ICL) capabilities.
We study ICL in one of its simplest setups: pretraining a linearly parameterized single-layer linear attention model for linear regression.
arXiv Detail & Related papers (2023-10-12T15:01:43Z) - Skill-it! A Data-Driven Skills Framework for Understanding and Training Language Models [29.17711426767209]
We study how to best select data that leads to good downstream model performance across tasks.
We develop a new framework based on a simple hypothesis: just as humans acquire interdependent skills in a deliberate order, language models also follow a natural order when learning a set of skills from their training data.
arXiv Detail & Related papers (2023-07-26T18:01:49Z) - Architecture, Dataset and Model-Scale Agnostic Data-free Meta-Learning [117.48444197402858]
We propose ePisode cUrriculum inveRsion (ECI) during data-free meta training and invErsion calibRation following inner loop (ICFIL) during meta testing. ECI adaptively increases the difficulty level of pseudo episodes according to the real-time feedback of the meta model. We formulate the optimization process of meta training with ECI as an adversarial form in an end-to-end manner.
arXiv Detail & Related papers (2023-03-20T15:10:41Z) - Large Language Models with Controllable Working Memory [64.71038763708161]
Large language models (LLMs) have led to a series of breakthroughs in natural language processing (NLP).
What further sets these models apart is the massive amounts of world knowledge they internalize during pretraining.
How the model's world knowledge interacts with the factual information presented in the context remains underexplored.
arXiv Detail & Related papers (2022-11-09T18:58:29Z) - Where Should I Spend My FLOPS? Efficiency Evaluations of Visual Pre-training Methods [29.141145775835106]
Given a fixed FLOP budget, what are the best datasets, models, and (self-supervised) training methods for obtaining high accuracy on representative visual tasks?
We examine five large-scale datasets (JFT-300M, ALIGN, ImageNet-1K, ImageNet-21K, and COCO) and six pre-training methods (CLIP, DINO, SimCLR, BYOL, Masked Autoencoding, and supervised).
Our results call into question the commonly-held assumption that self-supervised methods inherently scale to large, uncurated data.
arXiv Detail & Related papers (2022-09-30T17:04:55Z) - Knowledge Distillation as Efficient Pre-training: Faster Convergence, Higher Data-efficiency, and Better Transferability [53.27240222619834]
Knowledge Distillation as Efficient Pre-training aims to efficiently transfer the learned feature representation from pre-trained models to new student models for future downstream tasks.
Our method performs comparably with supervised pre-training counterparts in 3 downstream tasks and 9 downstream datasets, requiring 10x less data and 5x less pre-training time.
arXiv Detail & Related papers (2022-03-10T06:23:41Z) - Evaluating natural language processing models with generalization metrics that do not need access to any training or testing data [66.11139091362078]
We provide the first model selection results on large pretrained Transformers from Huggingface using generalization metrics.
Despite their niche status, we find that metrics derived from the heavy-tail (HT) perspective are particularly useful in NLP tasks.
arXiv Detail & Related papers (2022-02-06T20:07:35Z) - Fast Uncertainty Quantification for Deep Object Pose Estimation [91.09217713805337]
Deep learning-based object pose estimators are often unreliable and overconfident.
In this work, we propose a simple, efficient, and plug-and-play UQ method for 6-DoF object pose estimation.
arXiv Detail & Related papers (2020-11-16T06:51:55Z) - REALM: Retrieval-Augmented Language Model Pre-Training [37.3178586179607]
We augment language model pre-training with a latent knowledge retriever, which allows the model to retrieve and attend over documents from a large corpus such as Wikipedia.
For the first time, we show how to pre-train such a knowledge retriever in an unsupervised manner.
We demonstrate the effectiveness of Retrieval-Augmented Language Model pre-training (REALM) by fine-tuning on the challenging task of Open-domain Question Answering (Open-QA).
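The retrieve-then-predict factorization behind REALM can be summarized as follows (notation lightly simplified): the model marginalizes the prediction over the top-k retrieved documents,

$$p(y \mid x) = \sum_{z \in \mathrm{top\text{-}k}(x)} p(y \mid x, z)\, p(z \mid x), \qquad p(z \mid x) \propto \exp\big(\mathrm{Embed}_{\mathrm{input}}(x)^{\top}\, \mathrm{Embed}_{\mathrm{doc}}(z)\big),$$

so the retriever's document scores and the knowledge-augmented predictor are trained jointly by maximizing this marginal likelihood.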
arXiv Detail & Related papers (2020-02-10T18:40:59Z)