Related papers: Reusing Pre-Training Data at Test Time is a Compute Multiplier

Reusing Pre-Training Data at Test Time is a Compute Multiplier

URL: http://arxiv.org/abs/2511.04234v1
Date: Thu, 06 Nov 2025 10:10:43 GMT
Title: Reusing Pre-Training Data at Test Time is a Compute Multiplier
Authors: Alex Fang, Thomas Voice, Ruoming Pang, Ludwig Schmidt, Tom Gunter,
Abstract summary: We quantify how much dataset value was left behind by the process of pre-training.<n>We demonstrate that pre-training then retrieving from standard and largely open-sourced datasets results in significant accuracy gains.<n>These results can be further improved by leveraging additional compute at test time to parse the retrieved context.
Score: 35.81885343245217
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models learn from their vast pre-training corpora, gaining the ability to solve an ever increasing variety of tasks; yet although researchers work to improve these datasets, there is little effort to understand how efficient the pre-training apparatus is at extracting ideas and knowledge from the data. In this work, we use retrieval augmented generation along with test-time compute as a way to quantify how much dataset value was left behind by the process of pre-training, and how this changes across scale. We demonstrate that pre-training then retrieving from standard and largely open-sourced datasets results in significant accuracy gains in MMLU, Math-500, and SimpleQA, which persist through decontamination. For MMLU we observe that retrieval acts as a ~5x compute multiplier versus pre-training alone. We show that these results can be further improved by leveraging additional compute at test time to parse the retrieved context, demonstrating a 10 percentage point improvement on MMLU for the public LLaMA 3.1 8B model. Overall, our results suggest that today's pre-training methods do not make full use of the information in existing pre-training datasets, leaving significant room for progress.

Related papers

What Do Learning Dynamics Reveal About Generalization in LLM Reasoning? [83.83230167222852]
We find that a model's generalization behavior can be effectively characterized by a training metric we call pre-memorization train accuracy. By connecting a model's learning behavior to its generalization, pre-memorization train accuracy can guide targeted improvements to training strategies.
arXiv Detail & Related papers (2024-11-12T09:52:40Z)
Adaptive Memory Replay for Continual Learning [29.333341368722653]
Updating Foundation Models as new data becomes available can lead to catastrophic forgetting' We introduce a framework of adaptive memory replay for continual learning, where sampling of past data is phrased as a multi-armed bandit problem. We demonstrate the effectiveness of our approach, which maintains high performance while reducing forgetting by up to 10% at no training efficiency cost.
arXiv Detail & Related papers (2024-04-18T22:01:56Z)
Efficient Grammatical Error Correction Via Multi-Task Training and Optimized Training Schedule [55.08778142798106]
We propose auxiliary tasks that exploit the alignment between the original and corrected sentences. We formulate each task as a sequence-to-sequence problem and perform multi-task training. We find that the order of datasets used for training and even individual instances within a dataset may have important effects on the final performance.
arXiv Detail & Related papers (2023-11-20T14:50:12Z)
When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale [12.94829977468838]
Large volumes of text data have contributed significantly to the development of large language models. To date, efforts to prune datasets down to a higher quality subset have relied on hand-crafteds encoded as rule-based filters. We take a wider view and explore scalable estimates of data quality that can be used to measure the quality of pretraining data.
arXiv Detail & Related papers (2023-09-08T19:34:05Z)
D4: Improving LLM Pretraining via Document De-Duplication and Diversification [38.84592304799403]
We show that careful data selection via pre-trained model embeddings can speed up training. We also show that repeating data intelligently consistently outperforms baseline training.
arXiv Detail & Related papers (2023-08-23T17:58:14Z)
A Memory Transformer Network for Incremental Learning [64.0410375349852]
We study class-incremental learning, a training setup in which new classes of data are observed over time for the model to learn from. Despite the straightforward problem formulation, the naive application of classification models to class-incremental learning results in the "catastrophic forgetting" of previously seen classes. One of the most successful existing methods has been the use of a memory of exemplars, which overcomes the issue of catastrophic forgetting by saving a subset of past data into a memory bank and utilizing it to prevent forgetting when training future tasks.
arXiv Detail & Related papers (2022-10-10T08:27:28Z)
Where Should I Spend My FLOPS? Efficiency Evaluations of Visual Pre-training Methods [29.141145775835106]
Given a fixed FLOP budget, what are the best datasets, models, and (self-supervised) training methods for obtaining high accuracy on representative visual tasks? We examine five large-scale datasets (JFT-300M, ALIGN, ImageNet-1K, ImageNet-21K, and COCO) and six pre-training methods (CLIP, DINO, SimCLR, BYOL, Masked Autoencoding, and supervised) Our results call into question the commonly-held assumption that self-supervised methods inherently scale to large, uncurated data.
arXiv Detail & Related papers (2022-09-30T17:04:55Z)
Knowledge Distillation as Efficient Pre-training: Faster Convergence, Higher Data-efficiency, and Better Transferability [53.27240222619834]
Knowledge Distillation as Efficient Pre-training aims to efficiently transfer the learned feature representation from pre-trained models to new student models for future downstream tasks. Our method performs comparably with supervised pre-training counterparts in 3 downstream tasks and 9 downstream datasets requiring 10x less data and 5x less pre-training time.
arXiv Detail & Related papers (2022-03-10T06:23:41Z)
Domain-matched Pre-training Tasks for Dense Retrieval [68.07140087626637]
Pre-training on larger datasets with ever increasing model size is now a proven recipe for increased performance across almost all NLP tasks. We show that, with the right pre-training setup, this barrier can be overcome. We demonstrate this by pre-training large bi-encoder models on 1) a recently released set of 65 million synthetically generated questions, and 2) 200 million post-comment pairs from a preexisting dataset of Reddit conversations.
arXiv Detail & Related papers (2021-07-28T19:13:00Z)
Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training. We experimentally verify that the new dataset can significantly improve the ability of the learned FER model. To tackle this, we propose to apply a dataset distillation strategy to compress the created dataset into several informative class-wise images.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)

This list is automatically generated from the titles and abstracts of the papers in this site.