Learning Facts at Scale with Active Reading
- URL: http://arxiv.org/abs/2508.09494v1
- Date: Wed, 13 Aug 2025 04:54:43 GMT
- Title: Learning Facts at Scale with Active Reading
- Authors: Jessy Lin, Vincent-Pierre Berges, Xilun Chen, Wen-Tau Yih, Gargi Ghosh, Barlas Oğuz
- Abstract summary: We propose Active Reading, a framework where we train models to study a given set of material with self-generated learning strategies. First, we demonstrate models trained with Active Reading on expert domains absorb significantly more knowledge than vanilla finetuning. We show that Active Reading can be utilized at pre-training scale to build more factual models.
- Score: 33.53569181772801
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: LLMs are known to store vast amounts of knowledge in their parametric memory. However, learning and recalling facts from this memory is known to be unreliable, depending largely on the prevalence of particular facts in the training data and other factors which are poorly understood. Practitioners are lacking tools which will allow them to ensure that the models learn a given body of knowledge reliably and consistently. To this end, we propose Active Reading: a framework where we train models to study a given set of material with self-generated learning strategies. First, we demonstrate models trained with Active Reading on expert domains absorb significantly more knowledge than vanilla finetuning and other data augmentations. We train expert 8B models that achieve 66% on a Wikipedia-grounded subset of SimpleQA (+313% relative over vanilla finetuning) and 26% on FinanceBench (+160% relative over vanilla finetuning) by applying Active Reading to the source documents for each benchmark. Finally, we show that Active Reading can be utilized at pre-training scale to build more factual models. As a demonstration of this, we release Meta WikiExpert-8B, a Wikipedia-expert model trained on 1 trillion generated tokens, which outcompetes models with hundreds of billions of parameters on factual QA.
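The relative gains quoted in the abstract can be read back into approximate baselines. The sketch below is only a back-of-the-envelope check: it assumes "relative" means (score - baseline) / baseline, the function name `implied_baseline` is hypothetical, and because the abstract's figures are rounded, the results are estimates rather than numbers reported by the paper.

```python
def implied_baseline(score_pct: float, relative_gain_pct: float) -> float:
    """Back out the baseline implied by a final score and a relative gain.

    Assumes relative gain = (score - baseline) / baseline, in percent.
    """
    return score_pct / (1.0 + relative_gain_pct / 100.0)


# (final score %, relative gain %) as quoted in the abstract; both are rounded.
reported = {
    "SimpleQA (Wikipedia-grounded subset)": (66.0, 313.0),
    "FinanceBench": (26.0, 160.0),
}

for benchmark, (score, gain) in reported.items():
    baseline = implied_baseline(score, gain)
    print(f"{benchmark}: implied vanilla-finetuning baseline ~ {baseline:.1f}%")
```

Under that assumption, the implied vanilla-finetuning baselines come out at roughly 16% on the SimpleQA subset and 10% on FinanceBench.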
Related papers
- Anatomy of Unlearning: The Dual Impact of Fact Salience and Model Fine-Tuning [59.19460954480119]
We study whether forgotten knowledge originates from pretraining or supervised fine-tuning. Our experiments show that pretrained and SFT models respond differently to unlearning.
arXiv Detail & Related papers (2026-02-23T08:58:48Z) - KIF: Knowledge Identification and Fusion for Language Model Continual Learning [41.28933724210434]
We introduce a novel framework for language models, named Knowledge Identification and Fusion (KIF). KIF segregates the model into 'skill units' based on parameter dependencies, allowing for more precise control. It employs a novel group-wise knowledge identification technique to ascertain the importance distribution of skill units for a new task. As a result, KIF achieves an optimal balance between retaining prior knowledge and excelling in new tasks.
arXiv Detail & Related papers (2024-08-09T17:44:45Z) - Towards a Holistic Evaluation of LLMs on Factual Knowledge Recall [31.45796499298925]
Large language models (LLMs) have shown remarkable performance on a variety of NLP tasks.
We focus on assessing LLMs' ability to recall factual knowledge learned from pretraining.
We benchmark 31 models from 10 model families and provide a holistic assessment of their strengths and weaknesses.
arXiv Detail & Related papers (2024-04-24T19:40:01Z) - Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws [51.68385617116854]
Scaling laws describe the relationship between the size of language models and their capabilities.
We focus on factual knowledge represented as tuples, such as (USA, capital, Washington D.C.) from a Wikipedia page.
A 7B model can store 14B bits of knowledge, surpassing the English Wikipedia and textbooks combined.
arXiv Detail & Related papers (2024-04-08T11:11:31Z) - Physics of Language Models: Part 3.1, Knowledge Storage and Extraction [51.68385617116854]
Large language models (LLMs) can store a vast amount of world knowledge, often extractable via question-answering.
We find a strong correlation between the model's ability to extract knowledge and various diversity measures of the training data.
arXiv Detail & Related papers (2023-09-25T17:37:20Z) - Language models are weak learners [71.33837923104808]
We show that prompt-based large language models can operate effectively as weak learners.
We incorporate these models into a boosting approach, which can leverage the knowledge within the model to outperform traditional tree-based boosting.
Results illustrate the potential for prompt-based LLMs to function not just as few-shot learners themselves, but as components of larger machine learning pipelines.
arXiv Detail & Related papers (2023-06-25T02:39:19Z) - The Effect of Masking Strategies on Knowledge Retention by Language
Models [9.130890741447422]
This paper aims to understand the effect of pre-training tasks on the amount of knowledge captured and forgotten by language models.
We test the model's knowledge retention by measuring its ability to answer factual questions.
Our findings demonstrate that, like the ability to perform a task, the knowledge acquired from being trained on that task is forgotten when a model is trained to perform another task.
arXiv Detail & Related papers (2023-06-12T15:35:23Z) - Decouple knowledge from parameters for plug-and-play language modeling [77.5601135412186]
We introduce PlugLM, a pre-training model with differentiable plug-in memory (DPM).
The key intuition is to decouple the knowledge storage from model parameters with an editable and scalable key-value memory.
PlugLM obtains 3.95 F1 improvements across four domains on average without any in-domain pre-training.
arXiv Detail & Related papers (2023-05-19T10:01:55Z) - Large Language Models with Controllable Working Memory [64.71038763708161]
Large language models (LLMs) have led to a series of breakthroughs in natural language processing (NLP).
What further sets these models apart is the massive amounts of world knowledge they internalize during pretraining.
How the model's world knowledge interacts with the factual information presented in the context remains underexplored.
arXiv Detail & Related papers (2022-11-09T18:58:29Z) - Effective training-time stacking for ensembling of deep neural networks [1.2667973028134798]
A snapshot ensembling collects models in the ensemble along a single training path.
Our method improves snapshot ensembling by selecting and weighting ensemble members along the training path.
It relies on training-time likelihoods and does not look at validation sample errors, as standard stacking methods do.
arXiv Detail & Related papers (2022-06-27T17:52:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.