LMEnt: A Suite for Analyzing Knowledge in Language Models from Pretraining Data to Representations
- URL: http://arxiv.org/abs/2509.03405v1
- Date: Wed, 03 Sep 2025 15:31:18 GMT
- Title: LMEnt: A Suite for Analyzing Knowledge in Language Models from Pretraining Data to Representations
- Authors: Daniela Gottesman, Alon Gilae-Dotan, Ido Cohen, Yoav Gur-Arieh, Marius Mosbach, Ori Yoran, Mor Geva,
- Abstract summary: Language models (LMs) increasingly drive real-world applications that require world knowledge.<n>We present LMEnt, a suite for analyzing knowledge acquisition in LMs during pretraining.<n>We show the utility of LMEnt by studying knowledge acquisition across checkpoints, finding that fact frequency is key, but does not fully explain learning trends.
- Score: 35.01080969148123
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language models (LMs) increasingly drive real-world applications that require world knowledge. However, the internal processes through which models turn data into representations of knowledge and beliefs about the world, are poorly understood. Insights into these processes could pave the way for developing LMs with knowledge representations that are more consistent, robust, and complete. To facilitate studying these questions, we present LMEnt, a suite for analyzing knowledge acquisition in LMs during pretraining. LMEnt introduces: (1) a knowledge-rich pretraining corpus, fully annotated with entity mentions, based on Wikipedia, (2) an entity-based retrieval method over pretraining data that outperforms previous approaches by as much as 80.4%, and (3) 12 pretrained models with up to 1B parameters and 4K intermediate checkpoints, with comparable performance to popular open-sourced models on knowledge benchmarks. Together, these resources provide a controlled environment for analyzing connections between entity mentions in pretraining and downstream performance, and the effects of causal interventions in pretraining data. We show the utility of LMEnt by studying knowledge acquisition across checkpoints, finding that fact frequency is key, but does not fully explain learning trends. We release LMEnt to support studies of knowledge in LMs, including knowledge representations, plasticity, editing, attribution, and learning dynamics.
Related papers
- Training Language Models to Explain Their Own Computations [73.8562596518326]
We study the extent to which LMs' privileged access to their own internals can be leveraged to produce new techniques for explaining their behavior.<n>Using existing interpretability techniques as a source of ground truth, we fine-tune LMs to generate natural language descriptions of (1) the information encoded by LM features, (2) the causal structure of LMs' internal activations, and (3) the influence of specific input tokens on LM outputs.
arXiv Detail & Related papers (2025-11-11T18:57:14Z) - R1-Searcher++: Incentivizing the Dynamic Knowledge Acquisition of LLMs via Reinforcement Learning [83.256752220849]
Large Language Models (LLMs) are powerful but prone to hallucinations due to static knowledge.<n>We introduce R1-Searcher++, a framework designed to train LLMs to adaptively leverage both internal and external knowledge sources.<n>Our experiments demonstrate that R1-Searcher++ outperforms previous RAG and reasoning methods and achieves efficient retrieval.
arXiv Detail & Related papers (2025-05-22T17:58:26Z) - Answer When Needed, Forget When Not: Language Models Pretend to Forget via In-Context Knowledge Unlearning [26.861562920084264]
Large language models (LLMs) are applied across diverse domains.<n>We propose a novel method termed in-context knowledge unlearning''<n>Our method achieves up to 95% forget accuracy while retaining 80% of unrelated knowledge.
arXiv Detail & Related papers (2024-10-01T04:13:25Z) - LM-PUB-QUIZ: A Comprehensive Framework for Zero-Shot Evaluation of Relational Knowledge in Language Models [2.1311017627417]
Knowledge probing evaluates the extent to which a language model (LM) has acquired relational knowledge during its pre-training phase.
We present LM-PUB- QUIZ, a Python framework and leaderboard built around the BEAR probing mechanism.
arXiv Detail & Related papers (2024-08-28T11:44:52Z) - TRELM: Towards Robust and Efficient Pre-training for Knowledge-Enhanced Language Models [31.209774088374374]
This paper introduces TRELM, a Robust and Efficient Pre-training framework for Knowledge-Enhanced Language Models.
We employ a robust approach to inject knowledge triples and employ a knowledge-augmented memory bank to capture valuable information.
We show that TRELM reduces pre-training time by at least 50% and outperforms other KEPLMs in knowledge probing tasks and multiple knowledge-aware language understanding tasks.
arXiv Detail & Related papers (2024-03-17T13:04:35Z) - Towards Automated Knowledge Integration From Human-Interpretable Representations [55.2480439325792]
We introduce and motivate theoretically the principles of informed meta-learning enabling automated and controllable inductive bias selection.<n>We empirically demonstrate the potential benefits and limitations of informed meta-learning in improving data efficiency and generalisation.
arXiv Detail & Related papers (2024-02-25T15:08:37Z) - Can LMs Learn New Entities from Descriptions? Challenges in Propagating
Injected Knowledge [72.63368052592004]
We study LMs' abilities to make inferences based on injected facts (or propagate those facts)
We find that existing methods for updating knowledge show little propagation of injected knowledge.
Yet, prepending entity definitions in an LM's context improves performance across all settings.
arXiv Detail & Related papers (2023-05-02T17:59:46Z) - Empowering Language Models with Knowledge Graph Reasoning for Question
Answering [117.79170629640525]
We propose knOwledge REasOning empowered Language Model (OREO-LM)
OREO-LM consists of a novel Knowledge Interaction Layer that can be flexibly plugged into existing Transformer-based LMs.
We show significant performance gain, achieving state-of-art results in the Closed-Book setting.
arXiv Detail & Related papers (2022-11-15T18:26:26Z) - Large Language Models with Controllable Working Memory [64.71038763708161]
Large language models (LLMs) have led to a series of breakthroughs in natural language processing (NLP)
What further sets these models apart is the massive amounts of world knowledge they internalize during pretraining.
How the model's world knowledge interacts with the factual information presented in the context remains under explored.
arXiv Detail & Related papers (2022-11-09T18:58:29Z) - IELM: An Open Information Extraction Benchmark for Pre-Trained Language
Models [75.48081086368606]
We introduce a new open information extraction (OIE) benchmark for pre-trained language models (LM)
We create an OIE benchmark aiming to fully examine the open relational information present in the pre-trained LMs.
Surprisingly, pre-trained LMs are able to obtain competitive performance on both standard OIE datasets.
arXiv Detail & Related papers (2022-10-25T16:25:00Z) - LM-CORE: Language Models with Contextually Relevant External Knowledge [13.451001884972033]
We argue that storing large amounts of knowledge in the model parameters is sub-optimal given the ever-growing amounts of knowledge and resource requirements.
We present LM-CORE -- a general framework to achieve this -- that allows textitdecoupling of the language model training from the external knowledge source.
Experimental results show that LM-CORE, having access to external knowledge, achieves significant and robust outperformance over state-of-the-art knowledge-enhanced language models on knowledge probing tasks.
arXiv Detail & Related papers (2022-08-12T18:59:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.