Related papers: Inside-Out: Hidden Factual Knowledge in LLMs

URL: http://arxiv.org/abs/2503.15299v2
Date: Mon, 24 Mar 2025 01:31:35 GMT
Title: Inside-Out: Hidden Factual Knowledge in LLMs
Authors: Zorik Gekhman, Eyal Ben David, Hadas Orgad, Eran Ofek, Yonatan Belinkov, Idan Szpektor, Jonathan Herzig, Roi Reichart
Abstract summary: This work presents a framework for assessing whether large language models (LLMs) encode more factual knowledge in their parameters than what they express in their outputs. We first propose a formal definition of knowledge, quantifying it for a given question as the fraction of correct-incorrect answer pairs where the correct one is ranked higher. We then present a case study, applying this framework to three popular open-weights LLMs in a closed-book QA setup.
Score: 50.79758420289131
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This work presents a framework for assessing whether large language models (LLMs) encode more factual knowledge in their parameters than what they express in their outputs. While a few studies hint at this possibility, none has clearly defined or demonstrated this phenomenon. We first propose a formal definition of knowledge, quantifying it for a given question as the fraction of correct-incorrect answer pairs where the correct one is ranked higher. This gives rise to external and internal knowledge, depending on the information used to score individual answer candidates: either the model's observable token-level probabilities or its intermediate computations. Hidden knowledge arises when internal knowledge exceeds external knowledge. We then present a case study, applying this framework to three popular open-weights LLMs in a closed-book QA setup. Our results indicate that: (1) LLMs consistently encode more factual knowledge internally than what they express externally, with an average relative gap of 40%. (2) Surprisingly, some knowledge is so deeply hidden that a model can internally know an answer perfectly, yet fail to generate it even once, despite large-scale repeated sampling of 1,000 answers. This reveals fundamental limitations in the generation capabilities of LLMs, which (3) put a practical constraint on scaling test-time compute via repeated answer sampling in closed-book QA: significant performance improvements remain inaccessible because some answers are practically never sampled, yet if they were, we would be guaranteed to rank them first.
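The pairwise definition in the abstract can be illustrated with a minimal Python sketch. It is based only on the abstract's wording, not on the paper's actual implementation: the function name `knowledge_score`, the strict-inequality handling of ties, and all candidate answers and scores below are hypothetical values chosen for illustration.

```python
from itertools import product
from typing import Callable, Sequence


def knowledge_score(
    correct: Sequence[str],
    incorrect: Sequence[str],
    score: Callable[[str], float],
) -> float:
    """Fraction of (correct, incorrect) answer pairs in which the
    correct answer receives the higher score.

    Assumption: ties count as failures (strict inequality); the
    abstract does not say how ties are handled.
    """
    pairs = list(product(correct, incorrect))
    if not pairs:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = sum(1 for good, bad in pairs if score(good) > score(bad))
    return wins / len(pairs)


# Hypothetical candidate answers for a single question.
correct = ["Paris"]
incorrect = ["Lyon", "Marseille", "Nice", "Rome"]

# Hypothetical "external" scores, standing in for token-level probabilities.
external = {"Paris": 0.20, "Lyon": 0.30, "Marseille": 0.10, "Nice": 0.25, "Rome": 0.05}

# Hypothetical "internal" scores, standing in for a scorer that reads
# the model's intermediate computations.
internal = {"Paris": 0.90, "Lyon": 0.30, "Marseille": 0.20, "Nice": 0.40, "Rome": 0.10}

k_external = knowledge_score(correct, incorrect, external.__getitem__)
k_internal = knowledge_score(correct, incorrect, internal.__getitem__)

# Per the abstract, hidden knowledge arises when internal exceeds external.
has_hidden_knowledge = k_internal > k_external

print(k_external, k_internal, has_hidden_knowledge)
```

With these made-up scores the correct answer outranks only two of the four incorrect candidates externally but all four internally, which is the kind of gap the abstract calls hidden knowledge.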

Related papers

NanoKnow: How to Know What Your Language Model Knows [44.07087580987766]
We release NanoKnow, a benchmark dataset that partitions questions from Natural Questions and SQuAD into splits. Using these splits, we can now properly disentangle the sources of knowledge that LLMs rely on when producing an output. Our findings show that closed-book accuracy is strongly influenced by answer frequency in the pre-training data.
arXiv Detail & Related papers (2026-02-23T18:37:49Z)
Prompting is not Enough: Exploring Knowledge Integration and Controllable Generation [89.65955788873532]
Open-domain question answering (OpenQA) represents a cornerstone in natural language processing (NLP). We propose a novel framework named GenKI, which aims to improve the OpenQA performance by exploring Knowledge Integration and controllable Generation.
arXiv Detail & Related papers (2025-05-26T08:18:33Z)
Automatically Advancing LLM Expertise in Technology Judgment [1.1269582666887323]
Large language models (LLMs) are rapidly becoming core tools for science, engineering, and innovation. Despite their impressive ability to answer increasingly difficult questions, it remains unclear whether LLMs truly use their knowledge when confronted with new and challenging tasks. We evaluate a benchmark of 1.3 million post-2015 computer science patent pairs, characterized by dense technical jargon and strategically complex writing. We find that LLMs often fail our benchmark and struggle to distinguish among semantically similar patents.
arXiv Detail & Related papers (2025-05-18T15:04:02Z)
KnowPath: Knowledge-enhanced Reasoning via LLM-generated Inference Paths over Knowledge Graphs [35.63483147113076]
Large language models (LLMs) have demonstrated remarkable capabilities in various complex tasks, yet they still suffer from hallucinations. We propose KnowPath, a knowledge-enhanced large model framework driven by the collaboration of internal and external knowledge.
arXiv Detail & Related papers (2025-02-17T17:02:01Z)
Are LLMs Really Not Knowledgable? Mining the Submerged Knowledge in LLMs' Memory [15.986679553468989]
Large language models (LLMs) have shown promise as potential knowledge bases. LLMs often struggle with question-answering tasks and are prone to hallucinations. We develop SkipUnsure, a method to improve answer accuracy by leveraging detected but unexpressed knowledge.
arXiv Detail & Related papers (2024-12-30T10:29:18Z)
Perception of Knowledge Boundary for Large Language Models through Semi-open-ended Question Answering [67.94354589215637]
Large Language Models (LLMs) are widely used for knowledge-seeking yet suffer from hallucinations. In this paper, we perceive the LLMs' knowledge boundary (KB) with semi-open-ended questions (SoeQ). We find that GPT-4 performs poorly on SoeQ and is often unaware of its KB. Our auxiliary model, LLaMA-2-13B, is effective in discovering more ambiguous answers.
arXiv Detail & Related papers (2024-05-23T10:00:14Z)
Towards Reliable Latent Knowledge Estimation in LLMs: Zero-Prompt Many-Shot Based Factual Knowledge Extraction [15.534647327246239]
We propose to eliminate prompt engineering when probing large language models (LLMs) for factual knowledge. Our approach, called Zero-Prompt Latent Knowledge Estimator (ZP-LKE), leverages the in-context learning ability of LLMs. We perform a large-scale evaluation of the factual knowledge of a variety of open-source LLMs over a large set of relations and facts from the Wikidata knowledge base.
arXiv Detail & Related papers (2024-04-19T15:40:39Z)
RECALL: A Benchmark for LLMs Robustness against External Counterfactual Knowledge [69.79676144482792]
This study aims to evaluate the ability of LLMs to distinguish reliable information from external knowledge. Our benchmark consists of two tasks, Question Answering and Text Generation, and for each task, we provide models with a context containing counterfactual information.
arXiv Detail & Related papers (2023-11-14T13:24:19Z)
Self-Knowledge Guided Retrieval Augmentation for Large Language Models [59.771098292611846]
Large language models (LLMs) have shown superior performance without task-specific fine-tuning. Retrieval-based methods can offer non-parametric world knowledge and improve the performance on tasks such as question answering. Self-Knowledge guided Retrieval augmentation (SKR) is a simple yet effective method which can let LLMs refer to the questions they have previously encountered.
arXiv Detail & Related papers (2023-10-08T04:22:33Z)
FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation [92.43001160060376]
We study the factuality of large language models (LLMs) in the context of answering questions that test current world knowledge. We introduce FreshQA, a novel dynamic QA benchmark encompassing a diverse range of question and answer types. We benchmark a diverse array of both closed and open-source LLMs under a two-mode evaluation procedure that allows us to measure both correctness and hallucination. Motivated by these results, we present FreshPrompt, a simple few-shot prompting method that substantially boosts the performance of an LLM on FreshQA.
arXiv Detail & Related papers (2023-10-05T00:04:12Z)
Knowledge Solver: Teaching LLMs to Search for Domain Knowledge from Knowledge Graphs [19.0797968186656]
Large language models (LLMs) are versatile and can solve different tasks due to their emergent ability and generalizability. In some previous works, additional modules like graph neural networks (GNNs) are trained on retrieved knowledge from external knowledge bases.
arXiv Detail & Related papers (2023-09-06T15:55:01Z)
Investigating the Factual Knowledge Boundary of Large Language Models with Retrieval Augmentation [109.8527403904657]
We show that large language models (LLMs) possess unwavering confidence in their knowledge and cannot handle the conflict between internal and external knowledge well. Retrieval augmentation proves to be an effective approach in enhancing LLMs' awareness of knowledge boundaries. We propose a simple method to dynamically utilize supporting documents with our judgement strategy.
arXiv Detail & Related papers (2023-07-20T16:46:10Z)
Knowledge-Augmented Language Model Prompting for Zero-Shot Knowledge Graph Question Answering [7.888547093390469]
Large Language Models (LLMs) are capable of performing zero-shot closed-book question answering tasks. We propose to augment the knowledge directly in the input of LLMs. Our framework, Knowledge-Augmented language model PromptING (KAPING), requires no model training, thus completely zero-shot.
arXiv Detail & Related papers (2023-06-07T04:15:21Z)

This list is automatically generated from the titles and abstracts of the papers on this site.