The Trilemma of Truth in Large Language Models
- URL: http://arxiv.org/abs/2506.23921v2
- Date: Tue, 08 Jul 2025 21:09:56 GMT
- Title: The Trilemma of Truth in Large Language Models
- Authors: Germans Savcisens, Tina Eliassi-Rad
- Abstract summary: We examine two common methods for probing the veracity of large language models (LLMs). We introduce sAwMIL, a probing method that utilizes the internal activations of LLMs to separate statements into true, false, and neither. We evaluate sAwMIL on 5 validity criteria across 16 open-source LLMs, including both default and chat-based variants, as well as on 3 new datasets.
- Score: 1.62933895796838
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We often attribute human characteristics to large language models (LLMs) and claim that they "know" certain things. LLMs have internal probabilistic knowledge that represents information retained during training. How can we assess the veracity of this knowledge? We examine two common methods for probing the veracity of LLMs and discover several assumptions that are flawed. To address these flawed assumptions, we introduce sAwMIL (short for Sparse Aware Multiple-Instance Learning), a probing method that utilizes the internal activations of LLMs to separate statements into true, false, and neither. sAwMIL is based on multiple-instance learning and conformal prediction. We evaluate sAwMIL on 5 validity criteria across 16 open-source LLMs, including both default and chat-based variants, as well as on 3 new datasets. Among the insights we provide are: (1) the veracity signal is often concentrated in the third quarter of an LLM's depth; (2) truth and falsehood signals are not always symmetric; (3) linear probes perform better on chat models than on default models; (4) nonlinear probes may be required to capture veracity signals for some LLMs with reinforcement learning from human feedback or knowledge distillation; and (5) LLMs capture a third type of signal that is distinct from true and false and is neither true nor false. These findings provide a reliable method for verifying what LLMs "know" and how certain they are of their probabilistic internal knowledge.
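For a concrete picture of the pipeline, the following is a loose sketch under stated assumptions, not the authors' implementation: each statement becomes a multiple-instance "bag" of per-token hidden states from one LLM layer, a sparse (L1-penalized) linear probe scores tokens, the bag score is max-pooled, and quantile thresholds from a held-out calibration split give a conformal-style three-way verdict (true, false, or neither). The max-pooling choice, the threshold rule, and all function names are assumptions; the actual sAwMIL objective and conformal procedure are specified in the paper.

```python
# Loose sketch of a sAwMIL-style probe (assumptions flagged below; NOT the authors' code).
# Each statement is a "bag" of per-token hidden states from one LLM layer; a sparse linear
# probe scores tokens, the bag score is the max token score (an assumed MIL pooling choice),
# and calibration quantiles give a conformal-style true/false/neither rule.
import numpy as np
from sklearn.linear_model import LogisticRegression

def bag_score(token_acts, clf):
    """Max decision-function margin over the tokens of one statement."""
    return clf.decision_function(token_acts).max()

def fit_probe(train_bags, train_labels, calib_bags, calib_labels, alpha=0.1):
    # Train a sparse (L1) token-level probe by giving every token its bag's label.
    X = np.vstack(train_bags)
    y = np.concatenate([[lab] * len(bag) for lab, bag in zip(train_labels, train_bags)])
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)

    # Calibration: roughly 90% of true statements score above pos_q and
    # roughly 90% of false statements score below neg_q (alpha = 0.1).
    calib_labels = np.asarray(calib_labels)
    scores = np.array([bag_score(b, clf) for b in calib_bags])
    pos_q = np.quantile(scores[calib_labels == 1], alpha)
    neg_q = np.quantile(scores[calib_labels == 0], 1 - alpha)
    return clf, pos_q, neg_q

def classify(bag, clf, pos_q, neg_q):
    s = bag_score(bag, clf)
    if s >= pos_q and s > neg_q:
        return "true"
    if s <= neg_q and s < pos_q:
        return "false"
    return "neither"  # ambiguous region: the probe abstains
```

The point of the sketch is the three-way decision: a statement whose score falls in the ambiguous region between the two calibrated thresholds is returned as "neither" rather than forced into true or false.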
Related papers
- Inside-Out: Hidden Factual Knowledge in LLMs [50.79758420289131]
This work presents a framework for assessing whether large language models (LLMs) encode more factual knowledge in their parameters than what they express in their outputs. We first propose a formal definition of knowledge, quantifying it for a given question as the fraction of correct-incorrect answer pairs where the correct one is ranked higher (a toy sketch of this pairwise metric appears after this list). We then present a case study, applying this framework to three popular open-weights LLMs in a closed-book QA setup.
arXiv Detail & Related papers (2025-03-19T15:21:48Z)
- On Fact and Frequency: LLM Responses to Misinformation Expressed with Uncertainty [0.0]
We study the response of three widely used LLMs to misinformation propositions that have been verified false and then transformed into uncertain statements. Our results show that after transformation, LLMs change their fact-checking classification from false to not-false in 25% of the cases. The exception is doxastic transformations, which use linguistic cue phrases such as "It is believed..."
arXiv Detail & Related papers (2025-03-06T10:02:25Z)
- Understanding Knowledge Drift in LLMs through Misinformation [11.605377799885238]
Large Language Models (LLMs) have revolutionized numerous applications, making them an integral part of our digital ecosystem.
We analyze the susceptibility of state-of-the-art LLMs to factual inaccuracies when they encounter false information in a question-answering scenario.
Our experiments reveal that an LLM's uncertainty can increase by up to 56.6% when the question is answered incorrectly.
arXiv Detail & Related papers (2024-09-11T08:11:16Z)
- Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data [9.31120925026271]
We study inductive out-of-context reasoning (OOCR) in which LLMs infer latent information from evidence distributed across training documents. In one experiment we finetune an LLM on a corpus consisting only of distances between an unknown city and other known cities. While OOCR succeeds in a range of cases, we also show that it is unreliable, particularly for smaller LLMs learning complex structures.
arXiv Detail & Related papers (2024-06-20T17:55:04Z)
- A Probabilistic Framework for LLM Hallucination Detection via Belief Tree Propagation [72.93327642336078]
We propose Belief Tree Propagation (BTProp), a probabilistic framework for hallucination detection. BTProp introduces a belief tree of logically related statements by decomposing a parent statement into child statements. Our method improves baselines by 3%-9% (evaluated by AUROC and AUC-PR) on multiple hallucination detection benchmarks.
arXiv Detail & Related papers (2024-06-11T05:21:37Z)
- FAC$^2$E: Better Understanding Large Language Model Capabilities by Dissociating Language and Cognition [56.76951887823882]
Large language models (LLMs) are primarily evaluated by overall performance on various text understanding and generation tasks.
We present FAC$^2$E, a framework for Fine-grAined and Cognition-grounded LLMs' Capability Evaluation.
arXiv Detail & Related papers (2024-02-29T21:05:37Z)
- Characterizing Truthfulness in Large Language Model Generations with Local Intrinsic Dimension [63.330262740414646]
We study how to characterize and predict the truthfulness of texts generated by large language models (LLMs).
We suggest investigating internal activations and quantifying an LLM's truthfulness using the local intrinsic dimension (LID) of model activations.
arXiv Detail & Related papers (2024-02-28T04:56:21Z)
- Uncertainty Quantification for In-Context Learning of Large Language Models [52.891205009620364]
In-context learning has emerged as a groundbreaking ability of Large Language Models (LLMs).
We propose a novel formulation and corresponding estimation method to quantify both types of uncertainties.
The proposed method offers an unsupervised way to understand the prediction of in-context learning in a plug-and-play fashion.
arXiv Detail & Related papers (2024-02-15T18:46:24Z)
- ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases.
We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets.
Our generated data is human-readable and useful for triggering hallucinations in large language models.
arXiv Detail & Related papers (2023-10-19T06:37:32Z)
- How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions [34.53980255211931]
Large language models (LLMs) can "lie", which we define as outputting false statements despite "knowing" the truth in a demonstrable sense.
Here, we develop a simple lie detector that requires neither access to the LLM's activations nor ground-truth knowledge of the fact in question.
Despite its simplicity, this lie detector is highly accurate and surprisingly general.
arXiv Detail & Related papers (2023-09-26T16:07:54Z)
- DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models [79.01926242857613]
Large language models (LLMs) are prone to hallucinations, generating content that deviates from facts seen during pretraining.
We propose a simple decoding strategy for reducing hallucinations with pretrained LLMs.
We find that this Decoding by Contrasting Layers (DoLa) approach is able to better surface factual knowledge and reduce the generation of incorrect facts.
arXiv Detail & Related papers (2023-09-07T17:45:31Z)
- The Internal State of an LLM Knows When It's Lying [18.886091925252174]
Large Language Models (LLMs) have shown exceptional performance in various tasks.
One of their most prominent drawbacks is generating inaccurate or false information with a confident tone.
We provide evidence that the LLM's internal state can be used to reveal the truthfulness of statements.
arXiv Detail & Related papers (2023-04-26T02:49:38Z)
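The Inside-Out entry above quantifies knowledge for a question as the fraction of correct-incorrect answer pairs in which the correct answer is ranked higher; the toy sketch referenced in that entry follows. It assumes the model exposes a scalar score (e.g., a log-probability) per candidate answer; the function name and example numbers are illustrative and not taken from the paper.

```python
# Toy paraphrase of the pairwise knowledge score from "Inside-Out: Hidden Factual
# Knowledge in LLMs" (my reading of the abstract summary, not the authors' code).
from itertools import product

def pairwise_knowledge(correct_scores, incorrect_scores):
    """Fraction of (correct, incorrect) answer pairs where the correct answer scores higher."""
    pairs = list(product(correct_scores, incorrect_scores))
    if not pairs:
        raise ValueError("need at least one correct and one incorrect candidate answer")
    wins = sum(c > i for c, i in pairs)
    return wins / len(pairs)

# Illustrative log-probability scores for one question: one correct answer and
# three incorrect ones -> 2 of 3 pairs favor the correct answer.
print(pairwise_knowledge([-1.2], [-0.9, -2.5, -3.1]))  # 0.666...
```

A value of 1.0 means every correct candidate outranks every incorrect one, while 0.5 corresponds to chance-level ranking.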