Truth-value judgment in language models: 'truth directions' are context sensitive
- URL: http://arxiv.org/abs/2404.18865v3
- Date: Fri, 11 Jul 2025 07:37:47 GMT
- Title: Truth-value judgment in language models: 'truth directions' are context sensitive
- Authors: Stefan F. Schouten, Peter Bloem, Ilia Markov, Piek Vossen
- Abstract summary: Large language models contain directions predictive of the truth of sentences. Multiple methods recover such directions and build probes that are described as uncovering a model's "knowledge" or "beliefs". We investigate this phenomenon, looking closely at the impact of context on the probes.
- Score: 2.324913904215885
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Recent work has demonstrated that the latent spaces of large language models (LLMs) contain directions predictive of the truth of sentences. Multiple methods recover such directions and build probes that are described as uncovering a model's "knowledge" or "beliefs". We investigate this phenomenon, looking closely at the impact of context on the probes. Our experiments establish where in the LLM the probe's predictions are (most) sensitive to the presence of related sentences, and how to best characterize this kind of sensitivity. We do so by measuring different types of consistency errors that occur after probing an LLM whose inputs consist of hypotheses preceded by (negated) supporting and contradicting sentences. We also perform a causal intervention experiment, investigating whether moving the representation of a premise along these truth-value directions influences the position of an entailed or contradicted sentence along that same direction. We find that the probes we test are generally context sensitive, but that contexts which should not affect the truth often still impact the probe outputs. Our experiments show that the types of errors depend on the layer, the model, and the kind of data. Finally, our results suggest that truth-value directions are causal mediators in the inference process that incorporates in-context information.
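As a rough illustration of the setup described in the abstract, here is a minimal sketch that fits a linear "truth direction" probe on a causal LM's hidden states and then nudges a representation along that direction. The model name, layer, toy statements, and the rescoring-style "intervention" are assumptions for illustration; the paper's causal intervention instead edits the premise representation inside the forward pass and measures the effect on an entailed or contradicted hypothesis.

```python
# Minimal sketch: fit a linear "truth direction" probe on hidden states of
# a causal LM, then nudge a representation along that direction and rescore.
# Model name, layer index, and the toy statements are placeholders.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"   # placeholder; the paper studies larger LLMs
LAYER = 6        # hypothetical layer to probe

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

def last_token_state(text: str, layer: int) -> np.ndarray:
    """Hidden state of the final token at the given layer."""
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer][0, -1].numpy()

# Toy labeled statements (1 = true, 0 = false).
statements = [
    ("Paris is the capital of France.", 1),
    ("Paris is the capital of Germany.", 0),
    ("Water freezes at 0 degrees Celsius.", 1),
    ("Water freezes at 100 degrees Celsius.", 0),
]
X = np.stack([last_token_state(s, LAYER) for s, _ in statements])
y = np.array([lab for _, lab in statements])

probe = LogisticRegression(max_iter=1000).fit(X, y)
direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])

# Crude stand-in for the intervention: shift a representation along the
# truth direction and rescore it with the probe. (The paper instead edits
# the premise representation inside the forward pass and measures the
# effect on an entailed or contradicted hypothesis.)
h = last_token_state("Water freezes at 0 degrees Celsius.", LAYER)
for alpha in (-5.0, 0.0, 5.0):
    p_true = probe.predict_proba((h + alpha * direction).reshape(1, -1))[0, 1]
    print(f"alpha={alpha:+.1f}  probe P(true)={p_true:.3f}")
```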
Related papers
- Towards Faithful Natural Language Explanations: A Study Using Activation Patching in Large Language Models [29.67884478799914]
Large Language Models (LLMs) are capable of generating persuasive Natural Language Explanations (NLEs) to justify their answers.
Recent studies have proposed various methods to measure the faithfulness of NLEs, typically by inserting perturbations at the explanation or feature level.
We argue that these approaches are neither comprehensive nor correctly designed according to the established definition of faithfulness.
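For readers unfamiliar with the technique being critiqued, a hedged sketch of generic activation patching with PyTorch forward hooks follows; the model (gpt2), layer index, and prompts are placeholder assumptions, not the paper's setup.

```python
# Generic activation patching with PyTorch forward hooks: cache one layer's
# output on a source prompt and splice it into a run on a target prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
block = model.transformer.h[5]      # hypothetical layer to patch
cache = {}

def save_hook(module, inputs, output):
    cache["h"] = output[0].detach()       # block output tuple: hidden states first

def patch_hook(module, inputs, output):
    return (cache["h"],) + output[1:]     # replace hidden states, keep the rest

# Prompts chosen to tokenize to the same length, so shapes line up.
src = tok("The capital of France is Paris.", return_tensors="pt")
dst = tok("The capital of France is London.", return_tensors="pt")

with torch.no_grad():
    handle = block.register_forward_hook(save_hook)
    model(**src)
    handle.remove()

    handle = block.register_forward_hook(patch_hook)
    patched_logits = model(**dst).logits
    handle.remove()

    clean_logits = model(**dst).logits

print("max logit change at the last position:",
      (patched_logits - clean_logits)[0, -1].abs().max().item())
```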
arXiv Detail & Related papers (2024-10-18T03:45:42Z) - How Entangled is Factuality and Deception in German? [10.790059579736276]
Research on deception detection and fact checking often conflates factual accuracy with the truthfulness of statements.
The belief-based deception framework disentangles these properties by defining texts as deceptive when there is a mismatch between what people say and what they truly believe.
We test the effectiveness of computational models in detecting deception using an established corpus of belief-based argumentation.
arXiv Detail & Related papers (2024-09-30T10:23:13Z) - Large Language Models are Skeptics: False Negative Problem of Input-conflicting Hallucination [36.01680298955394]
We identify a new category of bias that induces input-conflicting hallucinations.
We show that large language models (LLMs) generate responses inconsistent with the content of the input context.
arXiv Detail & Related papers (2024-06-20T01:53:25Z) - Smoke and Mirrors in Causal Downstream Tasks [59.90654397037007]
This paper looks at the causal inference task of treatment effect estimation, where the outcome of interest is recorded in high-dimensional observations.
We compare 6,480 models fine-tuned from state-of-the-art visual backbones, and find that the sampling and modeling choices significantly affect the accuracy of the causal estimate.
Our results suggest that future benchmarks should carefully consider real downstream scientific questions, especially causal ones.
arXiv Detail & Related papers (2024-05-27T13:26:34Z) - Cognitive Dissonance: Why Do Language Model Outputs Disagree with Internal Representations of Truthfulness? [53.98071556805525]
Neural language models (LMs) can be used to evaluate the truth of factual statements.
They can be queried for statement probabilities, or probed for internal representations of truthfulness.
Past work has found that these two procedures sometimes disagree, and that probes tend to be more accurate than LM outputs.
This has led some researchers to conclude that LMs "lie" or otherwise encode non-cooperative communicative intents.
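A hedged sketch of the first of the two procedures contrasted here, querying the output distribution for a truth judgment; the prompt template, answer tokens, and model are assumptions, and the probing side would reuse a trained linear probe on hidden states as in the truth-direction sketch near the top of this page.

```python
# Querying the LM's output for a truth judgment. Disagreement between this
# score and a hidden-state probe's score is the phenomenon the paper analyzes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def output_truth_score(statement: str) -> float:
    """Normalized probability of ' True' vs ' False' as the next token."""
    prompt = f"Statement: {statement}\nTrue or False? Answer:"
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**ids).logits[0, -1]
    t_id = tok(" True", add_special_tokens=False).input_ids[0]   # first sub-token
    f_id = tok(" False", add_special_tokens=False).input_ids[0]
    return torch.softmax(torch.stack([logits[t_id], logits[f_id]]), dim=0)[0].item()

print(output_truth_score("The Earth orbits the Sun."))
```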
arXiv Detail & Related papers (2023-11-27T18:59:14Z) - Is Probing All You Need? Indicator Tasks as an Alternative to Probing Embedding Spaces [19.4968960182412]
We introduce the term indicator tasks for non-trainable tasks which are used to query embedding spaces for the existence of certain properties.
We show that the application of a suitable indicator provides a more accurate picture of the information captured and removed compared to probes.
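A toy contrast between a trained probe and a non-trainable indicator query over an embedding space; the synthetic embeddings and the particular indicator (projection onto a difference of class means) are illustrative assumptions, not the paper's indicator tasks.

```python
# Toy contrast: a trained probe vs. a non-trainable indicator query.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
dim = 64
prop_dir = rng.normal(size=dim)            # ground-truth "property" direction

def embed(has_property: bool) -> np.ndarray:
    return rng.normal(size=dim) + (2.0 * prop_dir if has_property else 0.0)

y = np.array([i % 2 for i in range(200)])
X = np.stack([embed(bool(label)) for label in y])

# Trained probe.
probe_acc = LogisticRegression(max_iter=1000).fit(X, y).score(X, y)

# Indicator: no training, just query the space with a difference of means.
anchor = X[y == 1][:5].mean(0) - X[y == 0][:5].mean(0)
scores = X @ anchor
indicator_acc = ((scores > scores.mean()).astype(int) == y).mean()

print(f"probe accuracy: {probe_acc:.2f}  indicator accuracy: {indicator_acc:.2f}")
```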
arXiv Detail & Related papers (2023-10-24T15:08:12Z) - Navigating the Grey Area: How Expressions of Uncertainty and Overconfidence Affect Language Models [74.07684768317705]
LMs are highly sensitive to markers of certainty in prompts, with accuracies varying by more than 80%.
We find that expressions of high certainty result in lower accuracy than expressions of low certainty; similarly, factive verbs hurt performance, while evidentials benefit performance.
These associations may suggest that LM behavior is based on observed language use, rather than truly reflecting uncertainty.
arXiv Detail & Related papers (2023-02-26T23:46:29Z) - Mind Your Bias: A Critical Review of Bias Detection Methods for Contextual Language Models [2.170169149901781]
We conduct a rigorous analysis and comparison of bias detection methods for contextual language models.
Our results show that minor design and implementation decisions (or errors) have a substantial and often significant impact on the derived bias scores.
arXiv Detail & Related papers (2022-11-15T19:27:54Z) - Uncertain Evidence in Probabilistic Models and Stochastic Simulators [80.40110074847527]
We consider the problem of performing Bayesian inference in probabilistic models where observations are accompanied by uncertainty, referred to as 'uncertain evidence'.
We explore how to interpret uncertain evidence, and by extension the importance of proper interpretation as it pertains to inference about latent variables.
We devise concrete guidelines on how to account for uncertain evidence and we provide new insights, particularly regarding consistency.
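A toy numeric illustration (not the paper's example or its guidelines) of why the interpretation matters: the same report of 80% certainty about an observation yields different posteriors under Jeffrey's rule and under Pearl-style virtual evidence.

```python
# Toy illustration: "80% sure that X=1" handled via Jeffrey's rule vs.
# virtual evidence gives different posteriors. Numbers are made up.
p_z1 = 0.3                      # prior P(Z=1)
p_x1_given = {1: 0.9, 0: 0.2}   # P(X=1 | Z)
certainty = 0.8                 # reported confidence that X=1

def posterior_z1_given_x(x: int) -> float:
    like1 = p_x1_given[1] if x == 1 else 1 - p_x1_given[1]
    like0 = p_x1_given[0] if x == 1 else 1 - p_x1_given[0]
    num = like1 * p_z1
    return num / (num + like0 * (1 - p_z1))

# Jeffrey's rule: mix the two exact posteriors with the reported confidence.
jeffrey = certainty * posterior_z1_given_x(1) + (1 - certainty) * posterior_z1_given_x(0)

# Virtual evidence: a likelihood-ratio message certainty : (1 - certainty) on X.
def virtual_term(z: int) -> float:
    p_x1 = p_x1_given[z]
    return p_x1 * certainty + (1 - p_x1) * (1 - certainty)

num = p_z1 * virtual_term(1)
virtual = num / (num + (1 - p_z1) * virtual_term(0))

print(f"Jeffrey's rule posterior P(Z=1): {jeffrey:.3f}")    # ~0.537
print(f"Virtual evidence posterior P(Z=1): {virtual:.3f}")  # ~0.498
```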
arXiv Detail & Related papers (2022-10-21T20:32:59Z) - Naturalistic Causal Probing for Morpho-Syntax [76.83735391276547]
We suggest a naturalistic strategy for input-level intervention on real world data in Spanish.
Using our approach, we isolate morpho-syntactic features from confounders in sentences.
We apply this methodology to analyze causal effects of gender and number on contextualized representations extracted from pre-trained models.
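A minimal sketch of the flavor of input-level intervention described here, comparing contextualized representations of a Spanish minimal pair differing in grammatical gender; the sentence pair and the multilingual encoder are illustrative assumptions, and the paper's methodology controls confounders much more carefully.

```python
# Compare contextualized representations of a gender minimal pair.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased").eval()

def cls_embedding(text: str) -> torch.Tensor:
    with torch.no_grad():
        return model(**tok(text, return_tensors="pt")).last_hidden_state[0, 0]

masc = cls_embedding("El enfermero atendió al paciente.")
fem = cls_embedding("La enfermera atendió al paciente.")
print("shift caused by the gender intervention:", torch.norm(masc - fem).item())
```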
arXiv Detail & Related papers (2022-05-14T11:47:58Z) - Beyond Distributional Hypothesis: Let Language Models Learn Meaning-Text Correspondence [45.9949173746044]
We show that large-size pre-trained language models (PLMs) do not satisfy the logical negation property (LNP).
We propose a novel intermediate training task, named meaning-matching, designed to directly learn a meaning-text correspondence.
We find that the task enables PLMs to learn lexical semantic information.
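A crude sketch of probing for LNP-style behaviour by scoring a statement and its negation with a causal LM; the sentences, model, and scoring scheme are assumptions, not the paper's LNP test or its meaning-matching task.

```python
# Score a statement and its negation; if LNP held in this crude sense, the
# preferred member of each pair would flip between a true and a false fact.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def mean_logprob(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logp = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = logp.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp.mean().item()

pairs = [
    ("A rose is a flower.", "A rose is not a flower."),
    ("A rose is a vehicle.", "A rose is not a vehicle."),
]
for pos, neg in pairs:
    print(f"{pos!r}: {mean_logprob(pos):.3f}   {neg!r}: {mean_logprob(neg):.3f}")
```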
arXiv Detail & Related papers (2022-05-08T08:37:36Z) - AmbiFC: Fact-Checking Ambiguous Claims with Evidence [57.7091560922174]
We present AmbiFC, a fact-checking dataset with 10k claims derived from real-world information needs.
We analyze disagreements arising from ambiguity when comparing claims against evidence in AmbiFC.
We develop models that predict veracity while handling this ambiguity via soft labels.
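A minimal sketch of training a veracity classifier against soft labels with a KL objective; the stand-in linear model and random "annotator distributions" are placeholders, not the AmbiFC models or data.

```python
# Train against an annotator label distribution instead of a single gold class.
import torch
import torch.nn as nn

n_classes = 3                             # e.g. supported / refuted / neutral
model = nn.Linear(16, n_classes)          # stand-in for a claim+evidence encoder
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

features = torch.randn(8, 16)                                    # fake features
soft_labels = torch.softmax(torch.randn(8, n_classes), dim=-1)   # annotator distribution

for _ in range(200):
    log_probs = torch.log_softmax(model(features), dim=-1)
    loss = nn.functional.kl_div(log_probs, soft_labels, reduction="batchmean")
    opt.zero_grad()
    loss.backward()
    opt.step()

print("final KL to the soft labels:", loss.item())
```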
arXiv Detail & Related papers (2021-04-01T17:40:08Z) - Detecting Word Sense Disambiguation Biases in Machine Translation for Model-Agnostic Adversarial Attacks [84.61578555312288]
We introduce a method for the prediction of disambiguation errors based on statistical data properties.
We develop a simple adversarial attack strategy that minimally perturbs sentences in order to elicit disambiguation errors.
Our findings indicate that disambiguation robustness varies substantially between domains and that different models trained on the same data are vulnerable to different attacks.
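A hedged sketch of the general idea of a minimal perturbation targeting word-sense disambiguation in MT, adding one distractor phrase and checking whether the translation of the ambiguous word flips; the model, sentences, and distractor are assumptions, not the paper's attack strategy.

```python
# Does one distractor phrase flip the translated sense of "bank"?
from transformers import MarianMTModel, MarianTokenizer

name = "Helsinki-NLP/opus-mt-en-de"
tok = MarianTokenizer.from_pretrained(name)
model = MarianMTModel.from_pretrained(name)

def translate(text: str) -> str:
    batch = tok([text], return_tensors="pt")
    out = model.generate(**batch, max_new_tokens=40)
    return tok.decode(out[0], skip_special_tokens=True)

clean = "He sat down on the bank of the river."
perturbed = "He sat down on the bank of the river near the ATM."
print(translate(clean))
print(translate(perturbed))   # check whether "bank" flips from Ufer to Bank
```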
arXiv Detail & Related papers (2020-11-03T17:01:44Z) - Amnesic Probing: Behavioral Explanation with Amnesic Counterfactuals [53.484562601127195]
We point out the inability to infer behavioral conclusions from probing results.
We offer an alternative method that focuses on how the information is being used, rather than on what information is encoded.
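A sketch of the amnesic idea: remove the linearly decodable information a probe relies on (one nullspace-projection step in the spirit of INLP) and then re-test; the synthetic data and single-step removal are simplifying assumptions.

```python
# Remove the probe's direction from the representations, then re-check.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
dim, n = 32, 500
prop = rng.integers(0, 2, size=n)                  # property to "forget"
X = rng.normal(size=(n, dim)) + np.outer(prop, rng.normal(size=dim))

probe = LogisticRegression(max_iter=1000).fit(X, prop)
w = probe.coef_[0] / np.linalg.norm(probe.coef_[0])

# Project every representation onto the nullspace of the probe direction.
P = np.eye(dim) - np.outer(w, w)
X_amnesic = X @ P

before = probe.score(X, prop)
after = LogisticRegression(max_iter=1000).fit(X_amnesic, prop).score(X_amnesic, prop)
print(f"property decodable before: {before:.2f}, after removal: {after:.2f}")
# Amnesic probing would now rerun the original task (e.g. language modelling)
# on X_amnesic to see whether behaviour actually depends on the removed property.
```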
arXiv Detail & Related papers (2020-06-01T15:00:11Z) - CausalVAE: Structured Causal Disentanglement in Variational Autoencoder [52.139696854386976]
The framework of variational autoencoder (VAE) is commonly used to disentangle independent factors from observations.
We propose a new VAE based framework named CausalVAE, which includes a Causal Layer to transform independent factors into causal endogenous ones.
Results show that the causal representations learned by CausalVAE are semantically interpretable, and their causal relationship as a Directed Acyclic Graph (DAG) is identified with good accuracy.
arXiv Detail & Related papers (2020-04-18T20:09:34Z)
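A toy linear version of the "causal layer" idea from this last entry: independent exogenous factors are mapped to causally related endogenous factors through a DAG adjacency; this is a simplified stand-in, not the CausalVAE implementation.

```python
# Linear SCM view of a causal layer: z = A^T z + eps, i.e. z = (I - A^T)^{-1} eps.
import torch

k = 4
A = torch.zeros(k, k)
A[0, 2] = 1.0     # factor 0 causes factor 2
A[1, 3] = 1.0     # factor 1 causes factor 3

eps = torch.randn(16, k)                   # independent exogenous factors (rows)
z = eps @ torch.inverse(torch.eye(k) - A)  # row convention: z = z A + eps
print(z.shape)                             # endogenous, causally related factors
```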