Related papers: Cognitive Dissonance: Why Do Language Model Outputs Disagree with Internal Representations of Truthfulness?

URL: http://arxiv.org/abs/2312.03729v1
Date: Mon, 27 Nov 2023 18:59:14 GMT
Title: Cognitive Dissonance: Why Do Language Model Outputs Disagree with Internal Representations of Truthfulness?
Authors: Kevin Liu, Stephen Casper, Dylan Hadfield-Menell, Jacob Andreas
Abstract summary: Neural language models (LMs) can be used to evaluate the truth of factual statements. They can be queried for statement probabilities, or probed for internal representations of truthfulness. Past work has found that these two procedures sometimes disagree, and that probes tend to be more accurate than LM outputs. This has led some researchers to conclude that LMs "lie" or otherwise encode non-cooperative communicative intents.
Score: 53.98071556805525
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Neural language models (LMs) can be used to evaluate the truth of factual statements in two ways: they can be either queried for statement probabilities, or probed for internal representations of truthfulness. Past work has found that these two procedures sometimes disagree, and that probes tend to be more accurate than LM outputs. This has led some researchers to conclude that LMs "lie" or otherwise encode non-cooperative communicative intents. Is this an accurate description of today's LMs, or can query-probe disagreement arise in other ways? We identify three different classes of disagreement, which we term confabulation, deception, and heterogeneity. In many cases, the superiority of probes is simply attributable to better calibration on uncertain answers rather than a greater fraction of correct, high-confidence answers. In some cases, queries and probes perform better on different subsets of inputs, and accuracy can further be improved by ensembling the two. Code is available at github.com/lingo-mit/lm-truthfulness.
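The abstract's last finding, that queries and probes can err on different inputs and that ensembling the two can raise accuracy, can be illustrated with a minimal sketch. This is not the authors' implementation (their code is at the repository named above). The abstract does not say how the two are combined, so averaging in logit space is an assumption made here, and the function names and the four toy probabilities are hypothetical values invented purely for illustration.

```python
import math


def logit(p, eps=1e-6):
    """Log-odds of a probability, clipped away from 0 and 1 for stability."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def sigmoid(x):
    """Inverse of logit."""
    return 1.0 / (1.0 + math.exp(-x))


def ensemble(p_query, p_probe, weight=0.5):
    """Combine two truthfulness estimates by a weighted average of log-odds.

    Assumption: logit averaging is one simple ensembling rule, chosen for
    illustration; it is not taken from the paper.
    """
    return sigmoid(weight * logit(p_query) + (1.0 - weight) * logit(p_probe))


def accuracy(probs, labels):
    """Fraction of statements whose thresholded estimate matches the label."""
    hits = sum((p >= 0.5) == bool(y) for p, y in zip(probs, labels))
    return hits / len(labels)


# Hypothetical toy data (NOT results from the paper): 1 = true statement.
labels = [1, 0, 1, 0]
# P(true) read off the LM's output distribution ("query").
p_query = [0.9, 0.1, 0.4, 0.2]
# P(true) from a classifier on hidden states ("probe").
p_probe = [0.8, 0.6, 0.9, 0.3]

p_ens = [ensemble(q, r) for q, r in zip(p_query, p_probe)]

print("query accuracy:   ", accuracy(p_query, labels))
print("probe accuracy:   ", accuracy(p_probe, labels))
print("ensemble accuracy:", accuracy(p_ens, labels))
```

In this toy set the query is wrong on the third statement and the probe on the second, so each scores 0.75 alone, while the combined estimate classifies all four correctly. Real query and probe errors need not be this complementary.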

Related papers

Do Large Language Models Exhibit Spontaneous Rational Deception? [0.913127392774573]
Large Language Models (LLMs) are effective at deceiving, when prompted to do so. But under what conditions do they deceive spontaneously? This study evaluates spontaneous deception produced by LLMs in a preregistered experimental protocol.
arXiv Detail & Related papers (2025-03-31T23:10:56Z)
Anything Goes? A Crosslinguistic Study of (Im)possible Language Learning in LMs [5.4335487858206735]
We train language models to model impossible and typologically unattested languages. Our results show that while GPT-2 small can largely distinguish attested languages, it does not achieve perfect separation between all the attested languages and all the impossible ones. These findings suggest that LMs exhibit some human-like inductive biases, though these biases are weaker than those found in human learners.
arXiv Detail & Related papers (2025-02-26T04:01:36Z)
LLMs as mediators: Can they diagnose conflicts accurately? [0.0]
We find that OpenAI's Large Language Models GPT 3.5 and GPT 4 can reliably distinguish between causal and moral codes. When asked to diagnose the source of disagreement in a conversation, both LLMs, compared to humans, exhibit a tendency to overestimate the extent of causal disagreement.
arXiv Detail & Related papers (2024-12-19T09:29:08Z)
Are Large Language Models More Honest in Their Probabilistic or Verbalized Confidence? [26.69630281310365]
Large language models (LLMs) have been found to produce hallucinations when the question exceeds their internal knowledge boundaries. Existing research on LLMs' perception of their knowledge boundaries typically uses either the probability of the generated tokens or the verbalized confidence as the model's confidence in its response.
arXiv Detail & Related papers (2024-08-19T08:01:11Z)
LACIE: Listener-Aware Finetuning for Confidence Calibration in Large Language Models [69.68379406317682]
We introduce a listener-aware finetuning method (LACIE) to calibrate implicit and explicit confidence markers. We show that LACIE models the listener, considering not only whether an answer is right, but whether it will be accepted by a listener. We find that training with LACIE results in 47% fewer incorrect answers being accepted while maintaining the same level of acceptance for correct answers.
arXiv Detail & Related papers (2024-05-31T17:16:38Z)
Truth-value judgment in language models: 'truth directions' are context sensitive [2.324913904215885]
Large language models contain directions predictive of the truth of sentences. Multiple methods recover such directions and build probes that are described as uncovering a model's "knowledge" or "beliefs". We investigate this phenomenon, looking closely at the impact of context on the probes.
arXiv Detail & Related papers (2024-04-29T16:52:57Z)
Aligning Language Models to Explicitly Handle Ambiguity [22.078095273053506]
We propose Alignment with Perceived Ambiguity (APA), a novel pipeline that aligns language models to deal with ambiguous queries. Experimental results on question-answering datasets demonstrate that APA empowers LLMs to explicitly detect and manage ambiguous queries. Our finding proves that APA excels beyond training with gold-standard labels, especially in out-of-distribution scenarios.
arXiv Detail & Related papers (2024-04-18T07:59:53Z)
What Do Llamas Really Think? Revealing Preference Biases in Language Model Representations [62.91799637259657]
Do large language models (LLMs) exhibit sociodemographic biases, even when they decline to respond? We study this research question by probing contextualized embeddings and exploring whether this bias is encoded in its latent representations. We propose a logistic Bradley-Terry probe which predicts word pair preferences of LLMs from the words' hidden vectors.
arXiv Detail & Related papers (2023-11-30T18:53:13Z)
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets [6.732432949368421]
Large Language Models (LLMs) have impressive capabilities, but are prone to outputting falsehoods. Recent work has developed techniques for inferring whether an LLM is telling the truth by training probes on the LLM's internal activations. We present evidence that at sufficient scale, LLMs linearly represent the truth or falsehood of factual statements.
arXiv Detail & Related papers (2023-10-10T17:54:39Z)
LM vs LM: Detecting Factual Errors via Cross Examination [22.50837561382647]
We propose a factuality evaluation framework for language models (LMs). Our key idea is that an incorrect claim is likely to result in inconsistency with other claims that the model generates. We empirically evaluate our method on factual claims made by multiple recent LMs on four benchmarks.
arXiv Detail & Related papers (2023-05-22T17:42:14Z)
Large Language Models are Better Reasoners with Self-Verification [48.534270563880845]
Large language models (LLMs) have shown strong reasoning ability in several natural language processing tasks. LLMs with chain of thought (CoT) prompting require multi-step prompting and multi-token prediction, which is highly sensitive to individual mistakes. We propose and prove that LLMs also have similar self-verification abilities.
arXiv Detail & Related papers (2022-12-19T15:51:52Z)
How Can We Know When Language Models Know? On the Calibration of Language Models for Question Answering [80.82194311274694]
We examine the question "how can we know when language models know, with confidence, the answer to a particular query?" We examine three strong generative models -- T5, BART, and GPT-2 -- and study whether their probabilities on QA tasks are well calibrated. We then examine methods to calibrate such models to make their confidence scores correlate better with the likelihood of correctness.
arXiv Detail & Related papers (2020-12-02T03:53:13Z)
Information-Theoretic Probing for Linguistic Structure [74.04862204427944]
We propose an information-theoretic operationalization of probing as estimating mutual information. We evaluate on a set of ten typologically diverse languages often underrepresented in NLP research.
arXiv Detail & Related papers (2020-04-07T01:06:36Z)

This list is automatically generated from the titles and abstracts of the papers on this site.