Cognitive Dissonance: Why Do Language Model Outputs Disagree with
Internal Representations of Truthfulness?
- URL: http://arxiv.org/abs/2312.03729v1
- Date: Mon, 27 Nov 2023 18:59:14 GMT
- Title: Cognitive Dissonance: Why Do Language Model Outputs Disagree with
Internal Representations of Truthfulness?
- Authors: Kevin Liu, Stephen Casper, Dylan Hadfield-Menell, Jacob Andreas
- Abstract summary: Neural language models (LMs) can be used to evaluate the truth of factual statements.
They can be queried for statement probabilities, or probed for internal representations of truthfulness.
Past work has found that these two procedures sometimes disagree, and that probes tend to be more accurate than LM outputs.
This has led some researchers to conclude that LMs "lie" or otherwise encode non-cooperative communicative intents.
- Score: 53.98071556805525
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Neural language models (LMs) can be used to evaluate the truth of factual
statements in two ways: they can be either queried for statement probabilities,
or probed for internal representations of truthfulness. Past work has found
that these two procedures sometimes disagree, and that probes tend to be more
accurate than LM outputs. This has led some researchers to conclude that LMs
"lie" or otherwise encode non-cooperative communicative intents. Is this an
accurate description of today's LMs, or can query-probe disagreement arise in
other ways? We identify three different classes of disagreement, which we term
confabulation, deception, and heterogeneity. In many cases, the superiority of
probes is simply attributable to better calibration on uncertain answers rather
than a greater fraction of correct, high-confidence answers. In some cases,
queries and probes perform better on different subsets of inputs, and accuracy
can further be improved by ensembling the two. Code is available at
github.com/lingo-mit/lm-truthfulness.
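The abstract's points about calibration and ensembling can be illustrated with a minimal sketch. This is not the authors' code (see the repository above for that): the function names, the equal-weight averaging rule, the binning scheme, and the toy numbers are all assumptions chosen for illustration. It assumes each method yields a probability that a statement is true.

```python
import numpy as np

def accuracy(probs, labels):
    """Fraction of statements classified correctly at a 0.5 threshold."""
    preds = (np.asarray(probs, dtype=float) >= 0.5).astype(int)
    return float(np.mean(preds == np.asarray(labels, dtype=int)))

def expected_calibration_error(probs, labels, n_bins=10):
    """Binned gap between confidence and accuracy (lower is better calibrated)."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    preds = (probs >= 0.5).astype(int)
    # Confidence in the predicted class, always in [0.5, 1].
    conf = np.where(preds == 1, probs, 1.0 - probs)
    correct = (preds == labels).astype(float)
    edges = np.linspace(0.5, 1.0, n_bins + 1)
    bin_idx = np.clip(np.digitize(conf, edges[1:-1]), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return float(ece)

def ensemble(query_probs, probe_probs, weight=0.5):
    """Hypothetical ensembling rule: a weighted average of the two estimates."""
    q = np.asarray(query_probs, dtype=float)
    p = np.asarray(probe_probs, dtype=float)
    return weight * q + (1.0 - weight) * p

# Toy data (made up): 1 = true statement, 0 = false statement.
labels = [1, 0, 1, 0, 1, 0]
query_probs = [0.95, 0.05, 0.90, 0.45, 0.45, 0.55]  # from LM output probabilities
probe_probs = [0.55, 0.45, 0.40, 0.10, 0.90, 0.10]  # from a probe on hidden states
combined = ensemble(query_probs, probe_probs)

for name, p in [("query", query_probs), ("probe", probe_probs), ("ensemble", combined)]:
    print(name, accuracy(p, labels), expected_calibration_error(p, labels))
```

In this toy data the query and the probe err on different statements, so averaging them recovers every label. Real gains, if any, depend on how the two methods' errors are distributed.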
Related papers
- LACIE: Listener-Aware Finetuning for Confidence Calibration in Large Language Models [69.68379406317682]
We introduce a listener-aware finetuning method (LACIE) to calibrate implicit and explicit confidence markers.
We show that LACIE models the listener, considering not only whether an answer is right, but whether it will be accepted by a listener.
We find that training with LACIE results in 47% fewer incorrect answers being accepted while maintaining the same level of acceptance for correct answers.
arXiv Detail & Related papers (2024-05-31T17:16:38Z)
- What Do Llamas Really Think? Revealing Preference Biases in Language Model Representations [62.91799637259657]
Do large language models (LLMs) exhibit sociodemographic biases, even when they decline to respond?
We study this research question by probing contextualized embeddings and exploring whether this bias is encoded in their latent representations.
We propose a logistic Bradley-Terry probe which predicts word pair preferences of LLMs from the words' hidden vectors.
arXiv Detail & Related papers (2023-11-30T18:53:13Z)
- The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets [7.953477673546057]
Large Language Models (LLMs) have impressive capabilities, but are also prone to outputting falsehoods.
We present evidence that language models linearly represent the truth or falsehood of factual statements.
We introduce a novel technique, mass-mean probing, which generalizes better and is more causally implicated in model outputs than other probing techniques.
arXiv Detail & Related papers (2023-10-10T17:54:39Z)
- Shifting Attention to Relevance: Towards the Predictive Uncertainty Quantification of Free-Form Large Language Models [27.491408293411734]
Large Language Models (LLMs) show promising results in language generation and instruction following but frequently "hallucinate".
Our research introduces a simple redundancy: not all tokens in auto-regressive text equally represent the underlying meaning.
arXiv Detail & Related papers (2023-07-03T22:17:16Z)
- LM vs LM: Detecting Factual Errors via Cross Examination [22.50837561382647]
We propose a factuality evaluation framework for language models (LMs).
Our key idea is that an incorrect claim is likely to result in inconsistency with other claims that the model generates.
We empirically evaluate our method on factual claims made by multiple recent LMs on four benchmarks.
arXiv Detail & Related papers (2023-05-22T17:42:14Z)
- Large Language Models are Better Reasoners with Self-Verification [48.534270563880845]
Large language models (LLMs) have shown strong reasoning ability in several natural language processing tasks.
LLMs with chain of thought (CoT) prompting require multi-step prompting and multi-token prediction, which is highly sensitive to individual mistakes.
We propose and prove that LLMs also have self-verification abilities, similar to the way humans re-check their own conclusions.
arXiv Detail & Related papers (2022-12-19T15:51:52Z)
- How Can We Know When Language Models Know? On the Calibration of Language Models for Question Answering [80.82194311274694]
We examine the question "how can we know when language models know, with confidence, the answer to a particular query?"
We examine three strong generative models -- T5, BART, and GPT-2 -- and study whether their probabilities on QA tasks are well calibrated.
We then examine methods to calibrate such models to make their confidence scores correlate better with the likelihood of correctness.
arXiv Detail & Related papers (2020-12-02T03:53:13Z)
- Information-Theoretic Probing for Linguistic Structure [74.04862204427944]
We propose an information-theoretic operationalization of probing as estimating mutual information.
We evaluate on a set of ten typologically diverse languages often underrepresented in NLP research.
arXiv Detail & Related papers (2020-04-07T01:06:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.