Cognitive Dissonance: Why Do Language Model Outputs Disagree with
Internal Representations of Truthfulness?
- URL: http://arxiv.org/abs/2312.03729v1
- Date: Mon, 27 Nov 2023 18:59:14 GMT
- Title: Cognitive Dissonance: Why Do Language Model Outputs Disagree with
Internal Representations of Truthfulness?
- Authors: Kevin Liu, Stephen Casper, Dylan Hadfield-Menell, Jacob Andreas
- Abstract summary: Neural language models (LMs) can be used to evaluate the truth of factual statements.
They can be queried for statement probabilities, or probed for internal representations of truthfulness.
Past work has found that these two procedures sometimes disagree, and that probes tend to be more accurate than LM outputs.
This has led some researchers to conclude that LMs "lie" or otherwise encode non-cooperative communicative intents.
- Score: 53.98071556805525
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Neural language models (LMs) can be used to evaluate the truth of factual
statements in two ways: they can be either queried for statement probabilities,
or probed for internal representations of truthfulness. Past work has found
that these two procedures sometimes disagree, and that probes tend to be more
accurate than LM outputs. This has led some researchers to conclude that LMs
"lie" or otherwise encode non-cooperative communicative intents. Is this an
accurate description of today's LMs, or can query-probe disagreement arise in
other ways? We identify three different classes of disagreement, which we term
confabulation, deception, and heterogeneity. In many cases, the superiority of
probes is simply attributable to better calibration on uncertain answers rather
than a greater fraction of correct, high-confidence answers. In some cases,
queries and probes perform better on different subsets of inputs, and accuracy
can further be improved by ensembling the two. Code is available at
github.com/lingo-mit/lm-truthfulness.
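
To make the query-vs-probe distinction concrete, the sketch below shows one way to implement the two procedures and a simple ensemble. It is not the authors' released code (see the repository above); the model choice, prompt template, probe layer, and averaging rule are illustrative assumptions.

```python
# Minimal sketch of the two procedures discussed above, plus a simple ensemble.
# This is NOT the released lm-truthfulness code; the model, prompt template,
# probe layer, and averaging rule are illustrative assumptions.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # assumption: any causal LM with accessible hidden states
tok = AutoTokenizer.from_pretrained(MODEL)
lm = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True).eval()

def query_p_true(statement: str) -> float:
    """Query: compare the LM's next-token probabilities for ' True' vs ' False'."""
    prompt = f"Is the following statement true or false?\n{statement}\nAnswer:"
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(ids).logits[0, -1]                     # next-token logits
    t = tok(" True", add_special_tokens=False).input_ids[0]
    f = tok(" False", add_special_tokens=False).input_ids[0]
    return torch.softmax(logits[[t, f]], dim=0)[0].item()  # renormalise over the pair

def hidden_rep(statement: str, layer: int = -1) -> torch.Tensor:
    """Probe input: the last-token hidden state at a chosen layer."""
    ids = tok(statement, return_tensors="pt").input_ids
    with torch.no_grad():
        return lm(ids).hidden_states[layer][0, -1]

def fit_probe(statements, labels):
    """Probe: logistic regression from hidden states to truth labels (0/1)."""
    X = torch.stack([hidden_rep(s) for s in statements]).numpy()
    return LogisticRegression(max_iter=1000).fit(X, labels)

def ensemble_p_true(statement: str, probe) -> float:
    """One possible ensemble: average the query and probe probabilities."""
    p_query = query_p_true(statement)
    p_probe = probe.predict_proba(hidden_rep(statement).numpy()[None, :])[0, 1]
    return 0.5 * (p_query + p_probe)
```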
Related papers
- Are Large Language Models More Honest in Their Probabilistic or Verbalized Confidence? [26.69630281310365]
Large language models (LLMs) have been found to produce hallucinations when the question exceeds their internal knowledge boundaries.
Existing research on LLMs' perception of their knowledge boundaries typically uses either the probability of the generated tokens or the verbalized confidence as the model's confidence in its response.
arXiv Detail & Related papers (2024-08-19T08:01:11Z)
- LACIE: Listener-Aware Finetuning for Confidence Calibration in Large Language Models [69.68379406317682]
We introduce a listener-aware finetuning method (LACIE) to calibrate implicit and explicit confidence markers.
We show that LACIE models the listener, considering not only whether an answer is right but also whether it will be accepted by the listener.
We find that training with LACIE results in 47% fewer incorrect answers being accepted while maintaining the same level of acceptance for correct answers.
arXiv Detail & Related papers (2024-05-31T17:16:38Z)
- Aligning Language Models to Explicitly Handle Ambiguity [22.078095273053506]
We propose Alignment with Perceived Ambiguity (APA), a novel pipeline that aligns language models to deal with ambiguous queries.
Experimental results on question-answering datasets demonstrate that APA empowers LLMs to explicitly detect and manage ambiguous queries.
Our findings show that APA outperforms training with gold-standard labels, especially in out-of-distribution scenarios.
arXiv Detail & Related papers (2024-04-18T07:59:53Z)
- What Do Llamas Really Think? Revealing Preference Biases in Language Model Representations [62.91799637259657]
Do large language models (LLMs) exhibit sociodemographic biases, even when they decline to respond?
We study this research question by probing contextualized embeddings and exploring whether such biases are encoded in the models' latent representations.
We propose a logistic Bradley-Terry probe which predicts word-pair preferences of LLMs from the words' hidden vectors (a minimal sketch of this style of probe appears after this list).
arXiv Detail & Related papers (2023-11-30T18:53:13Z)
- The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets [6.732432949368421]
Large Language Models (LLMs) have impressive capabilities, but are prone to outputting falsehoods.
Recent work has developed techniques for inferring whether an LLM is telling the truth by training probes on the LLM's internal activations.
We present evidence that at sufficient scale, LLMs linearly represent the truth or falsehood of factual statements.
arXiv Detail & Related papers (2023-10-10T17:54:39Z) - LM vs LM: Detecting Factual Errors via Cross Examination [22.50837561382647]
We propose a factuality evaluation framework for language models (LMs).
Our key idea is that an incorrect claim is likely to result in inconsistency with other claims that the model generates.
We empirically evaluate our method on factual claims made by multiple recent LMs on four benchmarks.
arXiv Detail & Related papers (2023-05-22T17:42:14Z) - Large Language Models are Better Reasoners with Self-Verification [48.534270563880845]
Large language models (LLMs) have shown strong reasoning ability in several natural language processing tasks.
LLMs with chain of thought (CoT) prompting require multi-step prompting and multi-token prediction, which is highly sensitive to individual mistakes.
We propose and demonstrate that LLMs have similar self-verification abilities.
arXiv Detail & Related papers (2022-12-19T15:51:52Z) - How Can We Know When Language Models Know? On the Calibration of
Language Models for Question Answering [80.82194311274694]
We examine the question "how can we know when language models know, with confidence, the answer to a particular query?"
We examine three strong generative models -- T5, BART, and GPT-2 -- and study whether their probabilities on QA tasks are well calibrated.
We then examine methods to calibrate such models to make their confidence scores correlate better with the likelihood of correctness.
arXiv Detail & Related papers (2020-12-02T03:53:13Z) - Information-Theoretic Probing for Linguistic Structure [74.04862204427944]
We propose an information-theoretic operationalization of probing as estimating mutual information.
We evaluate on a set of ten typologically diverse languages often underrepresented in NLP research.
arXiv Detail & Related papers (2020-04-07T01:06:36Z)
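
For the "What Do Llamas Really Think?" entry above, here is a rough sketch of what a logistic Bradley-Terry probe over hidden vectors can look like: a linear score s(x) = w·h(x), with P(a preferred over b) = sigmoid(s(a) − s(b)). The parameterisation, training loop, and toy data are assumptions based on the standard Bradley-Terry model, not that paper's implementation.

```python
# Rough sketch of a logistic Bradley-Terry probe over hidden vectors:
# a linear score s(x) = w . h(x), with P(a preferred over b) = sigmoid(s(a) - s(b)).
# The architecture, training loop, and toy data below are assumptions based on
# the standard Bradley-Terry model, not the cited paper's implementation.
import torch
import torch.nn as nn

class BradleyTerryProbe(nn.Module):
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1, bias=False)  # linear scoring head

    def forward(self, h_a: torch.Tensor, h_b: torch.Tensor) -> torch.Tensor:
        # Probability that item a is preferred over item b.
        return torch.sigmoid(self.score(h_a) - self.score(h_b)).squeeze(-1)

def train_probe(h_a, h_b, prefers_a, hidden_dim, epochs=200, lr=1e-3):
    """h_a, h_b: [N, hidden_dim] hidden vectors; prefers_a: [N] 0/1 labels."""
    probe = BradleyTerryProbe(hidden_dim)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.BCELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(probe(h_a, h_b), prefers_a.float())
        loss.backward()
        opt.step()
    return probe

# Toy usage with random features standing in for LLM hidden states.
if __name__ == "__main__":
    d, n = 768, 256
    h_a, h_b = torch.randn(n, d), torch.randn(n, d)
    labels = torch.randint(0, 2, (n,))
    probe = train_probe(h_a, h_b, labels, hidden_dim=d)
    print(probe(h_a[:3], h_b[:3]))  # preference probabilities for three pairs
```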