LM vs LM: Detecting Factual Errors via Cross Examination
- URL: http://arxiv.org/abs/2305.13281v1
- Date: Mon, 22 May 2023 17:42:14 GMT
- Title: LM vs LM: Detecting Factual Errors via Cross Examination
- Authors: Roi Cohen, May Hamri, Mor Geva, Amir Globerson
- Abstract summary: We propose a factuality evaluation framework for language models (LMs).
Our key idea is that an incorrect claim is likely to result in inconsistency with other claims that the model generates.
We empirically evaluate our method on factual claims made by multiple recent LMs on four benchmarks.
- Score: 22.50837561382647
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A prominent weakness of modern language models (LMs) is their tendency to
generate factually incorrect text, which hinders their usability. A natural
question is whether such factual errors can be detected automatically. Inspired
by truth-seeking mechanisms in law, we propose a factuality evaluation
framework for LMs that is based on cross-examination. Our key idea is that an
incorrect claim is likely to result in inconsistency with other claims that the
model generates. To discover such inconsistencies, we facilitate a multi-turn
interaction between the LM that generated the claim and another LM (acting as
an examiner) which introduces questions to discover inconsistencies. We
empirically evaluate our method on factual claims made by multiple recent LMs
on four benchmarks, finding that it outperforms existing methods and baselines,
often by a large gap. Our results demonstrate the potential of using
interacting LMs for capturing factual errors.
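The cross-examination protocol described above can be pictured as a simple loop. Below is a minimal sketch of that loop, not the authors' implementation: `examinee` and `examiner` are placeholder callables standing in for any chat-style LM, and the prompts are illustrative rather than the prompts used in the paper.

```python
# A minimal sketch of a cross-examination loop (assumed interface, not the
# authors' implementation): `examinee` generated the claim, `examiner` probes it.
from typing import Callable, List

LM = Callable[[str], str]  # prompt in, text out; e.g. a wrapper around any chat LM


def cross_examine(claim: str, examinee: LM, examiner: LM, max_turns: int = 3) -> bool:
    """Return True if the examiner judges the claim consistent (likely correct)."""
    transcript: List[str] = [f"Claim under examination: {claim}"]

    for _ in range(max_turns):
        # The examiner reads the transcript and asks one probing follow-up question.
        question = examiner(
            "You are cross-examining another model about a claim.\n"
            + "\n".join(transcript)
            + "\nAsk one short question that could expose an inconsistency."
        )
        # The examinee answers on its own, without seeing the examiner's intent.
        answer = examinee(question)
        transcript += [f"Q: {question}", f"A: {answer}"]

    # Final verdict: does the exchange stay consistent with the original claim?
    verdict = examiner(
        "\n".join(transcript)
        + "\nBased on this exchange, is the original claim factually correct? Answer yes or no."
    )
    return verdict.strip().lower().startswith("yes")
```

The fixed number of turns and the yes/no verdict prompt are simplifications of the multi-turn interaction the abstract describes.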
Related papers
- LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations [46.351064535592336]
Large language models (LLMs) often produce errors, including factual inaccuracies, biases, and reasoning failures.
Recent studies have demonstrated that LLMs' internal states encode information regarding the truthfulness of their outputs.
We show that the internal representations of LLMs encode much more information about truthfulness than previously recognized.
arXiv Detail & Related papers (2024-10-03T17:31:31Z)
- CaLM: Contrasting Large and Small Language Models to Verify Grounded Generation [76.31621715032558]
Grounded generation aims to equip language models (LMs) with the ability to produce more credible and accountable responses.
We introduce CaLM, a novel verification framework.
Our framework empowers smaller LMs, which rely less on parametric memory, to validate the output of larger LMs.
arXiv Detail & Related papers (2024-06-08T06:04:55Z)
- Relying on the Unreliable: The Impact of Language Models' Reluctance to Express Uncertainty [53.336235704123915]
We investigate how LMs incorporate confidence in responses via natural language and how downstream users behave in response to LM-articulated uncertainties.
We find that LMs are reluctant to express uncertainties when answering questions even when they produce incorrect responses.
We test the risks of LM overconfidence by conducting human experiments and show that users rely heavily on LM generations.
Lastly, we investigate the preference-annotated datasets used in post-training alignment and find that humans are biased against texts that express uncertainty.
arXiv Detail & Related papers (2024-01-12T18:03:30Z)
- Eliciting Latent Knowledge from Quirky Language Models [1.8035046415192353]
Eliciting Latent Knowledge aims to find patterns in a capable neural network's activations that robustly track the true state of the world.
We introduce 12 datasets and a suite of "quirky" language models (LMs) that are finetuned to make systematic errors when answering questions.
We find that, especially in middle layers, linear probes usually report an LM's knowledge independently of what the LM outputs.
arXiv Detail & Related papers (2023-12-02T05:47:22Z)
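The probing entries above ("LLMs Know More Than They Show" and "Eliciting Latent Knowledge from Quirky Language Models") both rest on training lightweight probes over internal activations. The sketch below illustrates only the general idea, assuming GPT-2 and a hand-made toy label set; it is not the models, datasets, or probing code used in either paper.

```python
# A minimal sketch of linear probing on middle-layer activations (illustrative
# only): GPT-2 hidden states plus a logistic-regression probe over a toy,
# hand-labelled set of statements.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")
model.eval()

# Toy labelled statements (1 = true, 0 = false), purely for illustration.
statements = [
    ("Paris is the capital of France.", 1),
    ("Berlin is the capital of France.", 0),
    ("Water boils at 100 degrees Celsius at sea level.", 1),
    ("The Sun orbits the Earth.", 0),
]


def middle_layer_state(text: str) -> torch.Tensor:
    """Hidden state of the last token at a middle layer of the model."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, output_hidden_states=True)
    layer = len(out.hidden_states) // 2  # pick a middle layer
    return out.hidden_states[layer][0, -1]


X = torch.stack([middle_layer_state(s) for s, _ in statements]).numpy()
y = [label for _, label in statements]

probe = LogisticRegression(max_iter=1000).fit(X, y)
print(probe.predict(X))  # with real data, evaluate on held-out statements instead
```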
- Knowledge-Augmented Language Model Verification [68.6099592486075]
Recent Language Models (LMs) have shown impressive capabilities in generating text using the knowledge internalized in their parameters.
We propose to verify both the retrieved knowledge and the output of knowledge-augmented LMs with a separate verifier.
Our results show that the proposed verifier effectively identifies retrieval and generation errors, allowing LMs to provide more factually correct outputs.
arXiv Detail & Related papers (2023-10-19T15:40:00Z)
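A minimal sketch of the separate-verifier idea from the "Knowledge-Augmented Language Model Verification" entry above: one yes/no check for retrieval errors and one for generation errors. The `verifier` callable and the prompts are assumed placeholders, not the paper's trained verifier.

```python
# A minimal sketch of a separate verifier for a knowledge-augmented LM (assumed
# interface, not the paper's trained verifier): one check for retrieval errors,
# one for generation errors.
from typing import Callable

LM = Callable[[str], str]  # prompt in, text out


def verify(question: str, retrieved: str, answer: str, verifier: LM) -> bool:
    """True only if the retrieved passage is relevant and the answer is grounded in it."""
    retrieval_ok = verifier(
        f"Question: {question}\nRetrieved passage: {retrieved}\n"
        "Is the passage relevant to the question? Answer yes or no."
    ).strip().lower().startswith("yes")

    generation_ok = verifier(
        f"Passage: {retrieved}\nAnswer: {answer}\n"
        "Is every claim in the answer supported by the passage? Answer yes or no."
    ).strip().lower().startswith("yes")

    return retrieval_ok and generation_ok
```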
- MAF: Multi-Aspect Feedback for Improving Reasoning in Large Language Models [64.70153487607172]
Language Models (LMs) have shown impressive performance in various natural language tasks.
When it comes to natural language reasoning, LMs still face challenges such as hallucination, generating incorrect intermediate reasoning steps, and making mathematical errors.
Recent research has focused on enhancing LMs through self-improvement using feedback.
In this work, we propose Multi-Aspect Feedback, an iterative refinement framework that integrates multiple feedback modules, including frozen LMs and external tools, each focusing on a specific error category.
arXiv Detail & Related papers (2023-10-19T02:32:39Z)
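A minimal sketch of an iterative, multi-aspect feedback loop in the spirit of the MAF entry above: each feedback module targets one error category and the generator revises its answer with the collected feedback. The module interface and prompts are illustrative assumptions, not the authors' framework.

```python
# A minimal sketch of iterative, multi-aspect refinement (assumed interface, not
# the authors' framework): each feedback module targets one error category and
# the generator revises its answer using the collected feedback.
from typing import Callable, Dict

Generate = Callable[[str], str]        # prompt in, text out
Feedback = Callable[[str, str], str]   # (question, answer) -> feedback text


def refine(question: str,
           generator: Generate,
           feedback_modules: Dict[str, Feedback],
           iterations: int = 2) -> str:
    answer = generator(question)
    for _ in range(iterations):
        # Collect one piece of feedback per aspect, e.g. "math", "factuality", "logic".
        feedback = {aspect: module(question, answer)
                    for aspect, module in feedback_modules.items()}
        # Ask the generator to revise the answer using all feedback at once.
        prompt = (
            f"Question: {question}\nCurrent answer: {answer}\n"
            + "\n".join(f"[{aspect}] {text}" for aspect, text in feedback.items())
            + "\nRevise the answer to address the feedback above."
        )
        answer = generator(prompt)
    return answer
```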
- FactCHD: Benchmarking Fact-Conflicting Hallucination Detection [64.4610684475899]
FactCHD is a benchmark designed for the detection of fact-conflicting hallucinations from LLMs.
FactCHD features a diverse dataset that spans various factuality patterns, including vanilla, multi-hop, comparison, and set operation.
We introduce Truth-Triangulator, which synthesizes reflective considerations from a tool-enhanced ChatGPT and a LoRA-tuned Llama2.
arXiv Detail & Related papers (2023-10-18T16:27:49Z)
- FELM: Benchmarking Factuality Evaluation of Large Language Models [40.78878196872095]
We introduce a benchmark for Factuality Evaluation of large Language Models, referred to as felm.
We collect responses generated from large language models and annotate factuality labels in a fine-grained manner.
Our findings reveal that while retrieval aids factuality evaluation, current LLMs still fall far short of faithfully detecting factual errors.
arXiv Detail & Related papers (2023-10-01T17:37:31Z)
- Attention Satisfies: A Constraint-Satisfaction Lens on Factual Errors of Language Models [38.79074982172423]
We investigate the internal behavior of Transformer-based Large Language Models (LLMs) when they generate factually incorrect text.
We propose modeling factual queries as constraint satisfaction problems.
We find a strong positive relationship between the LLM's attention to constraint tokens and the factual accuracy of generations.
arXiv Detail & Related papers (2023-09-26T17:48:55Z)
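A minimal sketch of the attention-based signal described in the "Attention Satisfies" entry above: measure how much attention the final position pays to the tokens of the query's constraining entity, on the view (per the entry) that low attention to constraint tokens correlates with factual errors. The choice of GPT-2, the averaging over layers and heads, and the example prompt are illustrative assumptions, not the paper's exact setup.

```python
# A minimal sketch (not the paper's code): attention mass from the last position
# to the constraint tokens of a factual query, averaged over layers and heads.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The capital of Australia is"  # illustrative factual query
constraint = "Australia"                # the entity that constrains the answer

enc = tokenizer(prompt, return_tensors="pt", return_offsets_mapping=True)
offsets = enc.pop("offset_mapping")[0].tolist()
start = prompt.index(constraint)
end = start + len(constraint)
constraint_positions = [i for i, (s, e) in enumerate(offsets) if s < end and e > start]

with torch.no_grad():
    out = model(**enc, output_attentions=True)

# out.attentions: one tensor per layer, each of shape (batch, heads, seq, seq).
att = torch.stack(out.attentions)  # (layers, 1, heads, seq, seq)
score = att[:, 0, :, -1, constraint_positions].mean().item()
print(f"mean attention from last token to constraint tokens: {score:.4f}")
```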
- LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for creating inconsistency-detection benchmarks and implement it in SummEdits, a 10-domain benchmark.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z)