Not Wrong, But Untrue: LLM Overconfidence in Document-Based Queries
- URL: http://arxiv.org/abs/2509.25498v1
- Date: Mon, 29 Sep 2025 20:55:43 GMT
- Title: Not Wrong, But Untrue: LLM Overconfidence in Document-Based Queries
- Authors: Nick Hagar, Wilma Agustianto, Nicholas Diakopoulos
- Abstract summary: Large language models (LLMs) are increasingly used in newsrooms. Their tendency to hallucinate poses risks to core journalistic practices of sourcing, attribution, and accuracy. We evaluate three widely used tools - ChatGPT, Gemini, and NotebookLM.
- Score: 2.853035319109148
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are increasingly used in newsroom workflows, but their tendency to hallucinate poses risks to core journalistic practices of sourcing, attribution, and accuracy. We evaluate three widely used tools - ChatGPT, Gemini, and NotebookLM - on a reporting-style task grounded in a 300-document corpus related to TikTok litigation and policy in the U.S. We vary prompt specificity and context size and annotate sentence-level outputs using a taxonomy to measure hallucination type and severity. Across our sample, 30% of model outputs contained at least one hallucination, with rates approximately three times higher for Gemini and ChatGPT (40%) than for NotebookLM (13%). Qualitatively, most errors did not involve invented entities or numbers; instead, we observed interpretive overconfidence - models added unsupported characterizations of sources and transformed attributed opinions into general statements. These patterns reveal a fundamental epistemological mismatch: While journalism requires explicit sourcing for every claim, LLMs generate authoritative-sounding text regardless of evidentiary support. We propose journalism-specific extensions to existing hallucination taxonomies and argue that effective newsroom tools need architectures that enforce accurate attribution rather than optimize for fluency.
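To make the measurement concrete, here is a minimal sketch of how sentence-level annotations like those described above could be aggregated into per-tool hallucination rates. The field names, the label set, and the rule that an output counts as hallucinated if any of its sentences is flagged are illustrative assumptions, not the paper's actual annotation schema.

```python
# Minimal sketch: aggregating sentence-level hallucination annotations into
# per-tool rates, in the spirit of the evaluation described above.
# Assumptions (not from the paper): field names, label values, and the rule
# that an output counts as hallucinated if any sentence is flagged.
from collections import defaultdict

# Each record is one annotated sentence from one model output.
annotations = [
    {"tool": "ChatGPT", "output_id": 1, "label": "unsupported_characterization"},
    {"tool": "ChatGPT", "output_id": 1, "label": "accurate"},
    {"tool": "NotebookLM", "output_id": 2, "label": "accurate"},
    {"tool": "Gemini", "output_id": 3, "label": "opinion_as_fact"},
]

HALLUCINATION_LABELS = {"unsupported_characterization", "opinion_as_fact",
                        "invented_entity", "invented_number"}

def hallucination_rates(records):
    """Share of outputs per tool containing at least one flagged sentence."""
    flagged = defaultdict(set)   # tool -> output_ids with >=1 flagged sentence
    outputs = defaultdict(set)   # tool -> all output_ids seen
    for r in records:
        outputs[r["tool"]].add(r["output_id"])
        if r["label"] in HALLUCINATION_LABELS:
            flagged[r["tool"]].add(r["output_id"])
    return {tool: len(flagged[tool]) / len(ids) for tool, ids in outputs.items()}

print(hallucination_rates(annotations))
# e.g. {'ChatGPT': 1.0, 'NotebookLM': 0.0, 'Gemini': 1.0} on this toy data
```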
Related papers
- CiteAudit: You Cited It, But Did You Read It? A Benchmark for Verifying Scientific References in the LLM Era [51.63024682584688]
Large language models (LLMs) introduce a new risk: fabricated references that appear plausible but correspond to no real publications. We present the first comprehensive benchmark and detection framework for hallucinated citations in scientific writing. Our framework significantly outperforms prior methods in both accuracy and interpretability.
arXiv Detail & Related papers (2026-02-26T19:17:39Z) - Scale Can't Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning [79.95774256444956]
The lack of reasoning capabilities in Vision-Language Models has remained at the forefront of research discourse. We investigate the data underlying the popular VLMs OpenCLIP, LLaVA-1.5 and Molmo through the lens of theories from pragmatics.
arXiv Detail & Related papers (2026-02-26T18:54:06Z) - Can LLMs extract human-like fine-grained evidence for evidence-based fact-checking? [0.0]
This paper focuses on fine-grained evidence extraction for Czech and Slovak claims. We create a new dataset containing two-way annotated fine-grained evidence produced by paid annotators. We evaluate large language models (LLMs) on this dataset to assess their alignment with human annotations.
arXiv Detail & Related papers (2025-11-26T13:51:59Z) - On-Premise AI for the Newsroom: Evaluating Small Language Models for Investigative Document Search [2.853035319109148]
Large language models (LLMs) with retrieval-augmented generation (RAG) capabilities promise to accelerate the process of document discovery. We present a journalist-centered approach to search that prioritizes transparency and editorial control through a five-stage pipeline. We evaluate three quantized models (Gemma 3 12B, Qwen 3 14B, and GPT-OSS 20B) on two corpora and find substantial variation in reliability.
arXiv Detail & Related papers (2025-09-29T20:50:40Z) - Large Language Model Hacking: Quantifying the Hidden Risks of Using LLMs for Text Annotation [66.84286617519258]
Large language models are transforming social science research by enabling the automation of labor-intensive tasks like data annotation and text analysis. But variation in model and prompt choices can introduce systematic biases and random errors, which propagate to downstream analyses and cause Type I (false positive), Type II (false negative), Type S (wrong sign), or Type M (exaggerated effect) errors. We find that intentional LLM hacking is strikingly simple: by replicating 37 data annotation tasks from 21 published social science studies, we show that, with just a handful of prompt paraphrases, virtually anything can be presented as statistically significant.
arXiv Detail & Related papers (2025-09-10T17:58:53Z) - Evaluating Large Language Models as Expert Annotators [17.06186816803593]
This paper investigates whether top-performing language models can serve as direct alternatives to human expert annotators. We evaluate both individual LLMs and multi-agent approaches across three highly specialized domains: finance, biomedicine, and law. Our empirical results reveal that individual LLMs equipped with inference-time techniques show only marginal or even negative performance gains.
arXiv Detail & Related papers (2025-08-11T10:19:10Z) - The Medium Is Not the Message: Deconfounding Document Embeddings via Linear Concept Erasure [98.71456610527598]
Embedding-based similarity metrics can be influenced by spurious attributes like the text's source or language. This paper shows that a debiasing algorithm that removes information about observed confounders from the encoder representations substantially reduces these biases at minimal computational cost (a toy sketch of such a projection follows this entry).
arXiv Detail & Related papers (2025-07-01T23:17:12Z) - Profiling News Media for Factuality and Bias Using LLMs and the Fact-Checking Methodology of Human Experts [29.95198868148809]
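As referenced above, here is a minimal sketch of one simple form of linear concept erasure: fit a linear map from embeddings to a one-hot confounder label and project the embeddings onto the orthogonal complement of that map's weight directions. This is an illustration of the general idea, not the paper's exact algorithm, and all names and shapes are assumptions.

```python
# Minimal sketch of linear concept erasure (illustrative, not the paper's method).
import numpy as np

def erase_confounder(X, confounder_ids):
    """X: (n, d) embeddings; confounder_ids: (n,) integer confounder labels."""
    n, d = X.shape
    k = confounder_ids.max() + 1
    Z = np.eye(k)[confounder_ids]                 # (n, k) one-hot confounder
    W, *_ = np.linalg.lstsq(X, Z, rcond=None)     # (d, k) confounder directions
    P = W @ np.linalg.pinv(W)                     # projector onto span(W)
    return X - X @ P                              # remove the confounder subspace

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))                    # toy embeddings
source = rng.integers(0, 3, size=100)             # e.g., 3 document sources
X_clean = erase_confounder(X, source)             # source no longer linearly decodable
```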
We propose a novel methodology that emulates the criteria that professional fact-checkers use to assess the factuality and political bias of an entire outlet. We provide an in-depth error analysis of the effect of media popularity and region on model performance.
arXiv Detail & Related papers (2025-06-14T15:49:20Z) - From Small to Large Language Models: Revisiting the Federalist Papers [0.0]
We review some of the more popular Large Language Model (LLM) tools and examine them from a statistical point of view in the context of text classification. We investigate whether, without any attempt to fine-tune, the general embedding constructs can be useful for stylometry and attribution.
arXiv Detail & Related papers (2025-02-25T21:50:46Z) - Idiosyncrasies in Large Language Models [54.26923012617675]
We unveil and study idiosyncrasies in Large Language Models (LLMs). We find that fine-tuning text embedding models on LLM-generated texts yields excellent classification accuracy. We leverage LLMs as judges to generate detailed, open-ended descriptions of each model's idiosyncrasies.
arXiv Detail & Related papers (2025-02-17T18:59:02Z) - White Men Lead, Black Women Help? Benchmarking and Mitigating Language Agency Social Biases in LLMs [58.27353205269664]
Social biases can manifest in language agency in Large Language Model (LLM)-generated content. We introduce the Language Agency Bias Evaluation (LABE) benchmark, which comprehensively evaluates biases in LLMs. Using LABE, we unveil language agency social biases in 3 recent LLMs: ChatGPT, Llama3, and Mistral.
arXiv Detail & Related papers (2024-04-16T12:27:54Z) - FABLES: Evaluating faithfulness and content selection in book-length summarization [55.50680057160788]
In this paper, we conduct the first large-scale human evaluation of faithfulness and content selection on book-length documents.
We collect FABLES, a dataset of annotations on 3,158 claims made in LLM-generated summaries of 26 books, at a cost of $5.2K USD.
An analysis of the annotations reveals that most unfaithful claims relate to events and character states, and they generally require indirect reasoning over the narrative to invalidate.
arXiv Detail & Related papers (2024-04-01T17:33:38Z) - "Knowing When You Don't Know": A Multilingual Relevance Assessment Dataset for Robust Retrieval-Augmented Generation [90.09260023184932]
Retrieval-Augmented Generation (RAG) grounds Large Language Model (LLM) output by leveraging external knowledge sources to reduce factual hallucinations.
NoMIRACL is a human-annotated dataset for evaluating LLM robustness in RAG across 18 typologically diverse languages.
We measure relevance assessment using: (i) hallucination rate, the model's tendency to produce an answer when no supporting passage exists in the non-relevant subset, and (ii) error rate, the model's failure to recognize relevant passages in the relevant subset (a minimal sketch of both computations follows this entry).
arXiv Detail & Related papers (2023-12-18T17:18:04Z)
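Here is a minimal sketch of how these two rates could be computed from per-query model judgments. The field names and the answered/abstained encoding are assumptions for illustration, not NoMIRACL's released evaluation code.

```python
# Minimal sketch of the two NoMIRACL-style metrics described above.
# Assumptions (illustrative only): each record carries the subset it came from
# and whether the model answered or abstained ("I don't know").
def nomiracl_rates(records):
    """records: dicts with 'subset' ('relevant'|'non_relevant') and 'answered' (bool)."""
    non_rel = [r for r in records if r["subset"] == "non_relevant"]
    rel = [r for r in records if r["subset"] == "relevant"]
    # Hallucination rate: model answers although no passage supports an answer.
    hallucination_rate = sum(r["answered"] for r in non_rel) / max(len(non_rel), 1)
    # Error rate: model fails to answer although a relevant passage is present.
    error_rate = sum(not r["answered"] for r in rel) / max(len(rel), 1)
    return hallucination_rate, error_rate

example = [
    {"subset": "non_relevant", "answered": True},   # hallucinated
    {"subset": "non_relevant", "answered": False},  # correctly abstained
    {"subset": "relevant", "answered": True},       # correctly answered
    {"subset": "relevant", "answered": False},      # missed relevant passage
]
print(nomiracl_rates(example))  # (0.5, 0.5)
```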