CiteAudit: You Cited It, But Did You Read It? A Benchmark for Verifying Scientific References in the LLM Era
- URL: http://arxiv.org/abs/2602.23452v1
- Date: Thu, 26 Feb 2026 19:17:39 GMT
- Title: CiteAudit: You Cited It, But Did You Read It? A Benchmark for Verifying Scientific References in the LLM Era
- Authors: Zhengqing Yuan, Kaiwen Shi, Zheyuan Zhang, Lichao Sun, Nitesh V. Chawla, Yanfang Ye
- Abstract summary: Large language models (LLMs) introduce a new risk: fabricated references that appear plausible but correspond to no real publications. We present the first comprehensive benchmark and detection framework for hallucinated citations in scientific writing. Our framework significantly outperforms prior methods in both accuracy and interpretability.
- Score: 51.63024682584688
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scientific research relies on accurate citation for attribution and integrity, yet large language models (LLMs) introduce a new risk: fabricated references that appear plausible but correspond to no real publications. Such hallucinated citations have already been observed in submissions and accepted papers at major machine learning venues, exposing vulnerabilities in peer review. Meanwhile, rapidly growing reference lists make manual verification impractical, and existing automated tools remain fragile to noisy and heterogeneous citation formats and lack standardized evaluation. We present the first comprehensive benchmark and detection framework for hallucinated citations in scientific writing. Our multi-agent verification pipeline decomposes citation checking into claim extraction, evidence retrieval, passage matching, reasoning, and calibrated judgment to assess whether a cited source truly supports its claim. We construct a large-scale human-validated dataset across domains and define unified metrics for citation faithfulness and evidence alignment. Experiments with state-of-the-art LLMs reveal substantial citation errors and show that our framework significantly outperforms prior methods in both accuracy and interpretability. This work provides the first scalable infrastructure for auditing citations in the LLM era and practical tools to improve the trustworthiness of scientific references.
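The stages named in the abstract (claim extraction, evidence retrieval, passage matching, reasoning, calibrated judgment) can be illustrated with a toy sketch. This is not the authors' implementation: the in-memory `CORPUS`, the bracketed `[key]` citation syntax, the Jaccard token overlap used for passage matching, and the fixed `THRESHOLD` that stands in for both reasoning and calibrated judgment are all assumptions made for illustration. A real system would presumably query bibliographic databases and use LLM agents at each stage.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical in-memory corpus standing in for a real retrieval backend.
CORPUS = {
    "smith2020": [
        "We find that dropout reduces overfitting in small convolutional networks.",
        "Training time increased by roughly ten percent.",
    ],
}
THRESHOLD = 0.5  # Assumed cut-off; a fixed threshold is not true calibration.


@dataclass
class Verdict:
    claim: str
    citation_key: str
    best_passage: Optional[str]
    score: float
    label: str  # "supported", "unsupported", or "source_not_found"


def tokenize(text):
    """Lowercase alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def extract_claims(text):
    """Stage 1: pair each sentence with every [key] citation it carries."""
    pairs = []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        claim = re.sub(r"\s*\[[A-Za-z0-9_]+\]", "", sentence).strip()
        for key in re.findall(r"\[([A-Za-z0-9_]+)\]", sentence):
            pairs.append((claim, key))
    return pairs


def retrieve(key):
    """Stage 2: look up the cited source; None means no such publication."""
    return CORPUS.get(key)


def match_passage(claim, passages):
    """Stage 3: pick the passage with the highest Jaccard token overlap."""
    claim_tokens = tokenize(claim)
    best, best_score = None, 0.0
    for passage in passages:
        tokens = tokenize(passage)
        union = claim_tokens | tokens
        score = len(claim_tokens & tokens) / len(union) if union else 0.0
        if score > best_score:
            best, best_score = passage, score
    return best, best_score


def audit(text):
    """Stages 4-5 collapsed into a threshold rule over the matching score."""
    verdicts = []
    for claim, key in extract_claims(text):
        passages = retrieve(key)
        if passages is None:
            verdicts.append(Verdict(claim, key, None, 0.0, "source_not_found"))
            continue
        passage, score = match_passage(claim, passages)
        label = "supported" if score >= THRESHOLD else "unsupported"
        verdicts.append(Verdict(claim, key, passage, score, label))
    return verdicts
```

Calling `audit` on a text with one faithful citation, one citation to a key missing from `CORPUS`, and one citation whose source says something different yields one verdict of each label. Separating "source not found" from "source found but does not support the claim" mirrors the abstract's distinction between fabricated references and references that fail to support their claim.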
Related papers
- Multi-Sourced, Multi-Agent Evidence Retrieval for Fact-Checking [47.47518672198846]
Misinformation spreading over the Internet poses a significant threat to both societies and individuals. Previous methods rely on semantic and social-contextual patterns learned from training data. We propose WKGFC, which exploits authorized open knowledge graph as a core resource of evidence.
Date: 2026-02-27T19:29:01Z
- BibAgent: An Agentic Framework for Traceable Miscitation Detection in Scientific Literature [21.872874595027824]
BibAgent is a scalable, end-to-end agentic framework for automated citation verification. It integrates retrieval, reasoning, and adaptive evidence aggregation, applying strategies for accessible and paywalled sources. Our results demonstrate that BibAgent outperforms state-of-the-art Large Language Model (LLM) baselines in citation verification accuracy and interpretability.
Date: 2026-01-12T16:30:45Z
- SemanticCite: Citation Verification with AI-Powered Full-Text Analysis and Evidence-Based Reasoning [0.0]
We introduce SemanticCite, an AI-powered system that verifies citation accuracy through full-text source analysis. Our approach combines multiple retrieval methods with a four-class classification system that captures nuanced claim-source relationships. We contribute a comprehensive dataset of over 1,000 citations with detailed alignments, functional classifications, semantic annotations, and bibliometric metadata.
Date: 2025-11-20T10:05:21Z
- Citation Failure: Definition, Analysis and Efficient Mitigation [56.09968229868067]
Citations from LLM-based RAG systems are supposed to simplify response verification. This does not hold for citation failure, when a model generates a helpful response, but fails to cite complete evidence. We propose to disentangle this from response failure, where the response itself is flawed, and citing complete evidence is impossible.
Date: 2025-10-23T07:47:22Z
- VeriCite: Towards Reliable Citations in Retrieval-Augmented Generation via Rigorous Verification [107.75781898355562]
We introduce a novel framework, called VeriCite, designed to rigorously validate supporting evidence and enhance answer attribution. We conduct experiments across five open-source LLMs and four datasets, demonstrating that VeriCite can significantly improve citation quality while maintaining the correctness of the answers.
Date: 2025-10-13T13:38:54Z
- The Noisy Path from Source to Citation: Measuring How Scholars Engage with Past Research [20.649638393774048]
We introduce a computational pipeline to quantify citation fidelity at scale. Using full texts of papers, the pipeline identifies citations in citing papers and the corresponding claims in cited papers. Using a quasi-experiment, we establish the "telephone effect": when citing papers have low fidelity to the original claim, future papers that cite the citing paper and the original have lower fidelity to the original.
Date: 2025-02-27T22:47:03Z
- Ground Every Sentence: Improving Retrieval-Augmented LLMs with Interleaved Reference-Claim Generation [51.8188846284153]
Attributed Text Generation (ATG) is proposed to enhance credibility and verifiability in RAG systems. This paper proposes ReClaim, a fine-grained ATG method that alternates the generation of references and answers step by step. With extensive experiments, we verify the effectiveness of ReClaim in extensive settings, achieving a citation accuracy rate of 90%.
Date: 2024-07-01T20:47:47Z
- ALiiCE: Evaluating Positional Fine-grained Citation Generation [54.19617927314975]
We propose ALiiCE, the first automatic evaluation framework for fine-grained citation generation. Our framework first parses the sentence claim into atomic claims via dependency analysis and then calculates citation quality at the atomic claim level. We evaluate the positional fine-grained citation generation performance of several Large Language Models on two long-form QA datasets.
Date: 2024-06-19T09:16:14Z
- Attribution in Scientific Literature: New Benchmark and Methods [41.64918533152914]
Large language models (LLMs) present a promising yet challenging frontier for automated source citation in scientific communication. We introduce REASONS, a novel dataset with sentence-level annotations across 12 scientific domains from arXiv. We conduct extensive experiments with models such as GPT-O1, GPT-4O, GPT-3.5, DeepSeek, and other smaller models like Perplexity AI (7B).
Date: 2024-05-03T16:38:51Z
This list is automatically generated from the titles and abstracts of the papers on this site.