GhostCite: A Large-Scale Analysis of Citation Validity in the Age of Large Language Models
- URL: http://arxiv.org/abs/2602.06718v1
- Date: Fri, 06 Feb 2026 14:08:34 GMT
- Title: GhostCite: A Large-Scale Analysis of Citation Validity in the Age of Large Language Models
- Authors: Zuyao Xu, Yuqi Qiu, Lu Sun, FaSheng Miao, Fubin Wu, Xinyi Wang, Xiang Li, Haozhe Lu, ZhengZe Zhang, Yuxin Hu, Jialu Li, Jin Luo, Feng Zhang, Rui Luo, Xinran Liu, Yingxian Li, Jiaji Liu,
- Abstract summary: Citations provide the basis for trusting scientific claims; when they are invalid or fabricated, this trust collapses. With the advent of Large Language Models (LLMs), this risk has intensified. We develop CiteVerifier, an open-source framework for large-scale citation verification.
- Score: 22.147294042024836
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Citations provide the basis for trusting scientific claims; when they are invalid or fabricated, this trust collapses. With the advent of Large Language Models (LLMs), this risk has intensified: LLMs are increasingly used for academic writing, yet their tendency to fabricate citations ("ghost citations") poses a systemic threat to citation validity. To quantify this threat and inform mitigation, we develop CiteVerifier, an open-source framework for large-scale citation verification, and conduct the first comprehensive study of citation validity in the LLM era through three experiments built on it. We benchmark 13 state-of-the-art LLMs on citation generation across 40 research domains, finding that all models hallucinate citations, at rates ranging from 14.23% to 94.93%, with significant variation across research domains. Moreover, we analyze 2.2 million citations from 56,381 papers published at top-tier AI/ML and Security venues (2020-2025), confirming that 1.07% of papers (604 papers) contain invalid or fabricated citations, with an 80.9% increase in 2025 alone. Furthermore, we survey 97 researchers and analyze 94 valid responses after removing 3 conflicting samples, revealing a critical "verification gap": 41.5% of researchers copy-paste BibTeX without checking and 44.4% choose no-action responses when encountering suspicious references; meanwhile, 76.7% of reviewers do not thoroughly check references and 80.0% never suspect fake citations. Our findings reveal an accelerating crisis in which unreliable AI tools, combined with inadequate human verification by researchers and insufficient peer-review scrutiny, allow fabricated citations to contaminate the scientific record. We propose interventions for researchers, venues, and tool developers to protect citation integrity.
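The core verification step the abstract describes, checking that a cited title resolves to a real publication, can be sketched against a public metadata API. The snippet below is a minimal illustration rather than the authors' CiteVerifier implementation: it queries the Crossref REST API by title and flags citations whose best match falls below a fuzzy-similarity cutoff. The `SIM_THRESHOLD` value and the `verify_citation` helper are assumptions made for the example.

```python
"""Minimal citation-existence check, illustrating the kind of lookup a
verifier like CiteVerifier might perform. Not the paper's code."""
import difflib

import requests

CROSSREF_API = "https://api.crossref.org/works"
SIM_THRESHOLD = 0.85  # arbitrary cutoff, chosen for illustration only


def verify_citation(title: str) -> dict:
    """Query Crossref by title and report the closest real publication."""
    resp = requests.get(
        CROSSREF_API,
        params={"query.bibliographic": title, "rows": 3},
        timeout=10,
    )
    resp.raise_for_status()
    items = resp.json()["message"]["items"]
    best_title, best_score = None, 0.0
    for item in items:
        for candidate in item.get("title", []):  # Crossref titles are lists
            score = difflib.SequenceMatcher(
                None, title.lower(), candidate.lower()
            ).ratio()
            if score > best_score:
                best_title, best_score = candidate, score
    return {
        "queried": title,
        "best_match": best_title,
        "similarity": round(best_score, 3),
        "likely_valid": best_score >= SIM_THRESHOLD,
    }


if __name__ == "__main__":
    print(verify_citation("Attention Is All You Need"))
```

A production verifier would also cross-check authors, venue, and year, and consult additional sources (DBLP, Semantic Scholar, arXiv) before declaring a citation fabricated.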
Related papers
- CiteAudit: You Cited It, But Did You Read It? A Benchmark for Verifying Scientific References in the LLM Era [51.63024682584688]
Large language models (LLMs) introduce a new risk: fabricated references that appear plausible but correspond to no real publications. We present the first comprehensive benchmark and detection framework for hallucinated citations in scientific writing. Our framework significantly outperforms prior methods in both accuracy and interpretability.
arXiv Detail & Related papers (2026-02-26T19:17:39Z)
- Compound Deception in Elite Peer Review: A Failure Mode Taxonomy of 100 Fabricated Citations at NeurIPS 2025 [0.0]
Large language models (LLMs) are increasingly used in academic writing, yet they frequently hallucinate by generating citations to sources that do not exist. This study analyzes 100 AI-generated hallucinated citations that appeared in papers accepted by the 2025 Conference on Neural Information Processing Systems. Despite review by 3-5 expert researchers per paper, these fabricated citations evaded detection, appearing in 53 published papers.
arXiv Detail & Related papers (2026-02-05T17:43:35Z)
- The 17% Gap: Quantifying Epistemic Decay in AI-Assisted Survey Papers [0.0]
"Hallucinated papers" are a known artifact, but the systematic degradation of valid citation chains remains unquantified.<n>We conducted a forensic audit of 50 recent survey papers in Artificial Intelligence published between September 2024 and January 2026.<n>We detect a persistent 17.0% Phantom Rate -- citations that cannot be resolved to any digital object despite aggressive forensic recovery.
arXiv Detail & Related papers (2026-01-24T12:00:55Z)
- AI Security Beyond Core Domains: Resume Screening as a Case Study of Adversarial Vulnerabilities in Specialized LLM Applications [71.27518152526686]
Large Language Models (LLMs) excel at text comprehension and generation, making them ideal for automated tasks like code review and content moderation. LLMs can be manipulated by "adversarial instructions" hidden in input data, such as resumes or code, causing them to deviate from their intended task. This paper introduces a benchmark to assess this vulnerability in resume screening, revealing attack success rates exceeding 80% for certain attack types.
arXiv Detail & Related papers (2025-12-23T08:42:09Z)
- DeepTRACE: Auditing Deep Research AI Systems for Tracking Reliability Across Citations and Evidence [50.97612134791782]
Generative search engines and deep research LLM agents promise trustworthy, source-grounded synthesis, yet users regularly encounter overconfidence, weak sourcing, and confusing citation practices. We introduce DeepTRACE, a novel sociotechnically grounded audit framework that turns prior community-identified failure cases into eight measurable dimensions spanning answer text, sources, and citations.
arXiv Detail & Related papers (2025-09-02T00:32:38Z)
- The Noisy Path from Source to Citation: Measuring How Scholars Engage with Past Research [20.649638393774048]
We introduce a computational pipeline to quantify citation fidelity at scale. Using full texts of papers, the pipeline identifies citations in citing papers and the corresponding claims in cited papers. Using a quasi-experiment, we establish the "telephone effect": when citing papers have low fidelity to the original claim, future papers that cite the citing paper and the original have lower fidelity to the original. A toy version of this fidelity comparison is sketched after this entry.
arXiv Detail & Related papers (2025-02-27T22:47:03Z)
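The fidelity comparison described above can be approximated, very roughly, with off-the-shelf text similarity. The sketch below scores a citing sentence against the cited claim using TF-IDF cosine similarity from scikit-learn; the `LOW_FIDELITY` cutoff and the toy sentences are invented for the example and do not come from the paper.

```python
"""Toy citation-fidelity score: similarity between a citing sentence and
the claim in the cited paper. An illustration, not the paper's pipeline."""
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

LOW_FIDELITY = 0.3  # illustrative cutoff, not taken from the paper


def fidelity(citing_sentence: str, cited_claim: str) -> float:
    """Cosine similarity between TF-IDF vectors of the two statements."""
    vectors = TfidfVectorizer().fit_transform([citing_sentence, cited_claim])
    return float(cosine_similarity(vectors[0], vectors[1])[0, 0])


if __name__ == "__main__":
    cited = "Our model improves accuracy by 2 points on one benchmark."
    citing = "Prior work proved LLMs always improve accuracy dramatically."
    score = fidelity(citing, cited)
    print(f"fidelity={score:.2f}, low_fidelity={score < LOW_FIDELITY}")
```

A real pipeline would compare against every candidate claim in the cited paper and use stronger semantic similarity than TF-IDF; this sketch only shows the shape of the comparison.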
- Automated Review Generation Method Based on Large Language Models [8.86304208754684]
We present an automated review generation method based on large language models (LLMs). Our method swiftly analyzed 343 articles, averaging only seconds per article per LLM account, and produced comprehensive reviews spanning 35 topics, with extended analysis of 1041 articles.
arXiv Detail & Related papers (2024-07-30T15:26:36Z)
- ALiiCE: Evaluating Positional Fine-grained Citation Generation [54.19617927314975]
We propose ALiiCE, the first automatic evaluation framework for fine-grained citation generation.
Our framework first parses the sentence claim into atomic claims via dependency analysis and then calculates citation quality at the atomic claim level; a rough sketch of such a split appears after this entry.
We evaluate the positional fine-grained citation generation performance of several Large Language Models on two long-form QA datasets.
arXiv Detail & Related papers (2024-06-19T09:16:14Z)
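ALiiCE's actual parser is not reproduced here. As a rough illustration of splitting a sentence into atomic claims via dependency analysis, the sketch below uses spaCy and treats the root verb and any conjoined verbs as claim heads; the heuristic and the example sentence are assumptions made for demonstration.

```python
"""Rough atomic-claim split via dependency parsing, loosely in the spirit
of ALiiCE's first step. Illustrative only; not the ALiiCE implementation.
Requires: pip install spacy && python -m spacy download en_core_web_sm"""
import spacy

nlp = spacy.load("en_core_web_sm")


def atomic_claims(sentence: str) -> list[str]:
    """Return clause-like spans rooted at the sentence root and at
    conjoined verbs in the dependency parse."""
    doc = nlp(sentence)
    claims = []
    for token in doc:
        if token.dep_ in ("ROOT", "conj") and token.pos_ in ("VERB", "AUX"):
            # Subtree span headed by this verb. Note: the root's span may
            # subsume conjuncts; a real splitter would prune the overlap.
            span = doc[token.left_edge.i : token.right_edge.i + 1]
            claims.append(span.text)
    return claims


if __name__ == "__main__":
    s = "The method improves accuracy [1] and it reduces latency [2]."
    print(atomic_claims(s))
```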
- Attribution in Scientific Literature: New Benchmark and Methods [41.64918533152914]
Large language models (LLMs) present a promising yet challenging frontier for automated source citation in scientific communication. We introduce REASONS, a novel dataset with sentence-level annotations across 12 scientific domains from arXiv. We conduct extensive experiments with models such as GPT-o1, GPT-4o, GPT-3.5, DeepSeek, and smaller models like Perplexity AI (7B).
arXiv Detail & Related papers (2024-05-03T16:38:51Z)
- Position: AI/ML Influencers Have a Place in the Academic Process [82.2069685579588]
We investigate the role of social media influencers in enhancing the visibility of machine learning research.
We have compiled a comprehensive dataset of over 8,000 papers, spanning tweets from December 2018 to October 2023.
Our statistical and causal inference analysis reveals a significant increase in citations for papers endorsed by these influencers.
arXiv Detail & Related papers (2024-01-24T20:05:49Z)
- Deep Graph Learning for Anomalous Citation Detection [55.81334139806342]
We propose a novel deep graph learning model, namely GLAD (Graph Learning for Anomaly Detection), to identify anomalies in citation networks; a toy reconstruction-based sketch follows this entry.
Within the GLAD framework, we propose an algorithm called CPU (Citation PUrpose) to discover the purpose of a citation based on citation texts.
arXiv Detail & Related papers (2022-02-23T09:05:28Z)
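GLAD combines several components (including the CPU citation-purpose algorithm) that are not reproduced here. As a hedged sketch of the general idea, graph-based anomaly detection on a citation network, the snippet below trains a tiny GCN autoencoder with PyTorch Geometric and scores nodes by reconstruction error; the toy graph, features, and hyperparameters are all invented for the example.

```python
"""Minimal graph-based anomaly scoring on a toy citation network, in the
spirit of (but far simpler than) GLAD. Requires torch and torch_geometric."""
import torch
from torch_geometric.nn import GCNConv


class TinyGCNAutoencoder(torch.nn.Module):
    """Encode node features over the citation graph, then reconstruct them;
    a high reconstruction error marks a node as anomalous."""

    def __init__(self, dim: int, hidden: int = 8):
        super().__init__()
        self.enc = GCNConv(dim, hidden)
        self.dec = GCNConv(hidden, dim)

    def forward(self, x, edge_index):
        h = torch.relu(self.enc(x, edge_index))
        return self.dec(h, edge_index)


# Toy graph: 5 papers, directed citation edges; node 4 has off-pattern features.
edge_index = torch.tensor([[0, 1, 1, 2, 3], [1, 0, 2, 3, 4]], dtype=torch.long)
x = torch.eye(5)
x[4] = torch.tensor([1.0, 1.0, 1.0, 1.0, 5.0])

model = TinyGCNAutoencoder(dim=5)
opt = torch.optim.Adam(model.parameters(), lr=0.01)
for _ in range(200):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x, edge_index), x)
    loss.backward()
    opt.step()

scores = ((model(x, edge_index) - x) ** 2).mean(dim=1)  # per-node error
print("anomaly scores:", scores.detach())
```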