Generation-Time vs. Post-hoc Citation: A Holistic Evaluation of LLM Attribution
- URL: http://arxiv.org/abs/2509.21557v1
- Date: Thu, 25 Sep 2025 20:39:26 GMT
- Title: Generation-Time vs. Post-hoc Citation: A Holistic Evaluation of LLM Attribution
- Authors: Yash Saxena, Raviteja Bommireddy, Ankur Padia, Manas Gaur
- Abstract summary: Large Language Models (LLMs) must cite human-verifiable sources in high-stakes domains such as healthcare, law, academia, and finance. We introduce two paradigms: Generation-Time Citation (G-Cite), which produces the answer and citations in one pass, and Post-hoc Citation (P-Cite), which adds or verifies citations after drafting. Our results show a consistent trade-off between coverage and citation correctness, with retrieval as the main driver of attribution quality in both paradigms.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Trustworthy Large Language Models (LLMs) must cite human-verifiable sources in high-stakes domains such as healthcare, law, academia, and finance, where even small errors can have severe consequences. Practitioners and researchers face a choice: let models generate citations during decoding, or let models draft answers first and then attach appropriate citations. To clarify this choice, we introduce two paradigms: Generation-Time Citation (G-Cite), which produces the answer and citations in one pass, and Post-hoc Citation (P-Cite), which adds or verifies citations after drafting. We conduct a comprehensive evaluation from zero-shot to advanced retrieval-augmented methods across four popular attribution datasets and provide evidence-based recommendations that weigh trade-offs across use cases. Our results show a consistent trade-off between coverage and citation correctness, with retrieval as the main driver of attribution quality in both paradigms. P-Cite methods achieve high coverage with competitive correctness and moderate latency, whereas G-Cite methods prioritize precision at the cost of coverage and speed. We recommend a retrieval-centric, P-Cite-first approach for high-stakes applications, reserving G-Cite for precision-critical settings such as strict claim verification. Our code and human evaluation results are available at https://anonymous.4open.science/r/Citation_Paradigms-BBB5/
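The two paradigms the abstract contrasts can be sketched as minimal pipelines. This is an illustrative sketch, not the paper's code: the corpus, the word-overlap retriever, and the stand-ins for generation are all assumptions made up for the example.

```python
# Illustrative sketch of the two attribution paradigms over a toy in-memory
# corpus. `retrieve` is a hypothetical word-overlap retriever; conditioned
# generation and drafting are simulated by copying retrieved text.

CORPUS = {
    "doc1": "Aspirin reduces the risk of heart attack in some patients.",
    "doc2": "The capital of France is Paris.",
}

def retrieve(query, k=1):
    """Toy retriever: rank documents by word overlap with the query."""
    def overlap(text):
        return len(set(query.lower().split()) & set(text.lower().split()))
    ranked = sorted(CORPUS, key=lambda d: overlap(CORPUS[d]), reverse=True)
    return ranked[:k]

def g_cite(question):
    """Generation-Time Citation: answer and citations emerge in one pass.
    The 'model' is simulated by retrieving first and conditioning on it."""
    docs = retrieve(question)
    answer = CORPUS[docs[0]]  # stand-in for retrieval-conditioned generation
    return answer, docs       # citations are emitted together with the answer

def p_cite(question):
    """Post-hoc Citation: draft first, then attach citations per sentence."""
    draft = CORPUS[retrieve(question)[0]]  # stand-in for a free-form draft
    citations = []
    for sentence in draft.split(". "):
        # verify/attach evidence for each sentence after drafting
        citations.append((sentence, retrieve(sentence)))
    return draft, citations
```

Even in this toy form, the structural difference is visible: `g_cite` commits to citations during answer production, while `p_cite` can cover every drafted sentence with a citation, mirroring the coverage-vs-precision trade-off the paper reports.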
Related papers
- CiteAudit: You Cited It, But Did You Read It? A Benchmark for Verifying Scientific References in the LLM Era [51.63024682584688]
Large language models (LLMs) introduce a new risk: fabricated references that appear plausible but correspond to no real publications. We present the first comprehensive benchmark and detection framework for hallucinated citations in scientific writing. Our framework significantly outperforms prior methods in both accuracy and interpretability.
arXiv Detail & Related papers (2026-02-26T19:17:39Z)
- Citation Failure: Definition, Analysis and Efficient Mitigation [56.09968229868067]
Citations from LLM-based RAG systems are supposed to simplify response verification. This does not hold for citation failure, when a model generates a helpful response but fails to cite complete evidence. We propose to disentangle this from response failure, where the response itself is flawed and citing complete evidence is impossible.
arXiv Detail & Related papers (2025-10-23T07:47:22Z)
- VeriCite: Towards Reliable Citations in Retrieval-Augmented Generation via Rigorous Verification [107.75781898355562]
We introduce a novel framework, called VeriCite, designed to rigorously validate supporting evidence and enhance answer attribution. We conduct experiments across five open-source LLMs and four datasets, demonstrating that VeriCite can significantly improve citation quality while maintaining the correctness of the answers.
arXiv Detail & Related papers (2025-10-13T13:38:54Z)
- Concise and Sufficient Sub-Sentence Citations for Retrieval-Augmented Generation [28.229130944067787]
In RAG question answering systems, generating citations for large language model (LLM) outputs enhances verifiability and helps users identify potential hallucinations. First, the citations are typically provided at the sentence or even paragraph level. Second, sentence-level citations may omit information that is essential for verifying the output, forcing users to read the surrounding context. We propose generating sub-sentence citations that are both concise and sufficient, thereby reducing the effort required by users to confirm the correctness of the generated output.
arXiv Detail & Related papers (2025-09-25T07:50:30Z)
- SCIRGC: Multi-Granularity Citation Recommendation and Citation Sentence Preference Alignment [2.0383262889621867]
We propose the SciRGC framework, which aims to automatically recommend citation articles and generate citation sentences for citation locations within articles. The framework addresses two key challenges in academic citation generation: 1) how to accurately identify the author's citation intent and find relevant citation papers, and 2) how to generate high-quality citation sentences that align with human preferences.
arXiv Detail & Related papers (2025-05-26T15:09:10Z)
- SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models [51.90867482317985]
SelfCite is a self-supervised approach to generate fine-grained, sentence-level citations for statements in generated responses. The effectiveness of SelfCite is demonstrated by increasing citation F1 up to 5.3 points on the LongBench-Cite benchmark.
arXiv Detail & Related papers (2025-02-13T18:55:13Z)
- ALiiCE: Evaluating Positional Fine-grained Citation Generation [54.19617927314975]
We propose ALiiCE, the first automatic evaluation framework for fine-grained citation generation.
Our framework first parses the sentence claim into atomic claims via dependency analysis and then calculates citation quality at the atomic claim level.
We evaluate the positional fine-grained citation generation performance of several Large Language Models on two long-form QA datasets.
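The atomic-claim-level scoring that ALiiCE describes can be illustrated with a short sketch. Everything here is an assumption made for the example: the response is taken as already split into atomic claims paired with their cited passages, and `supports` is a toy substring check standing in for a real entailment model.

```python
# Minimal sketch of citation scoring at the atomic-claim level (illustrative,
# not ALiiCE's actual implementation).

def supports(passage, claim):
    """Toy entailment check standing in for an NLI model."""
    return claim.lower() in passage.lower()

def citation_scores(claims):
    """claims: list of (atomic_claim, [cited_passages]) pairs.
    Recall    = fraction of atomic claims backed by at least one citation.
    Precision = fraction of citations that back their claim."""
    if not claims:
        return 0.0, 0.0
    backed = sum(1 for c, ps in claims if any(supports(p, c) for p in ps))
    total_cites = sum(len(ps) for _, ps in claims)
    good_cites = sum(sum(supports(p, c) for p in ps) for c, ps in claims)
    recall = backed / len(claims)
    precision = good_cites / total_cites if total_cites else 0.0
    return precision, recall
```

Scoring per atomic claim rather than per sentence is the point of the design: a sentence mixing one supported and one unsupported claim is penalized instead of passing wholesale.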
arXiv Detail & Related papers (2024-06-19T09:16:14Z)
- Learning to Generate Answers with Citations via Factual Consistency Models [28.716998866121923]
Large Language Models (LLMs) frequently hallucinate, impeding their reliability in mission-critical situations.
This paper proposes a weakly-supervised fine-tuning method leveraging factual consistency models (FCMs).
Focused learning is integrated into the objective, directing the fine-tuning process to emphasise the factual unit tokens.
arXiv Detail & Related papers (2024-06-19T00:40:19Z)
- Towards generating citation sentences for multiple references with intent control [86.53829532976303]
We build a novel generation model with the Fusion-in-Decoder approach to cope with multiple long inputs.
Experiments demonstrate that the proposed approaches provide much more comprehensive features for generating citation sentences.
arXiv Detail & Related papers (2021-12-02T15:32:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.