CiteEval: Principle-Driven Citation Evaluation for Source Attribution
- URL: http://arxiv.org/abs/2506.01829v1
- Date: Mon, 02 Jun 2025 16:15:34 GMT
- Title: CiteEval: Principle-Driven Citation Evaluation for Source Attribution
- Authors: Yumo Xu, Peng Qi, Jifan Chen, Kunlun Liu, Rujun Han, Lan Liu, Bonan Min, Vittorio Castelli, Arshit Gupta, Zhiguo Wang
- Abstract summary: CiteEval is a principle-driven citation evaluation framework focused on fine-grained citation assessment. CiteBench is a multi-domain benchmark with high-quality human annotations of citation quality. CiteEval-Auto is a suite of model-based metrics that correlate strongly with human judgments.
- Score: 38.24323805177938
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Citation quality is crucial in information-seeking systems, directly influencing trust and the effectiveness of information access. Current evaluation frameworks, both human and automatic, mainly rely on Natural Language Inference (NLI) to assess binary or ternary supportiveness from cited sources, which we argue is a suboptimal proxy for citation evaluation. In this work, we introduce CiteEval, a principle-driven citation evaluation framework focused on fine-grained citation assessment within a broad context, encompassing not only the cited sources but also the full retrieval context, user query, and generated text. Guided by the proposed framework, we construct CiteBench, a multi-domain benchmark with high-quality human annotations on citation quality. To enable efficient evaluation, we further develop CiteEval-Auto, a suite of model-based metrics that exhibit strong correlation with human judgments. Experiments across diverse systems demonstrate CiteEval-Auto's superior ability to capture the multifaceted nature of citations compared to existing metrics, offering a principled and scalable approach to evaluating and improving model-generated citations.
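For context, the NLI-style check that the abstract identifies as a suboptimal proxy can be sketched in a few lines. The model choice (roberta-large-mnli) and the label-to-support mapping below are illustrative assumptions, not part of CiteEval:

```python
# Sketch of the standard NLI-based citation check: score each
# (cited source, generated statement) pair as entailment / neutral /
# contradiction, ignoring the wider retrieval context and user query.
from transformers import pipeline

# Illustrative NLI checkpoint; any MNLI-style classifier works similarly.
nli = pipeline("text-classification", model="roberta-large-mnli")

def nli_supportiveness(cited_source: str, statement: str) -> str:
    """Map an NLI verdict onto a ternary support label."""
    result = nli([{"text": cited_source, "text_pair": statement}])[0]
    return {
        "ENTAILMENT": "supported",
        "NEUTRAL": "partially supported",
        "CONTRADICTION": "unsupported",
    }[result["label"].upper()]

print(nli_supportiveness(
    "The Eiffel Tower is 330 metres tall.",
    "The Eiffel Tower stands over 300 metres high.",
))
```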
Related papers
- SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models [51.90867482317985]
SelfCite is a self-supervised approach to generating fine-grained, sentence-level citations for statements in generated responses. Its effectiveness is demonstrated by a gain of up to 5.3 points in citation F1 on the LongBench-Cite benchmark.
arXiv Detail & Related papers (2025-02-13T18:55:13Z) - A Comparative Analysis of Faithfulness Metrics and Humans in Citation Evaluation [22.041561519672456]
Large language models (LLMs) often generate unsupported or unverifiable content, known as "hallucinations".
We propose a comparative evaluation framework that assesses how effectively metrics distinguish citations across three categories of support.
Our results indicate that no single metric consistently excels across all evaluations, highlighting the difficulty of accurately evaluating fine-grained support levels.
arXiv Detail & Related papers (2024-08-22T13:44:31Z) - Towards Fine-Grained Citation Evaluation in Generated Text: A Comparative Analysis of Faithfulness Metrics [22.041561519672456]
Large language models (LLMs) often produce unsupported or unverifiable content, known as "hallucinations".
We propose a comparative evaluation framework that assesses how effectively metrics distinguish citations across three categories of support.
Our results show that no single metric consistently excels across all evaluations, revealing the complexity of assessing fine-grained support.
arXiv Detail & Related papers (2024-06-21T15:57:24Z) - ALiiCE: Evaluating Positional Fine-grained Citation Generation [54.19617927314975]
We propose ALiiCE, the first automatic evaluation framework for fine-grained citation generation.
Our framework first parses each sentence-level claim into atomic claims via dependency analysis and then computes citation quality at the atomic-claim level (see the sketch below).
We evaluate the positional fine-grained citation generation performance of several Large Language Models on two long-form QA datasets.
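The decomposition step described above can be approximated with a dependency parse. The following is a simplified clause-splitting heuristic over spaCy output (en_core_web_sm is an assumed model choice), not the authors' implementation:

```python
# Toy atomic-claim splitter: treat the root verb and verbs conjoined to it
# as separate clauses, re-attaching an elided subject where needed.
import spacy

nlp = spacy.load("en_core_web_sm")

def atomic_claims(sentence: str) -> list[str]:
    doc = nlp(sentence)
    # Clause-level verbs: the root and any verb conjoined to it.
    head_ids = [t.i for t in doc if t.pos_ == "VERB" and t.dep_ in ("ROOT", "conj")]

    def clause_of(token):
        # Walk up the tree to the nearest clause-level verb.
        t = token
        while t.head.i != t.i and t.i not in head_ids:
            t = t.head
        return t.i

    subj = next((t.i for t in doc if t.dep_ == "nsubj"), None)
    claims = []
    for h in head_ids:
        ids = [t.i for t in doc
               if clause_of(t) == h and t.dep_ not in ("cc", "punct")]
        if subj is not None and subj not in ids:
            ids.append(subj)  # share the subject across conjoined clauses
        claims.append(" ".join(doc[i].text for i in sorted(ids)))
    return claims or [sentence]

print(atomic_claims("The treaty cut tariffs and opened new markets."))
# ['The treaty cut tariffs', 'treaty opened new markets']
```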
arXiv Detail & Related papers (2024-06-19T09:16:14Z) - ElicitationGPT: Text Elicitation Mechanisms via Language Models [12.945581341789431]
This paper develops mechanisms for scoring elicited text against ground truth text using domain-knowledge-free queries to a large language model.
An empirical evaluation is conducted on peer reviews from a peer-grading dataset and in comparison to manual instructor scores for the peer reviews.
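A minimal sketch of scoring elicited text against a ground-truth text with an LLM query, in the spirit of the summary above; the model name, prompt, and 1-10 scale are illustrative assumptions, not the paper's elicitation mechanism:

```python
# Hypothetical LLM-based scorer: compare an elicited peer review against a
# reference review with a single domain-knowledge-free prompt.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def llm_score(elicited: str, reference: str, model: str = "gpt-4o-mini") -> int:
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": (
                "On a scale of 1-10, how well does the REVIEW cover the "
                "points made in the REFERENCE? Answer with a single integer."
                f"\n\nREFERENCE:\n{reference}\n\nREVIEW:\n{elicited}"
            ),
        }],
    )
    return int(response.choices[0].message.content.strip())
```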
arXiv Detail & Related papers (2024-06-13T17:49:10Z) - SocREval: Large Language Models with the Socratic Method for Reference-Free Reasoning Evaluation [78.23119125463964]
We develop SocREval, a novel approach to prompt design for reference-free reasoning evaluation. SocREval significantly improves GPT-4's performance, surpassing existing reference-free and reference-based reasoning evaluation metrics.
arXiv Detail & Related papers (2023-09-29T18:25:46Z) - KPEval: Towards Fine-Grained Semantic-Based Keyphrase Evaluation [69.57018875757622]
We propose KPEval, a comprehensive evaluation framework consisting of four critical aspects: reference agreement, faithfulness, diversity, and utility.
Using KPEval, we re-evaluate 23 keyphrase systems and discover that established model comparison results have blind-spots.
arXiv Detail & Related papers (2023-03-27T17:45:38Z) - REAM$\sharp$: An Enhancement Approach to Reference-based Evaluation Metrics for Open-domain Dialog Generation [63.46331073232526]
We present an enhancement approach to Reference-based EvAluation Metrics for open-domain dialogue systems.
A prediction model is designed to estimate the reliability of the given reference set.
We show how its predictions can be used to augment the reference set and thus improve the reliability of the metric.
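The reliability-weighting idea can be illustrated as follows. The character-overlap metric and the agreement-based reliability stub are toy assumptions; REAM$\sharp$ instead trains a prediction model to estimate reliability:

```python
# Toy reliability-weighted reference-based metric: references judged
# unreliable contribute less to the final score.
import difflib
from typing import Sequence

def overlap(a: str, b: str) -> float:
    """Stand-in for a reference-based metric such as BLEU or ROUGE."""
    return difflib.SequenceMatcher(None, a, b).ratio()

def reliability(ref: str, pool: Sequence[str]) -> float:
    """Stub estimate: agreement with the rest of the reference set."""
    others = [r for r in pool if r != ref]
    if not others:
        return 1.0
    return sum(overlap(ref, o) for o in others) / len(others)

def weighted_score(candidate: str, references: Sequence[str]) -> float:
    weights = [reliability(r, references) for r in references]
    total = sum(weights) or 1.0
    return sum(w * overlap(candidate, r)
               for w, r in zip(weights, references)) / total

refs = ["thanks, see you tomorrow!", "see you tomorrow, thanks!", "potato"]
print(weighted_score("thanks, see you tomorrow", refs))
```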
arXiv Detail & Related papers (2021-05-30T10:04:13Z) - Learning Neural Textual Representations for Citation Recommendation [7.227232362460348]
We propose a novel approach to citation recommendation using a deep sequential representation of the documents (Sentence-BERT) cascaded with Siamese and triplet networks in a submodular scoring function.
To the best of our knowledge, this is the first approach to combine deep representations and submodular selection for the task of citation recommendation.
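A rough sketch of this pipeline under stated assumptions: off-the-shelf Sentence-BERT embeddings ranked by a greedy, MMR-style relevance-minus-redundancy objective standing in for the paper's submodular scoring function. The checkpoint name is arbitrary, and the Siamese/triplet fine-tuning is omitted:

```python
# Greedy selection of k citation candidates: balance similarity to the
# citing context against redundancy with already-selected candidates.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint

def recommend(context: str, candidates: list[str], k: int = 3,
              lam: float = 0.5) -> list[str]:
    ctx = model.encode(context, convert_to_tensor=True)
    cand = model.encode(candidates, convert_to_tensor=True)
    relevance = util.cos_sim(ctx, cand)[0]   # similarity to the citing context
    pairwise = util.cos_sim(cand, cand)      # candidate-candidate redundancy
    picked: list[int] = []
    while len(picked) < min(k, len(candidates)):
        best, best_gain = -1, float("-inf")
        for i in range(len(candidates)):
            if i in picked:
                continue
            redundancy = max((pairwise[i][j].item() for j in picked), default=0.0)
            gain = relevance[i].item() - lam * redundancy
            if gain > best_gain:
                best, best_gain = i, gain
        picked.append(best)
    return [candidates[i] for i in picked]
```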
arXiv Detail & Related papers (2020-07-08T12:38:50Z) - Context-Based Quotation Recommendation [60.93257124507105]
We propose a novel context-aware quote recommendation system.
It generates a ranked list of quotable paragraphs and spans of tokens from a given source document.
We conduct experiments on a collection of speech transcripts and associated news articles.
arXiv Detail & Related papers (2020-05-17T17:49:53Z) - HybridCite: A Hybrid Model for Context-Aware Citation Recommendation [0.0]
We develop citation recommendation approaches based on embeddings, topic modeling, and information retrieval techniques.
We combine, for the first time to the best of our knowledge, the best-performing algorithms into a semi-genetic hybrid recommender system.
Our evaluation results show that a hybrid model containing embedding and information retrieval-based components outperforms its individual components and further algorithms by a large margin.
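One way to realise such a hybrid, sketched under stated assumptions: fuse a lexical BM25 score with an embedding similarity via a fixed linear blend. The libraries below are illustrative, and the paper's combination is semi-genetic rather than a fixed weight:

```python
# Hybrid ranking: min-max-normalised BM25 and dense-embedding scores,
# blended with weight alpha.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint

def hybrid_rank(query: str, docs: list[str], alpha: float = 0.5) -> list[str]:
    bm25 = BM25Okapi([d.lower().split() for d in docs])
    lexical = list(bm25.get_scores(query.lower().split()))
    dense = util.cos_sim(
        model.encode(query, convert_to_tensor=True),
        model.encode(docs, convert_to_tensor=True),
    )[0].tolist()

    def norm(xs: list[float]) -> list[float]:
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]

    fused = [alpha * l + (1 - alpha) * d
             for l, d in zip(norm(lexical), norm(dense))]
    return [doc for _, doc in sorted(zip(fused, docs), reverse=True)]
```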
arXiv Detail & Related papers (2020-02-15T16:19:55Z)