AttributionBench: How Hard is Automatic Attribution Evaluation?
- URL: http://arxiv.org/abs/2402.15089v1
- Date: Fri, 23 Feb 2024 04:23:33 GMT
- Title: AttributionBench: How Hard is Automatic Attribution Evaluation?
- Authors: Yifei Li, Xiang Yue, Zeyi Liao, Huan Sun
- Abstract summary: We present AttributionBench, a comprehensive benchmark compiled from various existing attribution datasets.
Our experiments show that even a fine-tuned GPT-3.5 only achieves around 80% macro-F1 under a binary classification formulation.
A detailed analysis of more than 300 error cases indicates that a majority of failures stem from the model's inability to process nuanced information.
- Score: 19.872081697282002
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern generative search engines enhance the reliability of large language
model (LLM) responses by providing cited evidence. However, evaluating the
answer's attribution, i.e., whether every claim within the generated responses
is fully supported by its cited evidence, remains an open problem. This
verification, traditionally dependent on costly human evaluation, underscores
the urgent need for automatic attribution evaluation methods. To bridge the gap
in the absence of standardized benchmarks for these methods, we present
AttributionBench, a comprehensive benchmark compiled from various existing
attribution datasets. Our extensive experiments on AttributionBench reveal the
challenges of automatic attribution evaluation, even for state-of-the-art LLMs.
Specifically, our findings show that even a fine-tuned GPT-3.5 only achieves
around 80% macro-F1 under a binary classification formulation. A detailed
analysis of more than 300 error cases indicates that a majority of failures
stem from the model's inability to process nuanced information, and the
discrepancy between the information accessible to the model and the information
available to human annotators.
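The binary formulation mentioned in the abstract treats each (claim, cited evidence) pair as a two-class decision, attributable or not, and scores evaluators with macro-F1 over the two classes. The snippet below is a minimal sketch of that scoring setup; the examples, gold labels, and the token-overlap `predict_attribution` heuristic are hypothetical stand-ins, not the benchmark's released code or data.

```python
# Minimal sketch of the binary attribution-evaluation formulation: each example
# pairs a generated claim with its cited evidence, and an evaluator predicts
# whether the evidence fully supports the claim. The heuristic below is a
# hypothetical stand-in for a real evaluator (e.g. a fine-tuned LLM).
from sklearn.metrics import f1_score

examples = [
    # (claim, cited evidence, gold label: 1 = attributable, 0 = not attributable)
    ("The Eiffel Tower is in Paris.",
     "The Eiffel Tower is a wrought-iron tower in Paris, France.", 1),
    ("The Eiffel Tower opened in 1900.",
     "Construction of the tower was completed in 1889.", 0),
]

def predict_attribution(claim: str, evidence: str) -> int:
    """Hypothetical stand-in evaluator: a crude token-overlap heuristic."""
    claim_tokens = set(claim.lower().split())
    evidence_tokens = set(evidence.lower().split())
    overlap = len(claim_tokens & evidence_tokens) / max(len(claim_tokens), 1)
    return int(overlap >= 0.6)

gold = [label for _, _, label in examples]
pred = [predict_attribution(claim, evidence) for claim, evidence, _ in examples]

# Macro-F1 averages the per-class F1 of "attributable" and "not attributable",
# so both error directions are weighted equally regardless of class balance.
print(f1_score(gold, pred, average="macro"))
```

In AttributionBench itself the evaluator would be an LLM or a fine-tuned classifier rather than a lexical heuristic; the scoring convention is the same.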
Related papers
- YourBench: Easy Custom Evaluation Sets for Everyone [12.995134931278056]
YourBench is a novel, open-source framework for evaluating large language models (LLMs).
It generates reliable, up-to-date, and domain-tailored benchmarks cheaply and without manual annotation.
We release the YourBench library, the Tempora-0325 dataset, 150k+ question answer pairs based on Tempora and all evaluation and inference traces.
arXiv Detail & Related papers (2025-04-02T15:40:24Z)
- Does Context Matter? ContextualJudgeBench for Evaluating LLM-based Judges in Contextual Settings [36.449658676568234]
The large language model (LLM)-as-judge paradigm has been used to meet the demand for cheap, reliable, and fast evaluation of model outputs.
We propose ContextualJudgeBench, a judge benchmark with 2,000 challenging response pairs across eight splits inspired by real-world contextual evaluation scenarios.
Our comprehensive study reveals that the contextual information and its assessment criteria present a significant challenge to even state-of-the-art models.
arXiv Detail & Related papers (2025-03-19T18:09:19Z)
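The contextual, pairwise judging setup that ContextualJudgeBench probes can be illustrated with a short sketch of the LLM-as-judge pattern. Everything below is generic and assumed: `call_llm` is a hypothetical stand-in for a real model endpoint, and the prompt is illustrative rather than the benchmark's own harness.

```python
# Generic sketch of pairwise, context-grounded LLM-as-judge evaluation.
# `call_llm` is a hypothetical stand-in for an actual model endpoint.

JUDGE_TEMPLATE = """You are evaluating two answers to the same question.
Judge ONLY on how well each answer is supported by the context below.

Context:
{context}

Question: {question}

Answer A: {answer_a}
Answer B: {answer_b}

Reply with exactly "A" or "B" for the better-grounded answer."""

def call_llm(prompt: str) -> str:
    """Hypothetical model call; swap in a real API client here."""
    return "A"  # canned response so the sketch runs end-to-end

def judge_pair(context: str, question: str, answer_a: str, answer_b: str) -> str:
    prompt = JUDGE_TEMPLATE.format(
        context=context, question=question, answer_a=answer_a, answer_b=answer_b
    )
    verdict = call_llm(prompt).strip().upper()
    return verdict if verdict in {"A", "B"} else "A"  # fall back on malformed output
```

Position bias is a known failure mode of pairwise judging, so a fuller harness would also query with the answers swapped and keep the verdict only when it is consistent.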
- FactLens: Benchmarking Fine-Grained Fact Verification [6.814173254027381]
We advocate for a shift toward fine-grained verification, where complex claims are broken down into smaller sub-claims for individual verification.
We introduce FactLens, a benchmark for evaluating fine-grained fact verification, with metrics and automated evaluators of sub-claim quality.
Our results show alignment between automated FactLens evaluators and human judgments, and we discuss the impact of sub-claim characteristics on the overall verification performance.
arXiv Detail & Related papers (2024-11-08T21:26:57Z)
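The fine-grained setting FactLens targets follows a decompose-then-verify pattern: break a complex claim into sub-claims, check each against the evidence, and aggregate. The sketch below is a schematic of that pattern; `decompose`, `verify_subclaim`, and the toy string matching are hypothetical stand-ins, not the FactLens evaluators.

```python
# Schematic decompose-then-verify pipeline for fine-grained fact verification.
# `decompose` and `verify_subclaim` are hypothetical stand-ins; in practice
# both steps are typically backed by an LLM or a trained verifier.
from dataclasses import dataclass

@dataclass
class SubClaimResult:
    text: str
    supported: bool

def decompose(claim: str) -> list[str]:
    """Toy decomposition: split on ' and ' so the sketch runs end-to-end."""
    return [part.strip() for part in claim.split(" and ") if part.strip()]

def verify_subclaim(subclaim: str, evidence: str) -> bool:
    """Toy verifier: substring match as a stand-in for real entailment checking."""
    return subclaim.lower().rstrip(".") in evidence.lower()

def verify_claim(claim: str, evidence: str) -> list[SubClaimResult]:
    return [SubClaimResult(s, verify_subclaim(s, evidence)) for s in decompose(claim)]

results = verify_claim(
    "The Nile is in Africa and it is the longest river in the world.",
    "The Nile is in Africa. Some sources list the Amazon as longer.",
)
# The full claim counts as supported only if every sub-claim is supported.
print(all(r.supported for r in results), results)
```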
- JudgeRank: Leveraging Large Language Models for Reasoning-Intensive Reranking [81.88787401178378]
We introduce JudgeRank, a novel agentic reranker that emulates human cognitive processes when assessing document relevance.
We evaluate JudgeRank on the reasoning-intensive BRIGHT benchmark, demonstrating substantial performance improvements over first-stage retrieval methods.
In addition, JudgeRank performs on par with fine-tuned state-of-the-art rerankers on the popular BEIR benchmark, validating its zero-shot generalization capability.
arXiv Detail & Related papers (2024-10-31T18:43:12Z)
- FactBench: A Dynamic Benchmark for In-the-Wild Language Model Factuality Evaluation [4.773086022844023]
We present VERIFY, a pipeline to evaluate LMs' factuality in real-world user interactions.
VERIFY considers the verifiability of LM-generated content and categorizes content units as supported, unsupported, or undecidable.
We benchmark widely used LMs from the GPT, Gemini, and Llama3.1 families on FactBench.
arXiv Detail & Related papers (2024-10-29T17:19:56Z)
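VERIFY's three-way labeling (supported / unsupported / undecidable) lends itself to a precision-style summary in which undecidable units are tracked separately rather than counted as errors. The snippet below is a minimal, hypothetical aggregation written for illustration; the labels and scoring convention are assumptions, not the released FactBench/VERIFY pipeline.

```python
# Minimal sketch: aggregating three-way content-unit labels into summary
# statistics. The label list and scoring convention are illustrative only.
from collections import Counter

labels = ["supported", "supported", "unsupported", "undecidable", "supported"]
counts = Counter(labels)

decided = counts["supported"] + counts["unsupported"]
# Precision-style factuality over decidable units only.
factual_precision = counts["supported"] / decided if decided else 0.0
undecidable_rate = counts["undecidable"] / len(labels)

print(f"factual precision: {factual_precision:.2f}, undecidable: {undecidable_rate:.2f}")
```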
- Investigating the Impact of Hard Samples on Accuracy Reveals In-class Data Imbalance [4.291589126905706]
In the AutoML domain, test accuracy is heralded as the quintessential metric for evaluating model efficacy.
However, the reliability of test accuracy as the primary performance metric has been called into question.
The distribution of hard samples between training and test sets affects the difficulty levels of those sets.
We propose a benchmarking procedure for comparing hard sample identification methods.
arXiv Detail & Related papers (2024-09-22T11:38:14Z)
- Detecting Multimodal Situations with Insufficient Context and Abstaining from Baseless Predictions [75.45274978665684]
Vision-Language Understanding (VLU) benchmarks contain samples where answers rely on assumptions unsupported by the provided context.
We collect contextual data for each sample whenever available and train a context selection module to facilitate evidence-based model predictions.
We develop a general-purpose Context-AwaRe Abstention detector to identify samples lacking sufficient context and enhance model accuracy.
arXiv Detail & Related papers (2024-05-18T02:21:32Z)
- Factcheck-Bench: Fine-Grained Evaluation Benchmark for Automatic Fact-checkers [121.53749383203792]
We present a holistic end-to-end solution for annotating the factuality of large language models (LLMs)-generated responses.
We construct an open-domain document-level factuality benchmark with three levels of granularity: claim, sentence, and document.
Preliminary experiments show that FacTool, FactScore, and Perplexity struggle to identify false claims.
arXiv Detail & Related papers (2023-11-15T14:41:57Z)
- Preserving Knowledge Invariance: Rethinking Robustness Evaluation of Open Information Extraction [50.62245481416744]
We present the first benchmark that simulates the evaluation of open information extraction models in the real world.
We design and annotate a large-scale testbed in which each example is a knowledge-invariant clique.
Under the elaborated robustness metric, a model is judged to be robust if its performance is consistently accurate over each entire clique.
arXiv Detail & Related papers (2023-05-23T12:05:09Z)
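One plausible reading of the clique-based robustness criterion above is worst-case accuracy within each knowledge-invariant clique: the model gets credit for a clique only if it is correct on every paraphrased member. The sketch below illustrates that reading with hypothetical data and a toy scorer; it is not the paper's actual metric implementation.

```python
# One plausible reading of clique-level robustness, sketched for illustration:
# a clique groups knowledge-invariant paraphrases of the same example, and the
# model only gets credit for a clique if it is correct on every member.
# `model_is_correct` is a hypothetical stand-in for the task-specific scorer.
def model_is_correct(example: str) -> bool:
    """Hypothetical per-example scorer; replace with real OpenIE evaluation."""
    return "Paris" in example  # toy rule so the sketch runs end-to-end

cliques = [
    ["The Eiffel Tower stands in Paris.", "Paris is home to the Eiffel Tower."],
    ["The Louvre is in Paris.", "The Louvre museum is located in the French capital."],
]

robust_accuracy = sum(
    all(model_is_correct(ex) for ex in clique) for clique in cliques
) / len(cliques)
print(robust_accuracy)  # 0.5: the second clique fails on its paraphrased member
```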
- Automatic Evaluation of Attribution by Large Language Models [24.443271739599194]
We investigate the automatic evaluation of attribution given by large language models (LLMs).
We begin by defining different types of attribution errors, and then explore two approaches for automatic evaluation.
We manually curate a set of test examples covering 12 domains from a generative search engine, New Bing.
arXiv Detail & Related papers (2023-05-10T16:58:33Z)
- GREAT Score: Global Robustness Evaluation of Adversarial Perturbation using Generative Models [60.48306899271866]
We present a new framework, called GREAT Score, for global robustness evaluation of adversarial perturbation using generative models.
We show that GREAT Score correlates highly with attack-based model rankings on RobustBench while being significantly cheaper to compute.
GREAT Score can be used for remote auditing of privacy-sensitive black-box models.
arXiv Detail & Related papers (2023-04-19T14:58:27Z)
- WiCE: Real-World Entailment for Claims in Wikipedia [63.234352061821625]
We propose WiCE, a new fine-grained textual entailment dataset built on natural claim and evidence pairs extracted from Wikipedia.
In addition to standard claim-level entailment, WiCE provides entailment judgments over sub-sentence units of the claim.
We show that real claims in our dataset involve challenging verification and retrieval problems that existing models fail to address.
arXiv Detail & Related papers (2023-03-02T17:45:32Z)
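Claim-versus-evidence entailment of the kind WiCE annotates is commonly approached with off-the-shelf NLI models that score whether the evidence (premise) entails the claim (hypothesis). The sketch below shows a generic check of that kind using Hugging Face Transformers; the model choice is an assumption, and this is not the WiCE authors' evaluation code.

```python
# Generic NLI-style entailment check of a claim against evidence, the kind of
# judgment WiCE annotates at claim and sub-sentence level. The model choice is
# an assumption; any MNLI-style classifier could be substituted.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "microsoft/deberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

def entailment_probs(evidence: str, claim: str) -> dict[str, float]:
    """Return label -> probability for (premise=evidence, hypothesis=claim)."""
    inputs = tokenizer(evidence, claim, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(dim=-1).squeeze(0)
    return {model.config.id2label[i]: p.item() for i, p in enumerate(probs)}

probs = entailment_probs(
    "The Amazon rainforest spans nine countries in South America.",
    "The Amazon rainforest is located in South America.",
)
print(probs)  # label names are read from the model config rather than hard-coded
```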