VERISCORE: Evaluating the factuality of verifiable claims in long-form text generation
- URL: http://arxiv.org/abs/2406.19276v1
- Date: Thu, 27 Jun 2024 15:43:18 GMT
- Title: VERISCORE: Evaluating the factuality of verifiable claims in long-form text generation
- Authors: Yixiao Song, Yekyung Kim, Mohit Iyyer
- Abstract summary: Existing metrics for evaluating the factuality of long-form text decompose an input text into "atomic claims" and verify each against a knowledge base like Wikipedia.
We address this issue with VERISCORE, a metric for diverse long-form generation tasks that contain both verifiable and unverifiable content.
We use VERISCORE to evaluate generations from 16 different models across multiple long-form tasks and find that while GPT-4o is the best-performing model overall, open-weight models such as Mixtral-8x22 are closing the gap.
- Score: 37.16881685930474
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing metrics for evaluating the factuality of long-form text, such as FACTSCORE (Min et al., 2023) and SAFE (Wei et al., 2024), decompose an input text into "atomic claims" and verify each against a knowledge base like Wikipedia. These metrics are not suitable for most generation tasks because they assume that every claim is verifiable (i.e., can plausibly be proven true or false). We address this issue with VERISCORE, a metric for diverse long-form generation tasks that contain both verifiable and unverifiable content. VERISCORE can be effectively implemented with either closed or fine-tuned open-weight language models, and human evaluation confirms that VERISCORE's extracted claims are more sensible than those from competing methods across eight different long-form tasks. We use VERISCORE to evaluate generations from 16 different models across multiple long-form tasks and find that while GPT-4o is the best-performing model overall, open-weight models such as Mixtral-8x22 are closing the gap. We show that an LM's VERISCORE on one task (e.g., biography generation) does not necessarily correlate to its VERISCORE on a different task (e.g., long-form QA), highlighting the need for expanding factuality evaluation across tasks with varying fact density.
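For illustration, below is a minimal sketch of how a VERISCORE-style score could be aggregated, assuming a SAFE-style F1@K: per-response precision over extracted verifiable claims, combined with recall capped at a task-dependent K (for example, the median number of extracted claims for that task). The `extract_claims` and `is_supported` callables are hypothetical stand-ins for the paper's LLM-based claim extractor and evidence-based verification step; this is not the authors' released implementation.

```python
# Minimal sketch of a VERISCORE-style aggregation (assumed SAFE-like F1@K),
# not the authors' released code. Claim extraction and verification are
# stubbed out; in practice they would be an LLM extractor plus an
# evidence-retrieval and entailment judge.

from statistics import median
from typing import Callable, List


def f1_at_k(num_supported: int, num_extracted: int, k: int) -> float:
    """Harmonic mean of precision and K-capped recall for one response."""
    if num_extracted == 0:
        return 0.0
    precision = num_supported / num_extracted
    recall = min(num_supported / k, 1.0) if k > 0 else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def veriscore_like(
    responses: List[str],
    extract_claims: Callable[[str], List[str]],  # hypothetical claim extractor
    is_supported: Callable[[str], bool],         # hypothetical verification judge
) -> float:
    """Average F1@K over responses, with K set to the median claim count."""
    extracted = [extract_claims(r) for r in responses]
    k = max(1, int(median(len(claims) for claims in extracted)))
    scores = []
    for claims in extracted:
        supported = sum(1 for c in claims if is_supported(c))
        scores.append(f1_at_k(supported, len(claims), k))
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Toy stand-ins: "claims" are sentences, and a claim counts as supported
    # only if it mentions "1923" (purely illustrative).
    toy_responses = [
        "The bridge opened in 1923. It is painted green.",
        "The bridge was designed by an unknown engineer.",
    ]
    toy_extract = lambda text: [s.strip() for s in text.split(".") if s.strip()]
    toy_verify = lambda claim: "1923" in claim
    print(round(veriscore_like(toy_responses, toy_extract, toy_verify), 3))
```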
Related papers
- Retrieval-Augmented Generation with Conflicting Evidence [57.66282463340297]
Large language model (LLM) agents are increasingly employing retrieval-augmented generation (RAG) to improve the factuality of their responses.
In practice, these systems often need to handle ambiguous user queries and potentially conflicting information from multiple sources.
We propose RAMDocs (Retrieval with Ambiguity and Misinformation in Documents), a new dataset that simulates complex and realistic scenarios for conflicting evidence for a user query.
arXiv Detail & Related papers (2025-04-17T16:46:11Z) - On the Robust Approximation of ASR Metrics [30.524282767961463]
We propose a novel approach for approximating ASR performance metrics, eliminating the need for ground truth labels.
Our method utilizes multimodal embeddings in a unified space for speech and transcription representations, combined with a high-quality proxy model to compute proxy metrics.
Our results show that we approximate the metrics within a single-digit absolute difference across all experimental configurations, outperforming the most recent baseline by more than 50%.
arXiv Detail & Related papers (2025-02-18T01:10:17Z) - FaithEval: Can Your Language Model Stay Faithful to Context, Even If "The Moon is Made of Marshmallows" [74.7488607599921]
FaithEval is a benchmark to evaluate the faithfulness of large language models (LLMs) in contextual scenarios.
FaithEval comprises 4.9K high-quality problems in total, validated through a rigorous four-stage context construction and validation framework.
arXiv Detail & Related papers (2024-09-30T06:27:53Z) - Evaluating Language Model Context Windows: A "Working Memory" Test and Inference-time Correction [10.428174043080622]
Large language models are prominently used in real-world applications, often tasked with reasoning over large volumes of documents.
We propose SWiM, an evaluation framework that addresses the limitations of standard tests.
We also propose medoid voting, a simple, but effective training-free approach that helps alleviate this effect.
arXiv Detail & Related papers (2024-07-04T05:46:20Z) - COVID-19 Twitter Sentiment Classification Using Hybrid Deep Learning Model Based on Grid Search Methodology [0.0]
Sentiment prediction is achieved using embeddings, a hybrid deep learning model, and a grid search algorithm on a Twitter COVID-19 dataset.
According to the study, public sentiment towards COVID-19 immunization appears to be improving with time.
arXiv Detail & Related papers (2024-06-11T07:48:06Z) - Long-form factuality in large language models [60.07181269469043]
Large language models (LLMs) often generate content that contains factual errors when responding to fact-seeking prompts on open-ended topics.
We benchmark a model's long-form factuality in open domains by using GPT-4 to generate LongFact, a prompt set of fact-seeking questions.
We then propose that LLM agents can be used as automated evaluators for long-form factuality through a method we call the Search-Augmented Factuality Evaluator (SAFE).
arXiv Detail & Related papers (2024-03-27T17:48:55Z) - Retrieval is Accurate Generation [99.24267226311157]
We introduce a novel method that selects context-aware phrases from a collection of supporting documents.
Our model achieves the best performance and the lowest latency among several retrieval-augmented baselines.
arXiv Detail & Related papers (2024-02-27T14:16:19Z) - Fast and Accurate Factual Inconsistency Detection Over Long Documents [19.86348214462828]
We introduce SCALE, a task-agnostic model for detecting factual inconsistencies using a novel chunking strategy.
This approach achieves state-of-the-art performance in factual inconsistency detection for diverse tasks and long inputs.
We have publicly released our code and data on GitHub.
arXiv Detail & Related papers (2023-10-19T22:55:39Z) - Preserving Knowledge Invariance: Rethinking Robustness Evaluation of Open Information Extraction [50.62245481416744]
We present the first benchmark that simulates the evaluation of open information extraction models in the real world.
We design and annotate a large-scale testbed in which each example is a knowledge-invariant clique.
Elaborating on the robustness metric, a model is judged robust only if its performance is consistently accurate across entire cliques.
arXiv Detail & Related papers (2023-05-23T12:05:09Z) - Grounded Keys-to-Text Generation: Towards Factual Open-Ended Generation [92.1582872870226]
We propose a new grounded keys-to-text generation task.
The task is to generate a factual description about an entity given a set of guiding keys, and grounding passages.
Inspired by recent QA-based evaluation measures, we propose an automatic metric, MAFE, for factual correctness of generated descriptions.
arXiv Detail & Related papers (2022-12-04T23:59:41Z)