Evaluating Factual Consistency of Texts with Semantic Role Labeling
- URL: http://arxiv.org/abs/2305.13309v1
- Date: Mon, 22 May 2023 17:59:42 GMT
- Title: Evaluating Factual Consistency of Texts with Semantic Role Labeling
- Authors: Jing Fan, Dennis Aumiller, Michael Gertz
- Abstract summary: We introduce SRLScore, a reference-free evaluation metric designed with text summarization in mind.
A final factuality score is computed by an adjustable scoring mechanism.
Correlation with human judgments on English summarization datasets shows that SRLScore is competitive with state-of-the-art methods.
- Score: 3.1776833268555134
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Automated evaluation of text generation systems has recently seen increasing
attention, particularly checking whether generated text stays truthful to input
sources. Existing methods frequently rely on an evaluation using task-specific
language models, which in turn allows for little interpretability of generated
scores. We introduce SRLScore, a reference-free evaluation metric designed with
text summarization in mind. Our approach generates fact tuples constructed from
Semantic Role Labels, applied to both input and summary texts. A final
factuality score is computed by an adjustable scoring mechanism, which allows
for easy adaption of the method across domains. Correlation with human
judgments on English summarization datasets shows that SRLScore is competitive
with state-of-the-art methods and exhibits stable generalization across
datasets without requiring further training or hyperparameter tuning. We
experiment with an optional co-reference resolution step, but find that the
performance boost is mostly outweighed by the additional compute required. Our
metric is available online at https://github.com/heyjing/SRLScore.
Related papers
- Using Similarity to Evaluate Factual Consistency in Summaries [2.7595794227140056]
Abstractive summarisers generate fluent summaries, but the factuality of the generated text is not guaranteed.
We propose a new zero-shot factuality evaluation metric, Sentence-BERTScore (SBERTScore), which compares sentences between the summary and the source document.
Our experiments indicate that each technique has different strengths, with SBERTScore particularly effective in identifying correct summaries.
arXiv Detail & Related papers (2024-09-23T15:02:38Z) - Optimizing Factual Accuracy in Text Generation through Dynamic Knowledge
Selection [71.20871905457174]
Language models (LMs) have revolutionized the way we interact with information, but they often generate nonfactual text.
Previous methods use external knowledge as references for text generation to enhance factuality but often struggle with the knowledge mix-up of irrelevant references.
We present DKGen, which divide the text generation process into an iterative process.
arXiv Detail & Related papers (2023-08-30T02:22:40Z) - Label Agnostic Pre-training for Zero-shot Text Classification [4.9081735096855565]
In real-world applications, there exists an infinite label space for describing a given text.
We introduce two new simple yet effective pre-training strategies, Implicit and Explicit pre-training.
These methods inject aspect-level understanding into the model at train time with the goal of conditioning the model to build task-level understanding.
arXiv Detail & Related papers (2023-05-25T22:55:32Z) - Large Language Models are Diverse Role-Players for Summarization
Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most of the automatic evaluation methods like BLUE/ROUGE may be not able to adequately capture the above dimensions.
We propose a new evaluation framework based on LLMs, which provides a comprehensive evaluation framework by comparing generated text and reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z) - On the Blind Spots of Model-Based Evaluation Metrics for Text Generation [79.01422521024834]
We explore a useful but often neglected methodology for robustness analysis of text generation evaluation metrics.
We design and synthesize a wide range of potential errors and check whether they result in a commensurate drop in the metric scores.
Our experiments reveal interesting insensitivities, biases, or even loopholes in existing metrics.
arXiv Detail & Related papers (2022-12-20T06:24:25Z) - T5Score: Discriminative Fine-tuning of Generative Evaluation Metrics [94.69907794006826]
We present a framework that combines the best of both worlds, using both supervised and unsupervised signals from whatever data we have available.
We operationalize this idea by training T5Score, a metric that uses these training signals with mT5 as the backbone.
T5Score achieves the best performance on all datasets against existing top-scoring metrics at the segment level.
arXiv Detail & Related papers (2022-12-12T06:29:04Z) - SMART: Sentences as Basic Units for Text Evaluation [48.5999587529085]
In this paper, we introduce a new metric called SMART to mitigate such limitations.
We treat sentences as basic units of matching instead of tokens, and use a sentence matching function to soft-match candidate and reference sentences.
Our results show that system-level correlations of our proposed metric with a model-based matching function outperforms all competing metrics.
arXiv Detail & Related papers (2022-08-01T17:58:05Z) - TRUE: Re-evaluating Factual Consistency Evaluation [29.888885917330327]
We introduce TRUE: a comprehensive study of factual consistency metrics on a standardized collection of existing texts from diverse tasks.
Our standardization enables an example-level meta-evaluation protocol that is more actionable and interpretable than previously reported correlations.
Across diverse state-of-the-art metrics and 11 datasets we find that large-scale NLI and question generation-and-answering-based approaches achieve strong and complementary results.
arXiv Detail & Related papers (2022-04-11T10:14:35Z) - BARTScore: Evaluating Generated Text as Text Generation [89.50052670307434]
We conceptualize the evaluation of generated text as a text generation problem, modeled using pre-trained sequence-to-sequence models.
We operationalize this idea using BART, an encoder-decoder based pre-trained model.
We propose a metric BARTScore with a number of variants that can be flexibly applied to evaluation of text from different perspectives.
arXiv Detail & Related papers (2021-06-22T03:20:53Z) - Data Augmentation in Natural Language Processing: A Novel Text
Generation Approach for Long and Short Text Classifiers [8.19984844136462]
We present and evaluate a text generation method suitable to increase the performance of classifiers for long and short texts.
In a simulated low data regime additive accuracy gains of up to 15.53% are achieved.
We discuss implications and patterns for the successful application of our approach on different types of datasets.
arXiv Detail & Related papers (2021-03-26T13:16:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.