SMART: Sentences as Basic Units for Text Evaluation
- URL: http://arxiv.org/abs/2208.01030v1
- Date: Mon, 1 Aug 2022 17:58:05 GMT
- Title: SMART: Sentences as Basic Units for Text Evaluation
- Authors: Reinald Kim Amplayo, Peter J. Liu, Yao Zhao, Shashi Narayan
- Abstract summary: In this paper, we introduce a new metric called SMART to mitigate such limitations.
We treat sentences as basic units of matching instead of tokens, and use a sentence matching function to soft-match candidate and reference sentences.
Our results show that system-level correlations of our proposed metric with a model-based matching function outperforms all competing metrics.
- Score: 48.5999587529085
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Widely used evaluation metrics for text generation either do not work well
with longer texts or fail to evaluate all aspects of text quality. In this
paper, we introduce a new metric called SMART to mitigate such limitations.
Specifically, we treat sentences as basic units of matching instead of tokens,
and use a sentence matching function to soft-match candidate and reference
sentences. Candidate sentences are also compared to sentences in the source
documents to allow grounding (e.g., factuality) evaluation. Our results show
that the system-level correlations of our proposed metric with a model-based
matching function outperform those of all competing metrics on the SummEval
summarization meta-evaluation dataset, while the same metric with a
string-based matching function is competitive with current model-based metrics.
The latter does not use any neural model, which is useful during model
development phases where resources can be limited and fast evaluation is
required. Finally, we also conducted extensive analyses showing that our
proposed metrics work well with longer summaries and are less biased towards
specific models.
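To make the sentence-level matching idea concrete, here is a minimal sketch, not the authors' SMART implementation: it splits the candidate and reference into sentences, scores every sentence pair with a simple token-overlap F1 that stands in for the paper's string- or model-based matching functions, and aggregates the best matches into precision, recall, and an F-measure. All helper names are illustrative, and the grounding comparison against source-document sentences is omitted.

```python
import re

def sentences(text):
    """Naive sentence splitter; a real implementation would use a proper tokenizer."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]

def token_f1(a, b):
    """Token-overlap F1 between two sentences, a stand-in for SMART's matching functions."""
    ta, tb = a.lower().split(), b.lower().split()
    if not ta or not tb:
        return 0.0
    common = sum(min(ta.count(t), tb.count(t)) for t in set(ta))
    if common == 0:
        return 0.0
    p, r = common / len(ta), common / len(tb)
    return 2 * p * r / (p + r)

def sentence_level_f1(candidate, reference, match=token_f1):
    """Soft-match candidate and reference sentences and combine into an F-measure."""
    cand, ref = sentences(candidate), sentences(reference)
    if not cand or not ref:
        return 0.0
    # Precision: credit each candidate sentence with its best-matching reference sentence.
    precision = sum(max(match(c, r) for r in ref) for c in cand) / len(cand)
    # Recall: credit each reference sentence with its best-matching candidate sentence.
    recall = sum(max(match(c, r) for c in cand) for r in ref) / len(ref)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

if __name__ == "__main__":
    reference = "The council approved the budget. Construction starts in May."
    candidate = "The city council approved the new budget. Work begins in May."
    print(round(sentence_level_f1(candidate, reference), 3))
```

Swapping token_f1 for a chrF-style string matcher or a learned sentence-similarity model would move this sketch toward the string-based and model-based variants described in the abstract.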
Related papers
- Using Similarity to Evaluate Factual Consistency in Summaries [2.7595794227140056]
Abstractive summarisers generate fluent summaries, but the factuality of the generated text is not guaranteed.
We propose a new zero-shot factuality evaluation metric, Sentence-BERTScore (SBERTScore), which compares sentences between the summary and the source document.
Our experiments indicate that each technique has different strengths, with SBERTScore particularly effective at identifying correct summaries; a minimal sketch of this kind of sentence-level comparison appears after this list.
arXiv Detail & Related papers (2024-09-23T15:02:38Z) - We Need to Talk About Classification Evaluation Metrics in NLP [34.73017509294468]
In Natural Language Processing (NLP), model generalizability is generally measured with standard metrics such as Accuracy, F-Measure, or AUC-ROC.
The diversity of metrics and the arbitrariness of their application suggest that there is no agreement within NLP on a single best metric to use.
We demonstrate that a random-guess normalised Informedness metric is a parsimonious baseline for task performance.
arXiv Detail & Related papers (2024-01-08T11:40:48Z) - On Context Utilization in Summarization with Large Language Models [83.84459732796302]
Large language models (LLMs) excel in abstractive summarization tasks, delivering fluent and pertinent summaries.
Recent advancements have extended their capabilities to handle long-input contexts, exceeding 100k tokens.
We conduct the first comprehensive study on context utilization and position bias in summarization.
arXiv Detail & Related papers (2023-10-16T16:45:12Z) - Evaluation of Faithfulness Using the Longest Supported Subsequence [52.27522262537075]
We introduce a novel approach to evaluate faithfulness of machine-generated text by computing the longest noncontinuous subsequence of the claim that is supported by the context.
Using a new human-annotated dataset, we finetune a model to generate the Longest Supported Subsequence (LSS).
Our proposed metric demonstrates an 18% enhancement over the prevailing state-of-the-art metric for faithfulness on our dataset.
arXiv Detail & Related papers (2023-08-23T14:18:44Z) - Evaluating Factual Consistency of Texts with Semantic Role Labeling [3.1776833268555134]
We introduce SRLScore, a reference-free evaluation metric designed with text summarization in mind.
A final factuality score is computed by an adjustable scoring mechanism.
Correlation with human judgments on English summarization datasets shows that SRLScore is competitive with state-of-the-art methods.
arXiv Detail & Related papers (2023-05-22T17:59:42Z) - On the Blind Spots of Model-Based Evaluation Metrics for Text Generation [79.01422521024834]
We explore a useful but often neglected methodology for robustness analysis of text generation evaluation metrics.
We design and synthesize a wide range of potential errors and check whether they result in a commensurate drop in the metric scores.
Our experiments reveal interesting insensitivities, biases, or even loopholes in existing metrics.
arXiv Detail & Related papers (2022-12-20T06:24:25Z) - On the Limitations of Reference-Free Evaluations of Generated Text [64.81682222169113]
We show that reference-free metrics are inherently biased and limited in their ability to evaluate generated text.
We argue that they should not be used to measure progress on tasks like machine translation or summarization.
arXiv Detail & Related papers (2022-10-22T22:12:06Z) - TRUE: Re-evaluating Factual Consistency Evaluation [29.888885917330327]
We introduce TRUE: a comprehensive study of factual consistency metrics on a standardized collection of existing texts from diverse tasks.
Our standardization enables an example-level meta-evaluation protocol that is more actionable and interpretable than previously reported correlations.
Across diverse state-of-the-art metrics and 11 datasets, we find that large-scale NLI and question generation-and-answering-based approaches achieve strong and complementary results.
arXiv Detail & Related papers (2022-04-11T10:14:35Z) - Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary [65.37544133256499]
We propose a metric to evaluate the content quality of a summary using question-answering (QA).
We demonstrate the experimental benefits of QA-based metrics through an analysis of our proposed metric, QAEval.
arXiv Detail & Related papers (2020-10-01T15:33:09Z)
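As referenced in the SBERTScore entry above, the following is a minimal sketch of sentence-level embedding comparison between a summary and its source document. It assumes the sentence-transformers library and the all-MiniLM-L6-v2 checkpoint, and the max-cosine aggregation is an illustrative choice rather than that paper's exact formulation.

```python
# Hedged sketch: requires `pip install sentence-transformers`; the model name and
# the max-cosine aggregation are illustrative assumptions, not the paper's recipe.
import re
from sentence_transformers import SentenceTransformer, util

def split_sentences(text):
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]

def sentence_embedding_score(summary, source, model_name="all-MiniLM-L6-v2"):
    """Embed summary and source sentences, then average, over summary sentences,
    the cosine similarity of each sentence's best-matching source sentence."""
    model = SentenceTransformer(model_name)
    summ_sents, src_sents = split_sentences(summary), split_sentences(source)
    summ_emb = model.encode(summ_sents, convert_to_tensor=True)
    src_emb = model.encode(src_sents, convert_to_tensor=True)
    sims = util.cos_sim(summ_emb, src_emb)  # shape: [num_summary_sents, num_source_sents]
    return sims.max(dim=1).values.mean().item()

if __name__ == "__main__":
    source = "The mayor announced a new park. It will open next spring near the river."
    summary = "A new park will open next spring."
    print(round(sentence_embedding_score(summary, source), 3))
```

Higher scores indicate that each summary sentence has a semantically close counterpart in the source, which is the kind of grounding signal both SBERTScore and the source-matching component of SMART aim to capture.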