Rethinking Automatic Evaluation in Sentence Simplification
- URL: http://arxiv.org/abs/2104.07560v2
- Date: Fri, 16 Apr 2021 08:37:14 GMT
- Title: Rethinking Automatic Evaluation in Sentence Simplification
- Authors: Thomas Scialom, Louis Martin, Jacopo Staiano, Éric Villemonte de la Clergerie, Benoît Sagot
- Abstract summary: We propose a simple modification of QuestEval allowing it to tackle Sentence Simplification.
We show that the latter obtains state-of-the-art correlations, outperforming standard metrics like BLEU and SARI.
We release a new corpus of evaluated simplifications, this time not generated by systems but written by humans.
- Score: 10.398614920404727
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic evaluation remains an open research question in Natural Language
Generation. In the context of Sentence Simplification, this is particularly
challenging: the task inherently requires replacing complex words with simpler
ones that share the same meaning. This limits the effectiveness of n-gram-based
metrics like BLEU. Going hand in hand with recent advances in NLG,
new metrics have been proposed, such as BERTScore for Machine Translation. In
summarization, the QuestEval metric proposes to automatically compare two texts
by questioning them.
In this paper, we first propose a simple modification of QuestEval allowing
it to tackle Sentence Simplification. We then extensively evaluate the
correlations with human judgement for several metrics, including the recent
BERTScore and QuestEval, and show that the latter obtains state-of-the-art
correlations, outperforming standard metrics like BLEU and SARI. More
importantly, we also show that a large part of these correlations is actually
spurious for all the metrics. To investigate this phenomenon further, we
release a new corpus of evaluated simplifications, this time not generated by
systems but written by humans. This allows us to remove the spurious
correlations and draw very different conclusions from the original ones,
resulting in a better understanding of these metrics. In particular, we raise
concerns about the very low correlations of most traditional metrics. Our
results show that the only significant measure of Meaning Preservation is
our adaptation of QuestEval.
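To make the evaluation protocol concrete, below is a minimal sketch of how such metric-human correlations are typically computed; the scores and ratings are hypothetical placeholders, not data from the paper.

```python
# Minimal sketch: correlating automatic metric scores with human judgements.
# All numbers below are hypothetical placeholders, not data from the paper.
from scipy.stats import pearsonr, spearmanr

# One entry per evaluated simplification (e.g., for the Meaning Preservation dimension).
metric_scores = [0.71, 0.45, 0.88, 0.32, 0.64]   # e.g., QuestEval, BERTScore, SARI, ...
human_ratings = [4.0, 2.5, 4.5, 1.5, 3.0]        # averaged annotator judgements

pearson_r, pearson_p = pearsonr(metric_scores, human_ratings)
spearman_rho, spearman_p = spearmanr(metric_scores, human_ratings)

print(f"Pearson r = {pearson_r:.3f} (p = {pearson_p:.3f})")
print(f"Spearman rho = {spearman_rho:.3f} (p = {spearman_p:.3f})")
```

Running the same computation on simplifications written by humans rather than generated by systems, as in the released corpus, is what allows the spurious part of these correlations to be isolated.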
Related papers
- Do Automatic Factuality Metrics Measure Factuality? A Critical Evaluation [21.650619533772232]
This work investigates whether and to what degree superficial attributes of summary texts suffice to predict "factuality".
We then evaluate how factuality metrics respond to factual corrections in inconsistent summaries and find that only a few show meaningful improvements.
Motivated by these insights, we show that one can "game" (most) automatic factuality metrics, i.e., reliably inflate "factuality" scores by appending innocuous sentences to generated summaries.
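As a hedged illustration of this probe (the `score_factuality` callable below is a hypothetical placeholder, not one of the metrics studied in that work):

```python
# Toy probe: does appending an innocuous sentence inflate a factuality score?
# `score_factuality` is a hypothetical stand-in for any automatic factuality metric
# that maps a (source, summary) pair to a score.
from typing import Callable

def gaming_probe(source: str, summary: str,
                 score_factuality: Callable[[str, str], float],
                 filler: str = " The report was published online.") -> tuple[float, float]:
    """Return the metric score before and after appending an innocuous sentence."""
    before = score_factuality(source, summary)
    after = score_factuality(source, summary + filler)
    return before, after
```

If the score after appending the filler is reliably higher across many summaries, the metric is rewarding superficial attributes of the text rather than factual consistency.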
arXiv Detail & Related papers (2024-11-25T18:15:15Z)
- Evaluating Document Simplification: On the Importance of Separately Assessing Simplicity and Meaning Preservation [9.618393813409266]
This paper focuses on the evaluation of document-level text simplification.
We compare existing models using distinct metrics for meaning preservation and simplification.
We introduce a reference-less metric variant for simplicity, showing that models are mostly biased towards either simplification or meaning preservation.
arXiv Detail & Related papers (2024-04-04T08:04:24Z)
- Cobra Effect in Reference-Free Image Captioning Metrics [58.438648377314436]
A proliferation of reference-free methods, leveraging visual-language pre-trained models (VLMs), has emerged.
In this paper, we study if there are any deficiencies in reference-free metrics.
We employ GPT-4V as an evaluative tool to assess generated sentences and the result reveals that our approach achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2024-02-18T12:36:23Z)
- Simplicity Level Estimate (SLE): A Learned Reference-Less Metric for Sentence Simplification [8.479659578608233]
We propose a new learned evaluation metric (SLE) for sentence simplification.
SLE focuses on simplicity, outperforming almost all existing metrics in terms of correlation with human judgements.
arXiv Detail & Related papers (2023-10-12T09:49:10Z)
- Goodhart's Law Applies to NLP's Explanation Benchmarks [57.26445915212884]
We critically examine two sets of metrics: the ERASER metrics (comprehensiveness and sufficiency) and the EVAL-X metrics.
We show that we can inflate a model's comprehensiveness and sufficiency scores dramatically without altering its predictions or explanations on in-distribution test inputs.
Our results raise doubts about the ability of current metrics to guide explainability research, underscoring the need for a broader reassessment of what precisely these metrics are intended to capture.
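For context, a minimal sketch of the two ERASER quantities under their usual definitions; `predict_proba` is a hypothetical stand-in for the probability the explained model assigns to its predicted class:

```python
# Sketch of ERASER-style comprehensiveness and sufficiency scores.
# `predict_proba(text)` is a hypothetical stand-in returning the probability the
# explained model assigns to its predicted class for the given input text.

def comprehensiveness(predict_proba, full_input: str, input_without_rationale: str) -> float:
    # How much does confidence drop when the rationale tokens are removed?
    return predict_proba(full_input) - predict_proba(input_without_rationale)

def sufficiency(predict_proba, full_input: str, rationale_only: str) -> float:
    # How close does the rationale alone get to the original confidence?
    return predict_proba(full_input) - predict_proba(rationale_only)
```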
arXiv Detail & Related papers (2023-08-28T03:03:03Z)
- Towards Multiple References Era -- Addressing Data Leakage and Limited Reference Diversity in NLG Evaluation [55.92852268168816]
N-gram matching-based evaluation metrics, such as BLEU and chrF, are widely utilized across a range of natural language generation (NLG) tasks.
Recent studies have revealed a weak correlation between these matching-based metrics and human evaluations.
We propose to utilize multiple references to enhance the consistency between these metrics and human evaluations.
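As an illustration, n-gram metrics such as BLEU and chrF already accept multiple references per hypothesis; a minimal sketch assuming the sacrebleu package, with toy sentences:

```python
# Scoring system outputs against several references with sacrebleu.
# Sentences are toy examples; sacrebleu expects one list of hypotheses and
# one list per reference "stream", each aligned with the hypotheses.
import sacrebleu

hypotheses = ["The cat sat on the mat."]
references = [
    ["The cat sat on the mat."],       # reference set 1
    ["A cat was sitting on the mat."]  # reference set 2
]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}, chrF = {chrf.score:.2f}")
```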
arXiv Detail & Related papers (2023-08-06T14:49:26Z)
- SMART: Sentences as Basic Units for Text Evaluation [48.5999587529085]
In this paper, we introduce a new metric called SMART to mitigate such limitations.
We treat sentences as basic units of matching instead of tokens, and use a sentence matching function to soft-match candidate and reference sentences.
Our results show that the system-level correlations of our proposed metric with a model-based matching function outperform those of all competing metrics.
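A rough sketch of the sentence-level soft-matching idea, using difflib's string similarity as a stand-in matching function (this is not the actual SMART implementation):

```python
# Simplified sketch of sentence-level soft matching: each candidate sentence is
# aligned to its best-matching reference sentence, and the similarities are averaged.
# difflib's ratio() is only a stand-in for a model-based matching function.
from difflib import SequenceMatcher

def soft_match(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

def sentence_level_precision(candidate_sents: list[str], reference_sents: list[str]) -> float:
    if not candidate_sents or not reference_sents:
        return 0.0
    return sum(max(soft_match(c, r) for r in reference_sents)
               for c in candidate_sents) / len(candidate_sents)
```

A recall analogue can be obtained by swapping the roles of candidate and reference sentences.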
arXiv Detail & Related papers (2022-08-01T17:58:05Z)
- On Quantitative Evaluations of Counterfactuals [88.42660013773647]
This paper consolidates work on evaluating visual counterfactual examples through an analysis and experiments.
We find that while most metrics behave as intended for sufficiently simple datasets, some fail to tell the difference between good and bad counterfactuals when the complexity increases.
We propose two new metrics, the Label Variation Score and the Oracle score, which are both less vulnerable to such tiny changes.
arXiv Detail & Related papers (2021-10-30T05:00:36Z)
- Global Explainability of BERT-Based Evaluation Metrics by Disentangling along Linguistic Factors [14.238125731862658]
We disentangle metric scores along linguistic factors, including semantics, syntax, morphology, and lexical overlap.
We show that the different metrics capture all aspects to some degree, but that they are all substantially sensitive to lexical overlap, just like BLEU and ROUGE.
arXiv Detail & Related papers (2021-10-08T22:40:33Z)
- Constructing interval variables via faceted Rasch measurement and multitask deep learning: a hate speech application [63.10266319378212]
We propose a method for measuring complex variables on a continuous, interval spectrum by combining supervised deep learning with the Constructing Measures approach to faceted Rasch item response theory (IRT).
We demonstrate this new method on a dataset of 50,000 social media comments sourced from YouTube, Twitter, and Reddit and labeled by 11,000 U.S.-based Amazon Mechanical Turk workers.
arXiv Detail & Related papers (2020-09-22T02:15:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.