LENS: A Learnable Evaluation Metric for Text Simplification
- URL: http://arxiv.org/abs/2212.09739v4
- Date: Fri, 7 Jul 2023 20:41:23 GMT
- Title: LENS: A Learnable Evaluation Metric for Text Simplification
- Authors: Mounica Maddela, Yao Dou, David Heineman, Wei Xu
- Abstract summary: We present LENS, a learnable evaluation metric for text simplification.
We also introduce Rank and Rate, a human evaluation framework that rates simplifications from several models in a list-wise manner.
- Score: 17.48383068498169
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Training learnable metrics using modern language models has recently emerged
as a promising method for the automatic evaluation of machine translation.
However, existing human evaluation datasets for text simplification have
limited annotations that are based on unitary or outdated models, making them
unsuitable for this approach. To address these issues, we introduce the
SimpEval corpus that contains: SimpEval_past, comprising 12K human ratings on
2.4K simplifications of 24 past systems, and SimpEval_2022, a challenging
simplification benchmark consisting of over 1K human ratings of 360
simplifications including GPT-3.5 generated text. Training on SimpEval, we
present LENS, a Learnable Evaluation Metric for Text Simplification. Extensive
empirical results show that LENS correlates much better with human judgment
than existing metrics, paving the way for future progress in the evaluation of
text simplification. We also introduce Rank and Rate, a human evaluation
framework that rates simplifications from several models in a list-wise manner
using an interactive interface, which ensures both consistency and accuracy in
the evaluation process and is used to create the SimpEval datasets.
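To make the idea of a "learnable" metric concrete, here is a minimal sketch of one common way such metrics are built: fine-tuning a pretrained encoder with a regression head so that its score for a (complex sentence, simplification) pair tracks a human rating. This is an illustration only, assuming a RoBERTa-style encoder and a SimpEval-style rating triple; the class name and example below are hypothetical, and this is not the actual LENS architecture described in the paper.
```python
# Illustrative sketch of a learnable simplification metric: a pretrained
# encoder plus a regression head trained to predict human ratings.
# NOT the official LENS implementation; model name, class name, and the
# training example below are hypothetical.
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "roberta-base"  # assumption: any strong pretrained encoder would do

class LearnableSimplificationMetric(nn.Module):
    def __init__(self, model_name: str = MODEL_NAME):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.regressor = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0]       # first-token ("CLS") representation
        return self.regressor(pooled).squeeze(-1)  # predicted quality score

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
metric = LearnableSimplificationMetric()

# One hypothetical SimpEval-style example: complex source, system
# simplification, and a human rating normalized to [0, 1].
complex_sent = "The committee deliberated extensively before reaching a verdict."
simplification = "The committee talked for a long time before deciding."
human_rating = torch.tensor([0.85])

batch = tokenizer(complex_sent, simplification,
                  return_tensors="pt", truncation=True, padding=True)
pred = metric(batch["input_ids"], batch["attention_mask"])
loss = nn.functional.mse_loss(pred, human_rating)  # regress onto the human score
loss.backward()                                    # one illustrative training step
```
At inference time, the output of the regression head would be used directly as the metric value for a candidate simplification.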
Related papers
- Beyond Coarse-Grained Matching in Video-Text Retrieval [50.799697216533914]
We introduce a new approach for fine-grained evaluation.
Our approach can be applied to existing datasets by automatically generating hard negative test captions.
Experiments on our fine-grained evaluations demonstrate that this approach enhances a model's ability to understand fine-grained differences.
arXiv Detail & Related papers (2024-10-16T09:42:29Z)
- A Comparative Study of Quality Evaluation Methods for Text Summarization [0.5512295869673147]
This paper proposes a novel method based on large language models (LLMs) for evaluating text summarization.
Our results show that LLM-based evaluation aligns closely with human evaluation, while widely used automatic metrics such as ROUGE-2, BERTScore, and SummaC do not, and also lack consistency.
arXiv Detail & Related papers (2024-06-30T16:12:37Z)
- SemScore: Automated Evaluation of Instruction-Tuned LLMs based on Semantic Textual Similarity [3.3162484539136416]
We propose a simple but remarkably effective evaluation metric called SemScore.
We compare model outputs to gold target responses using semantic textual similarity (STS).
We find that SemScore outperforms all other evaluation metrics, many of them more complex, in terms of correlation with human evaluation.
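For illustration, a minimal sketch of such an STS-based comparison using the sentence-transformers library; the embedding model name and example strings are assumptions here, not necessarily the exact SemScore setup.
```python
# Minimal sketch of comparing a model output to a gold response via
# semantic textual similarity (illustrative; not the exact SemScore setup).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")  # assumption: any sentence-embedding model

model_output = "Paris is the capital of France and also its largest city."
gold_response = "The capital of France is Paris."

embeddings = model.encode([model_output, gold_response], convert_to_tensor=True)
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()  # cosine similarity in [-1, 1]
print(f"STS score: {similarity:.3f}")
```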
arXiv Detail & Related papers (2024-01-30T14:52:50Z)
- Towards Better Evaluation of Instruction-Following: A Case-Study in Summarization [9.686937153317809]
We perform a meta-evaluation of a variety of metrics to quantify how accurately they measure the instruction-following abilities of large language models.
Using riSum, an evaluation dataset collected for this study, we analyze the agreement between evaluation methods and human judgment.
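In such meta-evaluations, "agreement with human judgment" is typically reported as a correlation between metric scores and human ratings over the same outputs. A generic sketch with invented numbers (not riSum data):
```python
# Generic meta-evaluation step: correlate automatic metric scores with human
# ratings over the same outputs. The numbers below are invented for illustration.
from scipy.stats import kendalltau, pearsonr

human_ratings = [4.5, 3.0, 2.5, 4.0, 1.5]       # hypothetical human scores
metric_scores = [0.81, 0.62, 0.55, 0.74, 0.40]  # hypothetical metric scores

tau, tau_p = kendalltau(human_ratings, metric_scores)
r, r_p = pearsonr(human_ratings, metric_scores)
print(f"Kendall tau = {tau:.2f} (p = {tau_p:.3f}); Pearson r = {r:.2f} (p = {r_p:.3f})")
```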
arXiv Detail & Related papers (2023-10-12T15:07:11Z)
- APPLS: Evaluating Evaluation Metrics for Plain Language Summarization [18.379461020500525]
This study introduces a granular meta-evaluation testbed, APPLS, designed to evaluate metrics for Plain Language Summarization (PLS).
We identify four PLS criteria from previous work and define a set of perturbations corresponding to these criteria that sensitive metrics should be able to detect.
Using APPLS, we assess the performance of 14 metrics, including automated scores, lexical features, and LLM prompt-based evaluations.
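As a toy illustration of this kind of perturbation-based meta-evaluation (the actual APPLS criteria and perturbations are defined in the paper), one can apply a controlled perturbation to a plain-language summary and check whether a candidate metric's score moves in the expected direction:
```python
# Toy illustration of perturbation-based meta-evaluation; the perturbation and
# the metric below are placeholders, not the actual APPLS perturbations or metrics.
def add_jargon(text: str) -> str:
    """Hypothetical perturbation: reintroduce jargon into a plain-language summary."""
    return text.replace("high blood pressure", "arterial hypertension")

def toy_simplicity_metric(summary: str) -> float:
    """Placeholder metric that rewards shorter average word length."""
    words = summary.split()
    return 1.0 / (sum(len(w) for w in words) / len(words))

original = "The drug lowers high blood pressure in most adults."
perturbed = add_jargon(original)

# A metric sensitive to the lexical-simplicity criterion should score the
# perturbed (more complex) version lower than the original.
assert toy_simplicity_metric(original) > toy_simplicity_metric(perturbed)
print(toy_simplicity_metric(original), toy_simplicity_metric(perturbed))
```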
arXiv Detail & Related papers (2023-05-23T17:59:19Z)
- INSTRUCTSCORE: Explainable Text Generation Evaluation with Finegrained Feedback [80.57617091714448]
We present InstructScore, an explainable evaluation metric for text generation.
We fine-tune a text evaluation metric based on LLaMA, producing a score for the generated text and a human-readable diagnostic report.
arXiv Detail & Related papers (2023-05-23T17:27:22Z)
- Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation [136.16507050034755]
Existing human evaluation studies for summarization either exhibit low inter-annotator agreement or have insufficient scale.
We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units.
We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems.
arXiv Detail & Related papers (2022-12-15T17:26:05Z)
- Enabling Classifiers to Make Judgements Explicitly Aligned with Human Values [73.82043713141142]
Many NLP classification tasks, such as sexism/racism detection or toxicity detection, are based on human values.
We introduce a framework for value-aligned classification that performs prediction based on explicitly written human values in the command.
arXiv Detail & Related papers (2022-10-14T09:10:49Z)
- Simple-QE: Better Automatic Quality Estimation for Text Simplification [22.222195626377907]
We propose Simple-QE, a BERT-based quality estimation (QE) model adapted from prior summarization QE work.
We show that Simple-QE correlates well with human quality judgments.
We also show that we can adapt this approach to accurately predict the complexity of human-written texts.
arXiv Detail & Related papers (2020-12-22T22:02:37Z)
- Re-evaluating Evaluation in Text Summarization [77.4601291738445]
We re-evaluate the evaluation method for text summarization using top-scoring system outputs.
We find that conclusions about evaluation metrics on older datasets do not necessarily hold on modern datasets and systems.
arXiv Detail & Related papers (2020-10-14T13:58:53Z)
- SummEval: Re-evaluating Summarization Evaluation [169.622515287256]
We re-evaluate 14 automatic evaluation metrics in a comprehensive and consistent fashion.
We benchmark 23 recent summarization models using the aforementioned automatic evaluation metrics.
We assemble the largest collection of summaries generated by models trained on the CNN/DailyMail news dataset.
arXiv Detail & Related papers (2020-07-24T16:25:19Z)