Not All Errors are Equal: Learning Text Generation Metrics using
Stratified Error Synthesis
- URL: http://arxiv.org/abs/2210.05035v1
- Date: Mon, 10 Oct 2022 22:30:26 GMT
- Title: Not All Errors are Equal: Learning Text Generation Metrics using
Stratified Error Synthesis
- Authors: Wenda Xu, Yilin Tuan, Yujie Lu, Michael Saxon, Lei Li, William Yang
Wang
- Abstract summary: We introduce SESCORE, a model-based metric that is highly correlated with human judgements without requiring human annotation.
We evaluate SESCORE against existing metrics by comparing how their scores correlate with human ratings.
SESCORE even achieves comparable performance to the best supervised metric COMET, despite receiving no human-annotated training data.
- Score: 79.18261352971284
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Is it possible to build a general and automatic natural language generation
(NLG) evaluation metric? Existing learned metrics either perform
unsatisfactorily or are restricted to tasks for which large-scale human rating data is
already available. We introduce SESCORE, a model-based metric that is highly
correlated with human judgements without requiring human annotation, by
utilizing a novel, iterative error synthesis and severity scoring pipeline.
This pipeline applies a series of plausible errors to raw text and assigns
severity labels by simulating human judgements with entailment. We evaluate
SESCORE against existing metrics by comparing how their scores correlate with
human ratings. SESCORE outperforms all prior unsupervised metrics on multiple
diverse NLG tasks including machine translation, image captioning, and WebNLG
text generation. For WMT 20/21 En-De and Zh-En, SESCORE improves the average
Kendall correlation with human judgement from 0.154 to 0.195. SESCORE even
achieves comparable performance to the best supervised metric COMET, despite
receiving no human-annotated training data.
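The pipeline described above has two concrete pieces: plausible errors are applied to raw text and weighted by a severity label obtained from an entailment model, and competing metrics are compared by their Kendall correlation with human ratings. The Python sketch below illustrates both ideas under stated assumptions; it is not the authors' implementation. The single word-drop perturbation, the roberta-large-mnli checkpoint, the 0.5 entailment threshold, and the -1/-5 minor/major severity weights are all illustrative choices.

```python
"""Illustrative sketch of entailment-weighted error synthesis and
Kendall-correlation evaluation (not the SESCORE implementation)."""
import random

import torch
from scipy.stats import kendalltau
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NLI_NAME = "roberta-large-mnli"  # assumed checkpoint, for illustration only
tokenizer = AutoTokenizer.from_pretrained(NLI_NAME)
nli_model = AutoModelForSequenceClassification.from_pretrained(NLI_NAME).eval()


def entailment_prob(premise: str, hypothesis: str) -> float:
    """P(entailment) that `hypothesis` follows from `premise`."""
    enc = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = nli_model(**enc).logits
    # roberta-large-mnli label order: 0=contradiction, 1=neutral, 2=entailment
    return logits.softmax(dim=-1)[0, 2].item()


def perturb(text: str, rng: random.Random) -> str:
    """Apply one toy error: drop a random token (a stand-in for the paper's
    richer set of plausible errors)."""
    tokens = text.split()
    if len(tokens) > 1:
        tokens.pop(rng.randrange(len(tokens)))
    return " ".join(tokens)


def synthesize_example(reference: str, n_errors: int = 2, seed: int = 0):
    """Iteratively corrupt `reference`, assigning each error a severity
    weight (-1 minor, -5 major -- assumed values) based on whether the
    corrupted text is still entailed by the previous version."""
    rng = random.Random(seed)
    current, score = reference, 0.0
    for _ in range(n_errors):
        corrupted = perturb(current, rng)
        severity = -1.0 if entailment_prob(current, corrupted) > 0.5 else -5.0
        score += severity
        current = corrupted
    return current, score  # (corrupted text, pseudo human-style rating)


def kendall_vs_human(metric_scores, human_ratings) -> float:
    """Segment-level Kendall correlation between a metric and human ratings."""
    tau, _ = kendalltau(metric_scores, human_ratings)
    return tau
```

A learned metric would then be trained to regress onto the synthesized (text, pseudo-score) pairs, and kendall_vs_human compares its outputs with WMT-style human judgements, the comparison behind the 0.154 to 0.195 numbers in the abstract.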
Related papers
- TIGERScore: Towards Building Explainable Metric for All Text Generation Tasks [44.801746603656504]
We present TIGERScore, a metric that follows Instruction Guidance to perform Explainable and Reference-free evaluation.
Our metric is based on LLaMA-2, trained on our meticulously curated instruction-tuning dataset MetricInstruct.
arXiv Detail & Related papers (2023-10-01T18:01:51Z)
- INSTRUCTSCORE: Explainable Text Generation Evaluation with Finegrained Feedback [80.57617091714448]
We present InstructScore, an explainable evaluation metric for text generation.
We fine-tune a text evaluation metric based on LLaMA, producing a score for generated text and a human-readable diagnostic report.
arXiv Detail & Related papers (2023-05-23T17:27:22Z)
- On the Blind Spots of Model-Based Evaluation Metrics for Text Generation [79.01422521024834]
We explore a useful but often neglected methodology for robustness analysis of text generation evaluation metrics.
We design and synthesize a wide range of potential errors and check whether they result in a commensurate drop in the metric scores.
Our experiments reveal interesting insensitivities, biases, or even loopholes in existing metrics.
arXiv Detail & Related papers (2022-12-20T06:24:25Z)
- BUMP: A Benchmark of Unfaithful Minimal Pairs for Meta-Evaluation of Faithfulness Metrics [70.52570641514146]
We present a benchmark of unfaithful minimal pairs (BUMP).
BUMP is a dataset of 889 human-written, minimally different summary pairs.
Unlike non-pair-based datasets, BUMP can be used to measure the consistency of metrics.
arXiv Detail & Related papers (2022-12-20T02:17:30Z)
- SESCORE2: Learning Text Generation Evaluation via Synthesizing Realistic Mistakes [93.19166902594168]
We propose SESCORE2, a self-supervised approach for training a model-based metric for text generation evaluation.
The key concept is to synthesize realistic model mistakes by perturbing sentences retrieved from a corpus.
We evaluate SESCORE2 and previous methods on four text generation tasks across three languages.
arXiv Detail & Related papers (2022-12-19T09:02:16Z)
- CTRLEval: An Unsupervised Reference-Free Metric for Evaluating Controlled Text Generation [85.03709740727867]
We propose an unsupervised reference-free metric called CTRLEval to evaluate controlled text generation models.
CTRLEval assembles the generation probabilities from a pre-trained language model without any model training (a generic sketch of this idea follows after this entry).
Experimental results show that our metric has higher correlations with human judgments than other baselines.
arXiv Detail & Related papers (2022-04-02T13:42:49Z)
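The CTRLEval entry above notes that generation probabilities from a pre-trained language model can be assembled into a score with no model training. Below is a generic, minimal illustration of that idea using average token log-probability under GPT-2; the checkpoint and the simple mean-log-prob formulation are assumptions for illustration, not CTRLEval's actual formulation.

```python
"""Generic illustration of scoring text with a pretrained LM's token
probabilities (not the CTRLEval metric itself)."""
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

LM_NAME = "gpt2"  # assumed checkpoint, for illustration only
tok = AutoTokenizer.from_pretrained(LM_NAME)
lm = AutoModelForCausalLM.from_pretrained(LM_NAME).eval()


def mean_log_prob(text: str) -> float:
    """Average per-token log-probability of `text` under the LM; higher
    values indicate more fluent/likely text, with no task-specific training."""
    enc = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = lm(**enc, labels=enc["input_ids"])
    # `out.loss` is the mean token-level cross-entropy, i.e. -mean log-prob.
    return -out.loss.item()


print(mean_log_prob("The cat sat on the mat."))
print(mean_log_prob("Mat the on sat cat the."))  # should score lower
```

Reference-free scores of this kind reward fluent, likely text; the actual metric assembles such probabilities in a more structured, aspect-specific way.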