Not All Errors are Equal: Learning Text Generation Metrics using
Stratified Error Synthesis
- URL: http://arxiv.org/abs/2210.05035v1
- Date: Mon, 10 Oct 2022 22:30:26 GMT
- Title: Not All Errors are Equal: Learning Text Generation Metrics using
Stratified Error Synthesis
- Authors: Wenda Xu, Yilin Tuan, Yujie Lu, Michael Saxon, Lei Li, William Yang
Wang
- Abstract summary: We introduce SESCORE, a model-based metric that is highly correlated with human judgements without requiring human annotation.
We evaluate SESCORE against existing metrics by comparing how their scores correlate with human ratings.
SESCORE even achieves comparable performance to the best supervised metric COMET, despite receiving no human-annotated training data.
- Score: 79.18261352971284
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Is it possible to build a general and automatic natural language generation
(NLG) evaluation metric? Existing learned metrics either perform
unsatisfactorily or are restricted to tasks for which large-scale human rating data is
already available. We introduce SESCORE, a model-based metric that is highly
correlated with human judgements without requiring human annotation, by
utilizing a novel, iterative error synthesis and severity scoring pipeline.
This pipeline applies a series of plausible errors to raw text and assigns
severity labels by simulating human judgements with entailment. We evaluate
SESCORE against existing metrics by comparing how their scores correlate with
human ratings. SESCORE outperforms all prior unsupervised metrics on multiple
diverse NLG tasks including machine translation, image captioning, and WebNLG
text generation. For WMT 20/21 En-De and Zh-En, SESCORE improves the average
Kendall correlation with human judgement from 0.154 to 0.195. SESCORE even
achieves comparable performance to the best supervised metric COMET, despite
receiving no human-annotated training data.
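The pipeline described above has two concrete pieces: plausible errors are applied to raw text and weighted by a severity label obtained from an entailment model, and competing metrics are compared by their Kendall correlation with human ratings. The Python sketch below illustrates both ideas under stated assumptions; it is not the authors' implementation. The single word-drop perturbation, the roberta-large-mnli checkpoint, the 0.5 entailment threshold, and the -1/-5 minor/major severity weights are all illustrative choices.

```python
"""Illustrative sketch of entailment-weighted error synthesis and
Kendall-correlation evaluation (not the SESCORE implementation)."""
import random

import torch
from scipy.stats import kendalltau
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NLI_NAME = "roberta-large-mnli"  # assumed checkpoint, for illustration only
tokenizer = AutoTokenizer.from_pretrained(NLI_NAME)
nli_model = AutoModelForSequenceClassification.from_pretrained(NLI_NAME).eval()


def entailment_prob(premise: str, hypothesis: str) -> float:
    """P(entailment) that `hypothesis` follows from `premise`."""
    enc = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = nli_model(**enc).logits
    # roberta-large-mnli label order: 0=contradiction, 1=neutral, 2=entailment
    return logits.softmax(dim=-1)[0, 2].item()


def perturb(text: str, rng: random.Random) -> str:
    """Apply one toy error: drop a random token (a stand-in for the paper's
    richer set of plausible errors)."""
    tokens = text.split()
    if len(tokens) > 1:
        tokens.pop(rng.randrange(len(tokens)))
    return " ".join(tokens)


def synthesize_example(reference: str, n_errors: int = 2, seed: int = 0):
    """Iteratively corrupt `reference`, assigning each error a severity
    weight (-1 minor, -5 major -- assumed values) based on whether the
    corrupted text is still entailed by the previous version."""
    rng = random.Random(seed)
    current, score = reference, 0.0
    for _ in range(n_errors):
        corrupted = perturb(current, rng)
        severity = -1.0 if entailment_prob(current, corrupted) > 0.5 else -5.0
        score += severity
        current = corrupted
    return current, score  # (corrupted text, pseudo human-style rating)


def kendall_vs_human(metric_scores, human_ratings) -> float:
    """Segment-level Kendall correlation between a metric and human ratings."""
    tau, _ = kendalltau(metric_scores, human_ratings)
    return tau
```

A learned metric would then be trained to regress onto the synthesized (text, pseudo-score) pairs, and kendall_vs_human compares its outputs with WMT-style human judgements, the comparison behind the 0.154 to 0.195 numbers in the abstract.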
Related papers
- TIGERScore: Towards Building Explainable Metric for All Text Generation Tasks [44.801746603656504]
We present TIGERScore, a metric that follows Instruction Guidance to perform Explainable and Reference-free evaluation.
Our metric is based on LLaMA-2, trained on our meticulously curated instruction-tuning dataset MetricInstruct.
arXiv Detail & Related papers (2023-10-01T18:01:51Z)
- INSTRUCTSCORE: Explainable Text Generation Evaluation with Finegrained Feedback [80.57617091714448]
We present InstructScore, an explainable evaluation metric for text generation.
We fine-tune a text evaluation metric based on LLaMA, producing a score for generated text and a human-readable diagnostic report.
arXiv Detail & Related papers (2023-05-23T17:27:22Z)
- On the Blind Spots of Model-Based Evaluation Metrics for Text Generation [79.01422521024834]
We explore a useful but often neglected methodology for robustness analysis of text generation evaluation metrics.
We design and synthesize a wide range of potential errors and check whether they result in a commensurate drop in the metric scores.
Our experiments reveal interesting insensitivities, biases, or even loopholes in existing metrics.
arXiv Detail & Related papers (2022-12-20T06:24:25Z)
- BUMP: A Benchmark of Unfaithful Minimal Pairs for Meta-Evaluation of Faithfulness Metrics [70.52570641514146]
We present a benchmark of unfaithful minimal pairs (BUMP).
BUMP is a dataset of 889 human-written, minimally different summary pairs.
Unlike non-pair-based datasets, BUMP can be used to measure the consistency of metrics.
arXiv Detail & Related papers (2022-12-20T02:17:30Z)
- SESCORE2: Learning Text Generation Evaluation via Synthesizing Realistic Mistakes [93.19166902594168]
We propose SESCORE2, a self-supervised approach for training a model-based metric for text generation evaluation.
The key concept is to synthesize realistic model mistakes by perturbing sentences retrieved from a corpus.
We evaluate SESCORE2 and previous methods on four text generation tasks across three languages.
arXiv Detail & Related papers (2022-12-19T09:02:16Z)
- CTRLEval: An Unsupervised Reference-Free Metric for Evaluating Controlled Text Generation [85.03709740727867]
We propose an unsupervised reference-free metric called CTRLEval to evaluate controlled text generation models.
CTRLEval assembles the generation probabilities from a pre-trained language model without any model training (a generic sketch of this idea follows after this entry).
Experimental results show that our metric has higher correlations with human judgments than other baselines.
arXiv Detail & Related papers (2022-04-02T13:42:49Z)
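The CTRLEval entry above notes that generation probabilities from a pre-trained language model can be assembled into a score with no model training. Below is a generic, minimal illustration of that idea using average token log-probability under GPT-2; the checkpoint and the simple mean-log-prob formulation are assumptions for illustration, not CTRLEval's actual formulation.

```python
"""Generic illustration of scoring text with a pretrained LM's token
probabilities (not the CTRLEval metric itself)."""
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

LM_NAME = "gpt2"  # assumed checkpoint, for illustration only
tok = AutoTokenizer.from_pretrained(LM_NAME)
lm = AutoModelForCausalLM.from_pretrained(LM_NAME).eval()


def mean_log_prob(text: str) -> float:
    """Average per-token log-probability of `text` under the LM; higher
    values indicate more fluent/likely text, with no task-specific training."""
    enc = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = lm(**enc, labels=enc["input_ids"])
    # `out.loss` is the mean token-level cross-entropy, i.e. -mean log-prob.
    return -out.loss.item()


print(mean_log_prob("The cat sat on the mat."))
print(mean_log_prob("Mat the on sat cat the."))  # should score lower
```

Reference-free scores of this kind reward fluent, likely text; the actual metric assembles such probabilities in a more structured, aspect-specific way.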