Toward Human-Like Evaluation for Natural Language Generation with Error
Analysis
- URL: http://arxiv.org/abs/2212.10179v1
- Date: Tue, 20 Dec 2022 11:36:22 GMT
- Title: Toward Human-Like Evaluation for Natural Language Generation with Error
Analysis
- Authors: Qingyu Lu, Liang Ding, Liping Xie, Kanjian Zhang, Derek F. Wong,
Dacheng Tao
- Abstract summary: Recent studies show that considering both major errors (e.g. mistranslated tokens) and minor errors can produce high-quality human judgments.
This inspires us to approach the final goal of the evaluation metrics (human-like evaluations) by automatic error analysis.
We augment BARTScore by incorporating human-like error analysis strategies, yielding BARTScore++, whose final score combines the evaluations of major errors and minor errors.
- Score: 93.34894810865364
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: The state-of-the-art language model-based automatic metrics, e.g. BARTScore,
benefiting from large-scale contextualized pre-training, have been successfully
used in a wide range of natural language generation (NLG) tasks, including
machine translation, text summarization, and data-to-text. Recent studies show
that considering both major errors (e.g. mistranslated tokens) and minor errors
(e.g. imperfections in fluency) can produce high-quality human judgments. This
inspires us to approach the final goal of the evaluation metrics (human-like
evaluations) by automatic error analysis. To this end, we augment BARTScore by
incorporating human-like error analysis strategies, yielding BARTScore++,
whose final score combines the evaluations of major errors and
minor errors. Experimental results show that BARTScore++ can consistently
improve the performance of vanilla BARTScore and outperform existing
top-scoring metrics in 20 out of 25 test settings. We hope our technique can
also be extended to other pre-trained model-based metrics. We will release our
code and scripts to facilitate the community.
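
To make the combination idea concrete, here is a minimal Python sketch of a BARTScore-style scorer that evaluates major and minor errors separately and mixes them into one score. It uses Hugging Face transformers with a pretrained BART checkpoint; the threshold rule for separating major from minor error tokens, the weight `alpha`, the checkpoint choice, and the function names are illustrative assumptions, not the formulation actually used in BARTScore++.

```python
# Sketch only: the error-identification rule and the weighting below are
# assumptions for illustration, not the BARTScore++ formulation.
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

MODEL_NAME = "facebook/bart-large-cnn"  # any seq2seq BART checkpoint works here
tokenizer = BartTokenizer.from_pretrained(MODEL_NAME)
model = BartForConditionalGeneration.from_pretrained(MODEL_NAME).eval()

def token_logprobs(source: str, hypothesis: str) -> torch.Tensor:
    """Per-token log-probabilities of the hypothesis given the source (BARTScore-style)."""
    src = tokenizer(source, return_tensors="pt", truncation=True)
    tgt = tokenizer(hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(input_ids=src.input_ids,
                       attention_mask=src.attention_mask,
                       labels=tgt.input_ids).logits
    logprobs = torch.log_softmax(logits, dim=-1)
    # Log-probability assigned to each hypothesis token.
    return logprobs[0].gather(1, tgt.input_ids[0].unsqueeze(-1)).squeeze(-1)

def combined_score(source: str, hypothesis: str,
                   major_threshold: float = -4.0, alpha: float = 0.7) -> float:
    """Treat tokens below `major_threshold` as major errors and the rest as minor
    imperfections, then return a weighted sum of the two average log-probabilities."""
    lp = token_logprobs(source, hypothesis)
    major, minor = lp[lp < major_threshold], lp[lp >= major_threshold]
    major_eval = major.mean().item() if major.numel() else 0.0
    minor_eval = minor.mean().item() if minor.numel() else 0.0
    return alpha * major_eval + (1.0 - alpha) * minor_eval

print(combined_score("The cat sat on the mat.", "A cat was sitting on the mat."))
```

Higher (less negative) values indicate fewer and milder detected problems under this toy rule.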
Related papers
- Evaluating Mathematical Reasoning Beyond Accuracy [50.09931172314218]
We introduce ReasonEval, a new methodology for evaluating the quality of reasoning steps.
We show that ReasonEval achieves state-of-the-art performance on human-labeled datasets.
We observe that ReasonEval can play a significant role in data selection.
arXiv Detail & Related papers (2024-04-08T17:18:04Z)
- SUT: Active Defects Probing for Transcompiler Models [24.01532199512389]
We introduce new metrics for programming language translation that address basic syntax errors.
Experiments have shown that even powerful models like ChatGPT still make mistakes on these basic unit tests.
arXiv Detail & Related papers (2023-10-22T07:16:02Z)
- MISMATCH: Fine-grained Evaluation of Machine-generated Text with Mismatch Error Types [68.76742370525234]
We propose a new evaluation scheme to model human judgments in 7 NLP tasks, based on the fine-grained mismatches between a pair of texts.
Inspired by the recent efforts in several NLP tasks for fine-grained evaluation, we introduce a set of 13 mismatch error types.
We show that the mismatch errors between the sentence pairs on the held-out datasets from 7 NLP tasks align well with the human evaluation.
arXiv Detail & Related papers (2023-06-18T01:38:53Z)
- INSTRUCTSCORE: Explainable Text Generation Evaluation with Finegrained Feedback [80.57617091714448]
We present InstructScore, an explainable evaluation metric for text generation.
We fine-tune a text evaluation metric based on LLaMA, producing a score for generated text and a human-readable diagnostic report.
arXiv Detail & Related papers (2023-05-23T17:27:22Z)
- ICE-Score: Instructing Large Language Models to Evaluate Code [7.556444391696562]
We propose ICE-Score, a new evaluation metric that instructs large language models to perform code assessments.
Our metric addresses the limitations of existing approaches by achieving superior correlations with functional correctness and human preferences.
Our results demonstrate that our metric surpasses state-of-the-art metrics for code generation.
arXiv Detail & Related papers (2023-04-27T16:38:17Z)
- On the Blind Spots of Model-Based Evaluation Metrics for Text Generation [79.01422521024834]
We explore a useful but often neglected methodology for robustness analysis of text generation evaluation metrics.
We design and synthesize a wide range of potential errors and check whether they result in a commensurate drop in the metric scores (a minimal sketch of this check appears after this list).
Our experiments reveal interesting insensitivities, biases, or even loopholes in existing metrics.
arXiv Detail & Related papers (2022-12-20T06:24:25Z)
- Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis [79.18261352971284]
We introduce SESCORE, a model-based metric that is highly correlated with human judgements without requiring human annotation.
We evaluate SESCORE against existing metrics by comparing how their scores correlate with human ratings.
SESCORE even achieves comparable performance to the best supervised metric COMET, despite receiving no human-annotated training data.
arXiv Detail & Related papers (2022-10-10T22:30:26Z)
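
The perturbation check described in the "On the Blind Spots of Model-Based Evaluation Metrics for Text Generation" entry above can be sketched in a few lines of Python: synthesize an error, re-score, and flag any case where the score does not drop by a sensible margin. The toy token-overlap metric, the two synthetic errors, and the `min_drop` margin below are stand-ins for illustration, not the paper's actual error taxonomy or the metrics it studies.

```python
# Sketch only: a toy overlap metric and two synthetic errors stand in for the
# model-based metrics and the broader error taxonomy studied in the paper.
import random

def toy_metric(hypothesis: str, reference: str) -> float:
    """Toy token-overlap score used in place of a model-based metric under test."""
    hyp, ref = hypothesis.lower().split(), reference.lower().split()
    return sum(tok in ref for tok in hyp) / len(hyp) if hyp else 0.0

def negate(text: str) -> str:
    """Synthetic negation error: insert 'not' after the first token."""
    toks = text.split()
    return " ".join(toks[:1] + ["not"] + toks[1:])

def drop_token(text: str, rng: random.Random) -> str:
    """Synthetic omission error: delete one randomly chosen token."""
    toks = text.split()
    i = rng.randrange(len(toks))
    return " ".join(toks[:i] + toks[i + 1:])

def blind_spot_check(metric, hypothesis: str, reference: str,
                     min_drop: float = 0.05, seed: int = 0) -> dict:
    """For each synthetic error, report the score drop and whether it meets
    `min_drop`; a False entry marks a potential blind spot of the metric."""
    rng = random.Random(seed)
    base = metric(hypothesis, reference)
    perturbations = {"negation": negate(hypothesis),
                     "omission": drop_token(hypothesis, rng)}
    report = {}
    for name, perturbed in perturbations.items():
        drop = base - metric(perturbed, reference)
        report[name] = (round(drop, 3), drop >= min_drop)
    return report

reference = "the committee approved the new budget on friday"
hypothesis = "the committee approved the new budget on friday"
print(blind_spot_check(toy_metric, hypothesis, reference))
```

With this toy metric, the negation perturbation produces a clear score drop while the omission does not, which is exactly the kind of insensitivity such a check is meant to surface.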