MISMATCH: Fine-grained Evaluation of Machine-generated Text with
Mismatch Error Types
- URL: http://arxiv.org/abs/2306.10452v1
- Date: Sun, 18 Jun 2023 01:38:53 GMT
- Title: MISMATCH: Fine-grained Evaluation of Machine-generated Text with
Mismatch Error Types
- Authors: Keerthiram Murugesan, Sarathkrishna Swaminathan, Soham Dan, Subhajit
Chaudhury, Chulaka Gunasekara, Maxwell Crouse, Diwakar Mahajan, Ibrahim
Abdelaziz, Achille Fokoue, Pavan Kapanipathi, Salim Roukos, Alexander Gray
- Abstract summary: We propose a new evaluation scheme to model human judgments in 7 NLP tasks, based on the fine-grained mismatches between a pair of texts.
Inspired by the recent efforts in several NLP tasks for fine-grained evaluation, we introduce a set of 13 mismatch error types.
We show that the mismatch errors between the sentence pairs on the held-out datasets from 7 NLP tasks align well with the human evaluation.
- Score: 68.76742370525234
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the growing interest in large language models, the need to evaluate
the quality of machine-generated text against reference (typically human-generated)
text has become a focal point of attention. Most recent works focus either on
task-specific evaluation metrics or study the properties of machine-generated
text captured by the existing metrics. In this work, we propose a new
evaluation scheme to model human judgments in 7 NLP tasks, based on the
fine-grained mismatches between a pair of texts. Inspired by the recent efforts
in several NLP tasks for fine-grained evaluation, we introduce a set of 13
mismatch error types, such as spatial/geographic errors, entity errors, etc., to
guide the model for better prediction of human judgments. We propose a neural
framework for evaluating machine texts that uses these mismatch error types as
auxiliary tasks and re-purposes the existing single-number evaluation metrics
as additional scalar features, in addition to textual features extracted from
the machine and reference texts. Our experiments reveal key insights about the
existing metrics via the mismatch errors. We show that the mismatch errors
between the sentence pairs on the held-out datasets from 7 NLP tasks align well
with the human evaluation.
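Concretely, the proposed framework can be read as a multi-task model: encoded features of the (machine, reference) text pair are concatenated with existing single-number metrics as scalar features, and a shared trunk feeds both a human-judgment head and auxiliary heads for the 13 mismatch error types. The PyTorch sketch below is only a minimal illustration under those assumptions; the layer sizes, feature dimensions, and loss weighting are hypothetical, not the authors' implementation.

```python
import torch
import torch.nn as nn

NUM_ERROR_TYPES = 13  # mismatch error types used as auxiliary targets
NUM_METRICS = 4       # e.g. BLEU, ROUGE, BERTScore, METEOR as scalar features

class MismatchScorer(nn.Module):
    """Sketch of a multi-task scorer: pooled text features for the
    (machine, reference) pair plus existing metric scores feed a shared
    trunk with a human-judgment head and an auxiliary error-type head."""
    def __init__(self, text_dim: int = 768, hidden: int = 256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(text_dim + NUM_METRICS, hidden), nn.ReLU())
        self.judgment_head = nn.Linear(hidden, 1)             # human score
        self.error_head = nn.Linear(hidden, NUM_ERROR_TYPES)  # 13 error types

    def forward(self, text_feats, metric_feats):
        h = self.trunk(torch.cat([text_feats, metric_feats], dim=-1))
        return self.judgment_head(h).squeeze(-1), self.error_head(h)

def joint_loss(pred_score, pred_errors, gold_score, gold_errors, aux_weight=0.5):
    """Main regression loss plus a multi-label auxiliary loss on error types."""
    main = nn.functional.mse_loss(pred_score, gold_score)
    aux = nn.functional.binary_cross_entropy_with_logits(pred_errors, gold_errors)
    return main + aux_weight * aux
```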
Related papers
- Correction of Errors in Preference Ratings from Automated Metrics for
Text Generation [4.661309379738428]
We propose a statistical model of Text Generation evaluation that accounts for the error-proneness of automated metrics.
We show that our model enables an efficient combination of human and automated ratings to remedy the error-proneness of the automated metrics.
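The summary above does not specify the model; as a rough illustration only, precision-weighted (inverse-variance) pooling of sparse human ratings with an error-prone automated score captures the basic idea of letting the noisier signal count for less. This is a generic textbook estimator, not the paper's statistical model; the variance values are hypothetical.

```python
import numpy as np

def pooled_rating(human_ratings, metric_score, metric_error_var):
    """Inverse-variance pooling: human ratings are assumed unbiased with
    unit variance, so the mean of n ratings has precision n; the automated
    metric contributes with precision 1 / metric_error_var."""
    human = np.asarray(human_ratings, dtype=float)
    w_human = len(human)               # precision of the human mean
    w_metric = 1.0 / metric_error_var  # precision of the automated score
    return (w_human * human.mean() + w_metric * metric_score) / (w_human + w_metric)

# Example: three human ratings and one automated score with assumed variance 4.0.
print(pooled_rating([3.0, 4.0, 3.5], metric_score=4.2, metric_error_var=4.0))
```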
arXiv Detail & Related papers (2023-06-06T17:09:29Z)
- ICE-Score: Instructing Large Language Models to Evaluate Code [7.556444391696562]
We propose ICE-Score, a new evaluation metric that instructs large language models to assess code.
Our metric addresses the limitations of existing approaches by achieving superior correlations with functional correctness and human preferences.
Our results demonstrate that our metric surpasses state-of-the-art metrics for code generation.
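ICE-Score's core pattern is prompting an LLM to grade generated code; its actual prompts, criteria, and models are detailed in the paper. The sketch below shows only that generic LLM-as-judge pattern, with a hypothetical `call_llm` stub standing in for whichever chat API is used.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a chat-completion API call."""
    raise NotImplementedError

def judge_code(problem: str, code: str) -> int:
    """Generic LLM-as-judge sketch (not ICE-Score's actual prompt):
    ask the model to rate functional correctness on a 0-4 scale."""
    prompt = (
        "You are grading generated code.\n"
        f"Task description:\n{problem}\n\nCandidate code:\n{code}\n\n"
        "Rate functional correctness from 0 (incorrect) to 4 (fully correct). "
        "Reply with a single integer."
    )
    return int(call_llm(prompt).strip())
```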
arXiv Detail & Related papers (2023-04-27T16:38:17Z)
- Large Language Models are Diverse Role-Players for Summarization Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most automatic evaluation methods, such as BLEU/ROUGE, may not adequately capture the above dimensions.
We propose a new LLM-based evaluation framework that compares generated text and reference text from both objective and subjective aspects.
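One way to picture the role-player idea: query the same judge model under several personas, each responsible for a few objective or subjective dimensions, and average the per-dimension ratings. The roles, dimensions, and `rate` stub below are hypothetical illustrations, not the paper's configuration.

```python
from statistics import mean

# Hypothetical personas and the dimensions they rate (not the paper's setup).
ROLES = {
    "fact_checker": ["grammar", "factual_correctness"],     # objective
    "general_reader": ["informativeness", "succinctness"],  # subjective
}

def rate(role: str, dimension: str, summary: str, reference: str) -> float:
    """Stand-in for an LLM call that rates one dimension from one persona."""
    raise NotImplementedError

def role_play_score(summary: str, reference: str) -> float:
    """Average all per-role, per-dimension ratings into a single score."""
    return mean(rate(role, dim, summary, reference)
                for role, dims in ROLES.items() for dim in dims)
```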
arXiv Detail & Related papers (2023-03-27T10:40:59Z)
- On the Blind Spots of Model-Based Evaluation Metrics for Text Generation [79.01422521024834]
We explore a useful but often neglected methodology for robustness analysis of text generation evaluation metrics.
We design and synthesize a wide range of potential errors and check whether they result in a commensurate drop in the metric scores.
Our experiments reveal interesting insensitivities, biases, or even loopholes in existing metrics.
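The methodology is easy to reproduce in outline: apply a controlled corruption to a high-quality text and check whether the metric's score drops by a commensurate amount; if it barely moves, that perturbation is a candidate blind spot. The `metric_score` stub and the toy negation-insertion perturbation below are hypothetical, not the paper's error taxonomy.

```python
def metric_score(candidate: str, reference: str) -> float:
    """Hypothetical stand-in for any model-based evaluation metric."""
    raise NotImplementedError

def insert_negation(text: str) -> str:
    """Toy perturbation: crudely inserts a negation token (illustration only)."""
    words = text.split()
    return " ".join(words[:1] + ["not"] + words[1:])

def is_blind_spot(reference: str, min_drop: float = 0.05) -> bool:
    """Flag a blind spot if corrupting the candidate barely lowers its score."""
    clean = metric_score(reference, reference)
    corrupted = metric_score(insert_negation(reference), reference)
    return (clean - corrupted) < min_drop
```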
arXiv Detail & Related papers (2022-12-20T06:24:25Z)
- Discover, Explanation, Improvement: An Automatic Slice Detection Framework for Natural Language Processing [72.14557106085284]
Slice detection models (SDMs) automatically identify underperforming groups of data points.
This paper proposes a benchmark named "Discover, Explain, Improve (DEIM)" for classification NLP tasks.
Our evaluation shows that Edisa can accurately select error-prone data points with informative semantic features.
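In its simplest form, slice detection groups evaluation examples by a candidate feature and surfaces groups whose accuracy falls well below the overall average; Edisa does this with learned semantic features, which the sketch below does not attempt. The dictionary keys and threshold are hypothetical.

```python
from collections import defaultdict

def underperforming_slices(examples, margin: float = 0.10):
    """examples: dicts with hypothetical keys 'slice' (group id) and
    'correct' (bool). Returns slices whose accuracy is at least `margin`
    below the overall accuracy."""
    groups = defaultdict(list)
    for ex in examples:
        groups[ex["slice"]].append(ex["correct"])
    overall = sum(ex["correct"] for ex in examples) / len(examples)
    return {name: sum(flags) / len(flags)
            for name, flags in groups.items()
            if sum(flags) / len(flags) < overall - margin}
```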
arXiv Detail & Related papers (2022-11-08T19:00:00Z)
- Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis [79.18261352971284]
We introduce SESCORE, a model-based metric that is highly correlated with human judgements without requiring human annotation.
We evaluate SESCORE against existing metrics by comparing how their scores correlate with human ratings.
SESCORE even achieves comparable performance to the best supervised metric COMET, despite receiving no human-annotated training data.
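The central trick, synthesizing errors of graded severity to get (corrupted text, pseudo-score) training pairs without human labels, can be sketched as follows; the two toy perturbations and their penalties are hypothetical placeholders, not SESCORE's stratified synthesis pipeline.

```python
import random

def drop_random_word(text: str) -> str:
    words = text.split()
    if len(words) > 1:
        words.pop(random.randrange(len(words)))
    return " ".join(words)

def swap_adjacent_words(text: str) -> str:
    words = text.split()
    if len(words) > 1:
        i = random.randrange(len(words) - 1)
        words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)

# Hypothetical severity penalties per synthetic error (not SESCORE's values).
PERTURBATIONS = [(drop_random_word, -1.0), (swap_adjacent_words, -0.5)]

def synthesize_pair(reference: str, n_edits: int = 2):
    """Apply a few synthetic edits, accumulating a pseudo-quality score;
    the resulting pairs can train a regression metric without human labels."""
    text, score = reference, 0.0
    for _ in range(n_edits):
        perturb, penalty = random.choice(PERTURBATIONS)
        text, score = perturb(text), score + penalty
    return text, score
```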
arXiv Detail & Related papers (2022-10-10T22:30:26Z)
- CTRLEval: An Unsupervised Reference-Free Metric for Evaluating Controlled Text Generation [85.03709740727867]
We propose an unsupervised reference-free metric called CTRLEval to evaluate controlled text generation models.
CTRLEval assembles the generation probabilities from a pre-trained language model without any model training.
Experimental results show that our metric has higher correlations with human judgments than other baselines.
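Reference-free scoring from a frozen LM comes down to reading off generation probabilities rather than training anything; CTRLEval assembles such probabilities for specific evaluation aspects. The sketch below shows only a generic length-normalized log-probability computed with Hugging Face `transformers` (assumed installed), not CTRLEval's aspect-specific scores.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def avg_log_prob(text: str) -> float:
    """Length-normalized log-probability of `text` under a frozen LM;
    higher (less negative) suggests more fluent, more probable text."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    return -model(ids, labels=ids).loss.item()  # loss = mean token NLL

print(avg_log_prob("The generated response stays on the requested topic."))
```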
arXiv Detail & Related papers (2022-04-02T13:42:49Z)
- GRUEN for Evaluating Linguistic Quality of Generated Text [17.234442722611803]
We propose GRUEN for evaluating Grammaticality, non-Redundancy, focUs, structure and coherENce of generated text.
GRUEN utilizes a BERT-based model and a class of syntactic, semantic, and contextual features to examine the system output.
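GRUEN combines a BERT-based grammaticality signal with non-redundancy, focus, and structure/coherence features into one linguistic-quality score. The toy non-redundancy feature below only illustrates the flavor of such surface features; it is not GRUEN's actual implementation, and the overlap threshold is an assumption.

```python
def toy_non_redundancy(text: str) -> float:
    """Toy feature (not GRUEN's): fraction of sentences whose word set does
    not heavily overlap any earlier sentence, as a proxy for non-redundancy."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    seen, fresh = [], 0
    for sentence in sentences:
        words = set(sentence.lower().split())
        if all(len(words & prev) / max(len(words), 1) < 0.8 for prev in seen):
            fresh += 1
        seen.append(words)
    return fresh / max(len(sentences), 1)

print(toy_non_redundancy("The cat sat. The cat sat. A dog barked."))
```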
arXiv Detail & Related papers (2020-10-06T05:59:25Z)