On the Limitations of Reference-Free Evaluations of Generated Text
- URL: http://arxiv.org/abs/2210.12563v1
- Date: Sat, 22 Oct 2022 22:12:06 GMT
- Title: On the Limitations of Reference-Free Evaluations of Generated Text
- Authors: Daniel Deutsch and Rotem Dror and Dan Roth
- Abstract summary: We show that reference-free metrics are inherently biased and limited in their ability to evaluate generated text.
We argue that they should not be used to measure progress on tasks like machine translation or summarization.
- Score: 64.81682222169113
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There is significant interest in developing evaluation metrics which
accurately estimate the quality of generated text without the aid of a
human-written reference text, which can be time consuming and expensive to
collect or entirely unavailable in online applications. However, in this work,
we demonstrate that these reference-free metrics are inherently biased and
limited in their ability to evaluate generated text, and we argue that they
should not be used to measure progress on tasks like machine translation or
summarization. We show how reference-free metrics are equivalent to using one
generation model to evaluate another, which has several limitations: (1) the
metrics can be optimized at test time to find the approximate best-possible
output, (2) they are inherently biased toward models which are more similar to
the metric's own underlying model, and (3) they can be biased against higher-quality outputs, including
those written by humans. Therefore, we recommend that reference-free metrics
should be used as diagnostic tools for analyzing and understanding model
behavior instead of measures of how well models perform a task, in which the
goal is to achieve as high a score as possible.
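To make point (1) concrete, here is a minimal sketch, not code from the paper, assuming the reference-free metric is simply the average token log-probability under a pretrained causal language model (gpt2 is a stand-in choice, and the helper names reference_free_score and pick_best are illustrative): a system can rerank its own candidates at test time to maximize the score, and the "best" output is whatever the scoring model itself finds most probable.

```python
# Minimal sketch (not the paper's code) of point (1): if a reference-free
# metric is just a language model's likelihood, a system can rerank its own
# candidates at test time to maximize the score. The scoring model (gpt2) and
# the helper names below are illustrative assumptions, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # stand-in scoring LM; any causal LM would do for the sketch
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def reference_free_score(text: str) -> float:
    """Average token log-probability under the scoring LM (higher = 'better')."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)  # loss is mean negative log-likelihood
    return -out.loss.item()

def pick_best(candidates: list[str]) -> str:
    """Test-time 'optimization': return the candidate the metric scores highest."""
    return max(candidates, key=reference_free_score)

candidates = [
    "The cat sat on the mat.",                   # fluent but generic
    "A calico cat curled up on the woven mat.",  # more specific, less probable
]
print(pick_best(candidates))
# A likelihood-based metric will often favor the generic, high-probability
# output, illustrating how such metrics can be biased against better text.
```

Because the score is just the judging model's own probability, a generator that mimics the judge is rewarded regardless of actual quality, which is the equivalence the abstract describes.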
Related papers
- What is the Best Automated Metric for Text to Motion Generation? [19.71712698183703]
There is growing interest in generating skeleton-based human motions from natural language descriptions.
Human evaluation is the ultimate accuracy measure for this task, and automated metrics should correlate well with human quality judgments.
This paper systematically studies which metrics best align with human evaluations and proposes new metrics that align even better.
arXiv Detail & Related papers (2023-09-19T01:59:54Z)
- ICE-Score: Instructing Large Language Models to Evaluate Code [7.556444391696562]
We propose ICE-Score, a new evaluation metric that instructs large language models to assess code.
Our metric addresses the limitations of existing approaches by achieving superior correlations with functional correctness and human preferences.
Our results demonstrate that our metric surpasses state-of-the-art metrics for code generation.
arXiv Detail & Related papers (2023-04-27T16:38:17Z)
- ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning [63.77667876176978]
Large language models show improved downstream task interpretability when prompted to generate step-by-step reasoning to justify their final answers.
These reasoning steps greatly improve model interpretability and verification, but objectively studying their correctness is difficult.
We present ROSCOE, a suite of interpretable, unsupervised automatic scores that improve and extend previous text generation evaluation metrics.
arXiv Detail & Related papers (2022-12-15T15:52:39Z)
- T5Score: Discriminative Fine-tuning of Generative Evaluation Metrics [94.69907794006826]
We present a framework that combines the best of both worlds, using both supervised and unsupervised signals from whatever data we have available.
We operationalize this idea by training T5Score, a metric that uses these training signals with mT5 as the backbone.
T5Score achieves the best performance on all datasets against existing top-scoring metrics at the segment level.
arXiv Detail & Related papers (2022-12-12T06:29:04Z)
- SMART: Sentences as Basic Units for Text Evaluation [48.5999587529085]
In this paper, we introduce a new metric called SMART to mitigate such limitations.
We treat sentences as basic units of matching instead of tokens, and use a sentence matching function to soft-match candidate and reference sentences.
Our results show that the system-level correlations of our proposed metric with a model-based matching function outperform those of all competing metrics.
arXiv Detail & Related papers (2022-08-01T17:58:05Z)
- TRUE: Re-evaluating Factual Consistency Evaluation [29.888885917330327]
We introduce TRUE: a comprehensive study of factual consistency metrics on a standardized collection of existing texts from diverse tasks.
Our standardization enables an example-level meta-evaluation protocol that is more actionable and interpretable than previously reported correlations.
Across diverse state-of-the-art metrics and 11 datasets we find that large-scale NLI and question generation-and-answering-based approaches achieve strong and complementary results.
arXiv Detail & Related papers (2022-04-11T10:14:35Z)
- CTRLEval: An Unsupervised Reference-Free Metric for Evaluating Controlled Text Generation [85.03709740727867]
We propose CTRLEval, an unsupervised reference-free metric for evaluating controlled text generation models.
CTRLEval assembles the generation probabilities from a pre-trained language model without any model training.
Experimental results show that our metric has higher correlations with human judgments than other baselines.
arXiv Detail & Related papers (2022-04-02T13:42:49Z)
- Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand [117.62186420147563]
We propose a generalization of leaderboards, bidimensional leaderboards (Billboards).
Unlike conventional unidimensional leaderboards that sort submitted systems by predetermined metrics, a Billboard accepts both generators and evaluation metrics as competing entries.
We demonstrate that a linear ensemble of a few diverse metrics sometimes substantially outperforms existing metrics in isolation.
arXiv Detail & Related papers (2021-12-08T06:34:58Z)
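As a concrete illustration of the linear-ensemble idea in the Billboards entry above, here is a minimal sketch, not the Billboards implementation: the three metric-score columns, the human ratings, and therefore the fitted weights are placeholder toy values.

```python
# Minimal sketch of a linear ensemble of automatic metrics, in the spirit of the
# Billboards finding that a weighted combination of a few diverse metrics can
# track human judgments better than any single metric. All numbers are toy
# placeholders, not results from the paper.
import numpy as np
from sklearn.linear_model import LinearRegression

# Rows = system outputs, columns = scores from three different automatic metrics.
metric_scores = np.array([
    [0.62, 0.41, 0.75],
    [0.55, 0.39, 0.70],
    [0.71, 0.52, 0.80],
    [0.48, 0.30, 0.65],
])
human_scores = np.array([3.1, 2.8, 4.0, 2.2])  # human quality ratings per output

# Fit ensemble weights by regressing human ratings on the metric scores.
ensemble = LinearRegression().fit(metric_scores, human_scores)
print(ensemble.coef_, ensemble.intercept_)   # learned per-metric weights
print(ensemble.predict(metric_scores[:1]))   # ensemble score for the first output
```

Ordinary least squares is used here only for simplicity; the point from the entry above is that combining diverse metric signals can outperform any individual metric in isolation.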
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.