T5Score: Discriminative Fine-tuning of Generative Evaluation Metrics
- URL: http://arxiv.org/abs/2212.05726v1
- Date: Mon, 12 Dec 2022 06:29:04 GMT
- Title: T5Score: Discriminative Fine-tuning of Generative Evaluation Metrics
- Authors: Yiwei Qin, Weizhe Yuan, Graham Neubig, Pengfei Liu
- Abstract summary: We present a framework that combines the best of both worlds, using both supervised and unsupervised signals from whatever data we have available.
We operationalize this idea by training T5Score, a metric that uses these training signals with mT5 as the backbone.
- T5Score achieves the best segment-level performance on all datasets, outperforming existing top-scoring metrics.
- Score: 94.69907794006826
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Modern embedding-based metrics for evaluation of generated text generally
fall into one of two paradigms: discriminative metrics that are trained to
directly predict which outputs are of higher quality according to supervised
human annotations, and generative metrics that are trained to evaluate text
based on the probabilities of a generative model. Both have their advantages;
discriminative metrics are able to directly optimize for the problem of
distinguishing between good and bad outputs, while generative metrics can be
trained using abundant raw text. In this paper, we present a framework that
combines the best of both worlds, using both supervised and unsupervised
signals from whatever data we have available. We operationalize this idea by
training T5Score, a metric that uses these training signals with mT5 as the
backbone. We perform an extensive empirical comparison with other existing
metrics on 5 datasets, 19 languages and 280 systems, demonstrating the utility
of our method. Experimental results show that T5Score achieves the best
performance on all datasets against existing top-scoring metrics at the segment
level. We release our code and models at https://github.com/qinyiwei/T5Score.
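For intuition only, here is a minimal sketch of how the two training signals described above could be combined on top of mT5 using the Hugging Face transformers API: an unsupervised generative signal (length-normalized log-likelihood of reference text) plus a supervised discriminative signal (a margin ranking loss over human preference pairs). The checkpoint name, margin value, loss weighting, and training loop are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (not the authors' implementation) of combining a generative
# likelihood signal with a discriminative ranking signal on one mT5 backbone.
# Checkpoint, margin, and loss weighting are illustrative assumptions.
import torch
from transformers import AutoTokenizer, MT5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def generative_score(source: str, hypothesis: str) -> torch.Tensor:
    """Length-normalized log-likelihood of `hypothesis` given `source`."""
    enc = tokenizer(source, return_tensors="pt")
    labels = tokenizer(hypothesis, return_tensors="pt").input_ids
    # `loss` is the mean token-level cross-entropy; its negation is the
    # average log-probability per target token.
    return -model(**enc, labels=labels).loss

def ranking_loss(source: str, better: str, worse: str,
                 margin: float = 0.1) -> torch.Tensor:
    """Discriminative signal: the human-preferred output should score higher."""
    return torch.clamp(margin - (generative_score(source, better)
                                 - generative_score(source, worse)), min=0.0)

def training_step(source, reference, better=None, worse=None):
    # Unsupervised (generative) signal: maximize likelihood of the reference.
    loss = -generative_score(source, reference)
    # Supervised (discriminative) signal, used whenever preference pairs exist.
    if better is not None and worse is not None:
        loss = loss + ranking_loss(source, better, worse)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In practice the two losses would be batched and weighted, but the sketch shows how both objectives can share a single encoder-decoder backbone.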
Related papers
- TIGERScore: Towards Building Explainable Metric for All Text Generation Tasks [44.801746603656504]
We present TIGERScore, a metric that follows Instruction Guidance to perform Explainable and Reference-free evaluation.
Our metric is based on LLaMA-2, trained on our meticulously curated instruction-tuning dataset MetricInstruct.
arXiv Detail & Related papers (2023-10-01T18:01:51Z)
- Generating and Imputing Tabular Data via Diffusion and Flow-based Gradient-Boosted Trees [11.732842929815401]
Tabular data is hard to acquire and is subject to missing values.
This paper introduces a novel approach for generating and imputing mixed-type (continuous and categorical) data.
In contrast to prior methods that rely on neural networks to learn the score function or the vector field, we adopt XGBoost.
arXiv Detail & Related papers (2023-09-18T17:49:09Z)
- Evaluating Factual Consistency of Texts with Semantic Role Labeling [3.1776833268555134]
We introduce SRLScore, a reference-free evaluation metric designed with text summarization in mind.
A final factuality score is computed by an adjustable scoring mechanism.
Correlation with human judgments on English summarization datasets shows that SRLScore is competitive with state-of-the-art methods.
arXiv Detail & Related papers (2023-05-22T17:59:42Z)
- On the Blind Spots of Model-Based Evaluation Metrics for Text Generation [79.01422521024834]
We explore a useful but often neglected methodology for robustness analysis of text generation evaluation metrics.
We design and synthesize a wide range of potential errors and check whether they result in a commensurate drop in the metric scores.
Our experiments reveal interesting insensitivities, biases, or even loopholes in existing metrics (a small stress-test sketch in this spirit follows below).
arXiv Detail & Related papers (2022-12-20T06:24:25Z)
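To make the stress-testing idea above concrete, the following toy example perturbs a hypothesis with a synthetic meaning-flipping error and checks whether a metric's score drops accordingly. The negation perturbation and the token-overlap stand-in metric are illustrative assumptions, not the paper's actual error taxonomy or evaluated metrics.

```python
# Illustrative stress test in the spirit of the blind-spots paper above:
# apply a synthetic error to a hypothesis and check whether a metric's score
# drops. The negation perturbation and the toy token-F1 "metric" are
# placeholders, not the paper's actual perturbations or evaluated metrics.

def token_f1(reference: str, hypothesis: str) -> float:
    """Toy surface-overlap metric used only as a stand-in."""
    ref, hyp = set(reference.lower().split()), set(hypothesis.lower().split())
    overlap = len(ref & hyp)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(hyp), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def negate(hypothesis: str) -> str:
    """Synthetic error: flip the meaning by inserting a negation."""
    return hypothesis.replace(" is ", " is not ", 1)

reference = "The vaccine is effective against the new variant."
hypothesis = "The vaccine is effective against the new variant."
perturbed = negate(hypothesis)

clean_score = token_f1(reference, hypothesis)
perturbed_score = token_f1(reference, perturbed)

# A robust metric should penalize the meaning-flipping error substantially;
# surface-overlap metrics barely move, which is exactly the kind of blind
# spot this methodology is designed to expose.
print(f"clean: {clean_score:.3f}  negated: {perturbed_score:.3f}  "
      f"drop: {clean_score - perturbed_score:.3f}")
```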
- On the Limitations of Reference-Free Evaluations of Generated Text [64.81682222169113]
We show that reference-free metrics are inherently biased and limited in their ability to evaluate generated text.
We argue that they should not be used to measure progress on tasks like machine translation or summarization.
arXiv Detail & Related papers (2022-10-22T22:12:06Z)
- Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis [79.18261352971284]
We introduce SESCORE, a model-based metric that is highly correlated with human judgements without requiring human annotation.
We evaluate SESCORE against existing metrics by comparing how their scores correlate with human ratings.
SESCORE even achieves comparable performance to the best supervised metric COMET, despite receiving no human-annotated training data.
arXiv Detail & Related papers (2022-10-10T22:30:26Z)
- SMART: Sentences as Basic Units for Text Evaluation [48.5999587529085]
In this paper, we introduce a new metric called SMART to mitigate the limitations of token-level matching.
We treat sentences as basic units of matching instead of tokens, and use a sentence matching function to soft-match candidate and reference sentences.
Our results show that, with a model-based matching function, the system-level correlations of our metric exceed those of all competing metrics (a rough sketch of sentence-level soft matching follows below).
arXiv Detail & Related papers (2022-08-01T17:58:05Z)
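As a rough, dependency-free illustration of sentence-level soft matching in the spirit of SMART (not the paper's exact formulation), the sketch below matches each candidate sentence to its best reference sentence and vice versa, then aggregates the matches into an F1-style score. The Jaccard similarity is a placeholder for SMART's sentence matching functions, which can also be model-based.

```python
# Rough sketch of sentence-level soft matching in the spirit of SMART
# (not the paper's exact formulation): each candidate sentence is matched
# to its best-scoring reference sentence and vice versa, and the matches
# are aggregated into an F1-style score. The similarity function here is
# a simple token-overlap stand-in.
from typing import Callable, List

def token_overlap(a: str, b: str) -> float:
    """Placeholder sentence similarity: Jaccard overlap of token sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def soft_match(cand_sents: List[str], ref_sents: List[str],
               sim: Callable[[str, str], float] = token_overlap) -> float:
    # Precision: how well each candidate sentence is covered by the reference.
    precision = sum(max(sim(c, r) for r in ref_sents)
                    for c in cand_sents) / len(cand_sents)
    # Recall: how well each reference sentence is covered by the candidate.
    recall = sum(max(sim(r, c) for c in cand_sents)
                 for r in ref_sents) / len(ref_sents)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

candidate = ["The storm closed the airport.", "Flights resume on Monday."]
reference = ["The airport was shut down by the storm.", "Service restarts Monday."]
print(f"sentence-level soft-match F1: {soft_match(candidate, reference):.3f}")
```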
- GEMv2: Multilingual NLG Benchmarking in a Single Line of Code [161.1761414080574]
The Generation, Evaluation, and Metrics benchmark (GEMv2) introduces a modular infrastructure for dataset, model, and metric developers.
GEMv2 supports 40 documented datasets in 51 languages.
Models for all datasets can be evaluated online and our interactive data card creation and rendering tools make it easier to add new datasets to the living benchmark.
arXiv Detail & Related papers (2022-06-22T17:52:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.