T5Score: Discriminative Fine-tuning of Generative Evaluation Metrics
- URL: http://arxiv.org/abs/2212.05726v1
- Date: Mon, 12 Dec 2022 06:29:04 GMT
- Title: T5Score: Discriminative Fine-tuning of Generative Evaluation Metrics
- Authors: Yiwei Qin, Weizhe Yuan, Graham Neubig, Pengfei Liu
- Abstract summary: We present a framework that combines the best of both worlds, using both supervised and unsupervised signals from whatever data we have available.
We operationalize this idea by training T5Score, a metric that uses these training signals with mT5 as the backbone.
T5Score achieves the best performance on all datasets against existing top-scoring metrics at the segment level.
- Score: 94.69907794006826
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract:   Modern embedding-based metrics for evaluation of generated text generally
fall into one of two paradigms: discriminative metrics that are trained to
directly predict which outputs are of higher quality according to supervised
human annotations, and generative metrics that are trained to evaluate text
based on the probabilities of a generative model. Both have their advantages;
discriminative metrics are able to directly optimize for the problem of
distinguishing between good and bad outputs, while generative metrics can be
trained using abundant raw text. In this paper, we present a framework that
combines the best of both worlds, using both supervised and unsupervised
signals from whatever data we have available. We operationalize this idea by
training T5Score, a metric that uses these training signals with mT5 as the
backbone. We perform an extensive empirical comparison with other existing
metrics on 5 datasets, 19 languages and 280 systems, demonstrating the utility
of our method. Experimental results show that T5Score achieves the best
performance on all datasets against existing top-scoring metrics at the segment
level. We release our code and models at https://github.com/qinyiwei/T5Score.
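
The two training signals contrasted in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes, for illustration only, that the generative signal is a length-normalized log-likelihood and that the discriminative signal is a pairwise hinge loss over that score. The function names, the margin value, and the token probabilities are all hypothetical.

```python
import math


def generative_score(token_logprobs):
    """Length-normalized log-likelihood of an output under a generative model.

    Assumption: the per-token log-probabilities come from some seq2seq model;
    here they are supplied by hand.
    """
    return sum(token_logprobs) / len(token_logprobs)


def pairwise_margin_loss(score_better, score_worse, margin=0.1):
    """Hinge loss on a pair of outputs ranked by human annotation.

    The loss is zero once the output humans rated higher outscores the
    lower-rated one by at least `margin` (a hypothetical value).
    """
    return max(0.0, margin - (score_better - score_worse))


# Hypothetical per-token log-probabilities for two candidate outputs.
better = [math.log(0.5), math.log(0.4), math.log(0.5)]
worse = [math.log(0.2), math.log(0.1), math.log(0.25)]

s_better = generative_score(better)
s_worse = generative_score(worse)

# Ranking agrees with the human preference: no penalty.
loss_ok = pairwise_margin_loss(s_better, s_worse)
# Ranking contradicts the human preference: positive penalty.
loss_violated = pairwise_margin_loss(s_worse, s_better)
```

In this sketch the generative score needs only raw text to train the underlying model, while the pairwise loss needs human-ranked pairs, which mirrors the trade-off the abstract describes.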
 
      
Related papers
        - UnitCoder: Scalable Iterative Code Synthesis with Unit Test Guidance [65.01483640267885]
 Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, yet code generation remains a major challenge.
We introduce UnitCoder, a systematic pipeline leveraging model-generated unit tests to guide and validate the code generation process.
Our work presents a scalable approach that leverages model-generated unit tests to guide the synthesis of high-quality code data from pre-training corpora.
 arXiv (2025-02-17T05:37:02Z)
- TIGERScore: Towards Building Explainable Metric for All Text Generation Tasks [44.801746603656504]
 We present TIGERScore, a metric that follows Instruction Guidance to perform Explainable and Reference-free evaluation.
Our metric is based on LLaMA-2, trained on our meticulously curated instruction-tuning dataset MetricInstruct.
 arXiv (2023-10-01T18:01:51Z)
- Generating and Imputing Tabular Data via Diffusion and Flow-based Gradient-Boosted Trees [11.732842929815401]
 Tabular data is hard to acquire and is subject to missing values.
This paper introduces a novel approach for generating and imputing mixed-type (continuous and categorical) data.
In contrast to prior methods that rely on neural networks to learn the score function or the vector field, we adopt XGBoost.
 arXiv (2023-09-18T17:49:09Z)
- Evaluating Factual Consistency of Texts with Semantic Role Labeling [3.1776833268555134]
 We introduce SRLScore, a reference-free evaluation metric designed with text summarization in mind.
A final factuality score is computed by an adjustable scoring mechanism.
Correlation with human judgments on English summarization datasets shows that SRLScore is competitive with state-of-the-art methods.
 arXiv (2023-05-22T17:59:42Z)
- On the Blind Spots of Model-Based Evaluation Metrics for Text Generation [79.01422521024834]
 We explore a useful but often neglected methodology for robustness analysis of text generation evaluation metrics.
We design and synthesize a wide range of potential errors and check whether they result in a commensurate drop in the metric scores.
Our experiments reveal interesting insensitivities, biases, or even loopholes in existing metrics.
 arXiv (2022-12-20T06:24:25Z)
- On the Limitations of Reference-Free Evaluations of Generated Text [64.81682222169113]
 We show that reference-free metrics are inherently biased and limited in their ability to evaluate generated text.
We argue that they should not be used to measure progress on tasks like machine translation or summarization.
 arXiv (2022-10-22T22:12:06Z)
- Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis [79.18261352971284]
 We introduce SESCORE, a model-based metric that is highly correlated with human judgements without requiring human annotation.
We evaluate SESCORE against existing metrics by comparing how their scores correlate with human ratings.
 SESCORE even achieves comparable performance to the best supervised metric COMET, despite receiving no human-annotated training data.
 arXiv (2022-10-10T22:30:26Z)
- SMART: Sentences as Basic Units for Text Evaluation [48.5999587529085]
 In this paper, we introduce a new metric called SMART to mitigate such limitations.
We treat sentences as basic units of matching instead of tokens, and use a sentence matching function to soft-match candidate and reference sentences.
Our results show that system-level correlations of our proposed metric with a model-based matching function outperform all competing metrics.
 arXiv (2022-08-01T17:58:05Z)
- GEMv2: Multilingual NLG Benchmarking in a Single Line of Code [161.1761414080574]
 Generation, Evaluation, and Metrics Benchmark introduces a modular infrastructure for dataset, model, and metric developers.
 GEMv2 supports 40 documented datasets in 51 languages.
Models for all datasets can be evaluated online and our interactive data card creation and rendering tools make it easier to add new datasets to the living benchmark.
 arXiv (2022-06-22T17:52:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
       
     