Reproducing the Metric-Based Evaluation of a Set of Controllable Text Generation Techniques
- URL: http://arxiv.org/abs/2405.07875v1
- Date: Mon, 13 May 2024 16:02:57 GMT
- Title: Reproducing the Metric-Based Evaluation of a Set of Controllable Text Generation Techniques
- Authors: Michela Lorandi, Anya Belz
- Abstract summary: Rerunning a metric-based evaluation should be more straightforward, and its results closer to the original, than rerunning a human-based evaluation. As this report shows, however, such reruns do not always reproduce the original results, and can reveal errors in the reporting of the original work.
- Score: 5.011488335517782
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Rerunning a metric-based evaluation should be more straightforward, and results should be closer, than in a human-based evaluation, especially where code and model checkpoints are made available by the original authors. As this report of our efforts to rerun a metric-based evaluation of a set of single-attribute and multiple-attribute controllable text generation (CTG) techniques shows, however, such reruns of evaluations do not always produce results that are the same as the original results, and can reveal errors in the reporting of the original work.
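The comparison the abstract describes can be sketched as a simple tolerance check between originally reported metric scores and rerun scores. This is a minimal illustration, not code from the paper; the metric names, values, and tolerance are all hypothetical.

```python
def compare_scores(reported, rerun, tol=0.005):
    """Map each metric name to (reported, rerun, within_tolerance).

    A missing rerun value counts as a mismatch, since a rerun that
    cannot produce the metric at all is itself a reproduction failure.
    """
    report = {}
    for metric, orig in reported.items():
        new = rerun.get(metric)
        ok = new is not None and abs(orig - new) <= tol
        report[metric] = (orig, new, ok)
    return report

# Illustrative numbers only: a near-identical rerun for one metric,
# and a discrepancy beyond tolerance for another.
reported = {"perplexity": 12.34, "distinct-2": 0.81}
rerun = {"perplexity": 12.341, "distinct-2": 0.79}

results = compare_scores(reported, rerun)
mismatches = [m for m, (_, _, ok) in results.items() if not ok]
```

In this sketch `mismatches` would contain only `"distinct-2"`; in practice the interesting editorial question is what tolerance is defensible for each metric, since nondeterminism, library versions, and hardware can all shift scores slightly.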
Related papers
- CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation [87.44350003888646]
Eval-Instruct can acquire pointwise grading critiques with pseudo references and revise these critiques via multi-path prompting.
CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines.
arXiv Detail & Related papers (2023-11-30T16:52:42Z)
- MISMATCH: Fine-grained Evaluation of Machine-generated Text with Mismatch Error Types [68.76742370525234]
We propose a new evaluation scheme to model human judgments in 7 NLP tasks, based on the fine-grained mismatches between a pair of texts.
Inspired by the recent efforts in several NLP tasks for fine-grained evaluation, we introduce a set of 13 mismatch error types.
We show that the mismatch errors between the sentence pairs on the held-out datasets from 7 NLP tasks align well with the human evaluation.
arXiv Detail & Related papers (2023-06-18T01:38:53Z)
- On the Blind Spots of Model-Based Evaluation Metrics for Text Generation [79.01422521024834]
We explore a useful but often neglected methodology for robustness analysis of text generation evaluation metrics.
We design and synthesize a wide range of potential errors and check whether they result in a commensurate drop in the metric scores.
Our experiments reveal interesting insensitivities, biases, or even loopholes in existing metrics.
arXiv Detail & Related papers (2022-12-20T06:24:25Z)
- On the Limitations of Reference-Free Evaluations of Generated Text [64.81682222169113]
We show that reference-free metrics are inherently biased and limited in their ability to evaluate generated text.
We argue that they should not be used to measure progress on tasks like machine translation or summarization.
arXiv Detail & Related papers (2022-10-22T22:12:06Z)
- Spurious Correlations in Reference-Free Evaluation of Text Generation [35.80256755393739]
We show that reference-free evaluation metrics of summarization and dialog generation may be relying on spurious correlations with measures such as word overlap, perplexity, and length.
We demonstrate that these errors can be mitigated by explicitly designing evaluation metrics to avoid spurious features in reference-free evaluation.
arXiv Detail & Related papers (2022-04-21T05:32:38Z)
- TRUE: Re-evaluating Factual Consistency Evaluation [29.888885917330327]
We introduce TRUE: a comprehensive study of factual consistency metrics on a standardized collection of existing texts from diverse tasks.
Our standardization enables an example-level meta-evaluation protocol that is more actionable and interpretable than previously reported correlations.
Across diverse state-of-the-art metrics and 11 datasets we find that large-scale NLI and question generation-and-answering-based approaches achieve strong and complementary results.
arXiv Detail & Related papers (2022-04-11T10:14:35Z)
- CTRLEval: An Unsupervised Reference-Free Metric for Evaluating Controlled Text Generation [85.03709740727867]
We propose an unsupervised reference-free metric called CTRLEval to evaluate controlled text generation models.
CTRLEval assembles the generation probabilities from a pre-trained language model without any model training.
Experimental results show that our metric has higher correlations with human judgments than other baselines.
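The general idea behind assembling generation probabilities into a reference-free score can be illustrated in miniature. This is not CTRLEval itself: a real metric would query a pre-trained language model, whereas here a tiny hand-made unigram table stands in so the sketch is self-contained, and the average log-probability used as the score is an assumption for illustration.

```python
import math

# Hypothetical stand-in for a pre-trained language model's token
# probabilities; a real metric would query an actual LM instead.
TOY_LM = {"the": 0.2, "cat": 0.1, "sat": 0.05, "zzz": 0.0001}

def lm_score(tokens, lm=TOY_LM, floor=1e-6):
    """Average log-probability of the tokens under the (toy) model.

    Tokens unknown to the model fall back to a small floor probability
    so the log is always defined. Higher (less negative) scores mean
    the text is more probable under the model.
    """
    logps = [math.log(lm.get(t, floor)) for t in tokens]
    return sum(logps) / len(logps)

fluent = lm_score(["the", "cat", "sat"])
odd = lm_score(["zzz", "zzz", "zzz"])
```

Here `fluent` scores higher than `odd`, matching the intuition that a reference-free metric built from model probabilities needs no gold reference text: the language model's own probability estimates supply the signal.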
arXiv Detail & Related papers (2022-04-02T13:42:49Z)
- Re-evaluating Evaluation in Text Summarization [77.4601291738445]
We re-evaluate the evaluation method for text summarization using top-scoring system outputs.
We find that conclusions about evaluation metrics on older datasets do not necessarily hold on modern datasets and systems.
arXiv Detail & Related papers (2020-10-14T13:58:53Z)
- CLEval: Character-Level Evaluation for Text Detection and Recognition Tasks [18.25936871944743]
Existing evaluation metrics fail to provide a fair and reliable comparison among text detection and recognition methods.
Based on the fact that character is a key element of text, we propose a Character-Level Evaluation metric (CLEval)
CLEval provides a fine-grained evaluation of end-to-end results combining detection and recognition, as well as individual evaluations of each module from an end-performance perspective.
arXiv Detail & Related papers (2020-06-11T08:12:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.