CoTK: An Open-Source Toolkit for Fast Development and Fair Evaluation of
Text Generation
- URL: http://arxiv.org/abs/2002.00583v1
- Date: Mon, 3 Feb 2020 07:15:29 GMT
- Title: CoTK: An Open-Source Toolkit for Fast Development and Fair Evaluation of
Text Generation
- Authors: Fei Huang, Dazhen Wan, Zhihong Shao, Pei Ke, Jian Guan, Yilin Niu,
Xiaoyan Zhu, Minlie Huang
- Abstract summary: In model development, CoTK helps handle cumbersome issues such as data processing, metric implementation, and reproduction.
In model evaluation, CoTK provides implementations of many commonly used metrics and benchmark models across different experimental settings.
- Score: 91.58324412629477
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In text generation evaluation, many practical issues, such as inconsistent
experimental settings and metric implementations, are often ignored but lead to
unfair evaluation and untenable conclusions. We present CoTK, an open-source
toolkit aiming to support fast development and fair evaluation of text
generation. In model development, CoTK helps handle the cumbersome issues, such
as data processing, metric implementation, and reproduction. It standardizes
the development steps and reduces human errors which may lead to inconsistent
experimental settings. In model evaluation, CoTK provides implementation for
many commonly used metrics and benchmark models across different experimental
settings. As a unique feature, CoTK can signify when and which metric cannot be
fairly compared. We demonstrate that it is convenient to use CoTK for model
development and evaluation, particularly across different experimental
settings.
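The abstract's claim that CoTK can "signify when and which metric cannot be fairly compared" can be made concrete with a small sketch. The Python snippet below is a hypothetical illustration of the idea, not the actual CoTK API: it assumes each reported score records a hash of the experimental settings it was computed under (data file, tokenization, vocabulary cutoff), so that scores obtained under different settings are flagged as not comparable. All names in it (settings_hash, compare, the result dictionaries) are invented for illustration.

```python
# Hypothetical sketch only -- NOT the actual CoTK API.
# It illustrates one way a toolkit can "signify when and which metric cannot
# be fairly compared": every reported score carries a hash of the settings it
# depends on, and scores with different hashes are refused for comparison.
import hashlib
import json


def settings_hash(dataset_path: str, tokenizer: str, min_vocab_freq: int) -> str:
    """Digest everything a metric value depends on: raw data + preprocessing."""
    with open(dataset_path, "rb") as f:
        data_digest = hashlib.sha256(f.read()).hexdigest()
    settings = {
        "data": data_digest,
        "tokenizer": tokenizer,
        "min_vocab_freq": min_vocab_freq,
    }
    return hashlib.sha256(json.dumps(settings, sort_keys=True).encode()).hexdigest()


def compare(result_a: dict, result_b: dict, metric: str = "bleu") -> None:
    """Compare two reported results only if their setting hashes match."""
    if result_a["hash"] != result_b["hash"]:
        print(f"'{metric}' cannot be fairly compared: experimental settings differ.")
        return
    winner = "A" if result_a[metric] >= result_b[metric] else "B"
    print(f"Settings match; model {winner} scores higher on '{metric}'.")


# Hypothetical usage (paths and scores are placeholders):
# a = {"bleu": 32.1, "hash": settings_hash("test.txt", "space", 10)}
# b = {"bleu": 30.5, "hash": settings_hash("test.txt", "bpe", 10)}
# compare(a, b)
```

According to the abstract, CoTK performs this kind of check as part of its standardized pipeline; the sketch only conveys why hashing the data and preprocessing settings is sufficient to detect a mismatch.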
Related papers
- CEval: A Benchmark for Evaluating Counterfactual Text Generation [2.899704155417792]
We propose CEval, a benchmark for comparing counterfactual text generation methods.
Our experiments found no perfect method for generating counterfactual text.
By making CEval available as an open-source Python library, we encourage the community to contribute more methods.
arXiv Detail & Related papers (2024-04-26T15:23:47Z)
- CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation [87.44350003888646]
Eval-Instruct can acquire pointwise grading critiques with pseudo references and revise these critiques via multi-path prompting.
CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines.
arXiv Detail & Related papers (2023-11-30T16:52:42Z)
- On the Blind Spots of Model-Based Evaluation Metrics for Text Generation [79.01422521024834]
We explore a useful but often neglected methodology for robustness analysis of text generation evaluation metrics.
We design and synthesize a wide range of potential errors and check whether they result in a commensurate drop in the metric scores.
Our experiments reveal interesting insensitivities, biases, or even loopholes in existing metrics.
arXiv Detail & Related papers (2022-12-20T06:24:25Z)
- On the Effectiveness of Automated Metrics for Text Generation Systems [4.661309379738428]
We propose a theory that incorporates different sources of uncertainty, such as imperfect automated metrics and insufficiently sized test sets.
The theory has practical applications, such as determining the number of samples needed to reliably distinguish the performance of a set of Text Generation systems.
arXiv Detail & Related papers (2022-10-24T08:15:28Z)
- Out of the BLEU: how should we assess quality of the Code Generation models? [3.699097874146491]
We present a study on the applicability of six metrics -- BLEU, ROUGE-L, METEOR, ChrF, CodeBLEU, and RUBY -- for evaluation of code generation models.
None of the metrics can correctly emulate human judgement on which model is better with >95% certainty if the difference in model scores is less than 5 points.
Our findings suggest that the ChrF metric is a better fit for the evaluation of code generation models than the commonly used BLEU and CodeBLEU (see the metric-computation sketch after this list).
arXiv Detail & Related papers (2022-08-05T13:00:16Z)
- TRUE: Re-evaluating Factual Consistency Evaluation [29.888885917330327]
We introduce TRUE: a comprehensive study of factual consistency metrics on a standardized collection of existing texts from diverse tasks.
Our standardization enables an example-level meta-evaluation protocol that is more actionable and interpretable than previously reported correlations.
Across diverse state-of-the-art metrics and 11 datasets we find that large-scale NLI and question generation-and-answering-based approaches achieve strong and complementary results.
arXiv Detail & Related papers (2022-04-11T10:14:35Z)
- A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach [53.727460222955266]
Temporal Sentence Grounding in Videos (TSGV) aims to ground a natural language sentence in an untrimmed video.
Recent studies have found that current benchmark datasets may have obvious moment annotation biases.
We introduce a new evaluation metric "dR@n,IoU@m" that discounts the basic recall scores to alleviate the inflating evaluation caused by biased datasets.
arXiv Detail & Related papers (2022-03-10T08:58:18Z)
- Just Rank: Rethinking Evaluation with Word and Sentence Similarities [105.5541653811528]
Intrinsic evaluation for embeddings lags far behind, and there has been no significant update in the past decade.
This paper first points out the problems using semantic similarity as the gold standard for word and sentence embedding evaluations.
We propose a new intrinsic evaluation method called EvalRank, which shows a much stronger correlation with downstream tasks.
arXiv Detail & Related papers (2022-03-05T08:40:05Z)
- Are Missing Links Predictable? An Inferential Benchmark for Knowledge Graph Completion [79.07695173192472]
InferWiki improves upon existing benchmarks in inferential ability, assumptions, and patterns.
Each testing sample is predictable with supportive data in the training set.
In experiments, we curate two settings of InferWiki varying in sizes and structures, and apply the construction process on CoDEx as comparative datasets.
arXiv Detail & Related papers (2021-08-03T09:51:15Z)
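For the "Out of the BLEU" entry above, the comparison of surface metrics such as BLEU and ChrF on generated code can be made concrete in a few lines. The sketch below uses the sacrebleu library (an assumption on my part; the paper's own tooling may differ), and the hypothesis/reference strings are toy placeholders, not data from the study.

```python
# Minimal sketch: scoring generated code against references with BLEU and ChrF.
# Uses the sacrebleu library; the hypotheses/references below are toy examples.
import sacrebleu

hypotheses = [
    "def add(a, b):\n    return a + b",
    "for i in range(10): print(i)",
]
references = [
    "def add(x, y):\n    return x + y",
    "for i in range(10):\n    print(i)",
]

# sacrebleu expects a list of reference streams, one stream per reference set.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references])

print(f"BLEU: {bleu.score:.2f}")
print(f"ChrF: {chrf.score:.2f}")
```

BLEU's default tokenization was designed for natural-language text, which is part of why the character-level ChrF metric can behave differently on code and, per the entry above, correlates better with human judgement in that setting.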
This list is automatically generated from the titles and abstracts of the papers on this site.