Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand
- URL: http://arxiv.org/abs/2112.04139v1
- Date: Wed, 8 Dec 2021 06:34:58 GMT
- Title: Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand
- Authors: Jungo Kasai, Keisuke Sakaguchi, Ronan Le Bras, Lavinia Dunagan, Jacob
Morrison, Alexander R. Fabbri, Yejin Choi, Noah A. Smith
- Abstract summary: We propose a generalization of leaderboards, bidimensional leaderboards (Billboards).
Unlike conventional unidimensional leaderboards that sort submitted systems by predetermined metrics, a Billboard accepts both generators and evaluation metrics as competing entries.
We demonstrate that a linear ensemble of a few diverse metrics sometimes substantially outperforms existing metrics in isolation.
- Score: 117.62186420147563
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Natural language processing researchers have identified limitations of
evaluation methodology for generation tasks, with new questions raised about
the validity of automatic metrics and of crowdworker judgments. Meanwhile,
efforts to improve generation models tend to focus on simple n-gram overlap
metrics (e.g., BLEU, ROUGE). We argue that new advances on models and metrics
should each more directly benefit and inform the other. We therefore propose a
generalization of leaderboards, bidimensional leaderboards (Billboards), that
simultaneously tracks progress in language generation tasks and metrics for
their evaluation. Unlike conventional unidimensional leaderboards that sort
submitted systems by predetermined metrics, a Billboard accepts both generators
and evaluation metrics as competing entries. A Billboard automatically creates
an ensemble metric that selects and linearly combines a few metrics based on a
global analysis across generators. Further, metrics are ranked based on their
correlations with human judgments. We release four Billboards for machine
translation, summarization, and image captioning. We demonstrate that a linear
ensemble of a few diverse metrics sometimes substantially outperforms existing
metrics in isolation. Our mixed-effects model analysis shows that most
automatic metrics, especially the reference-based ones, overrate machine over
human generation, demonstrating the importance of updating metrics as
generation models become stronger (and perhaps more similar to humans) in the
future.
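To make the Billboard mechanism concrete, the following is a minimal sketch of the two core steps described in the abstract: ranking candidate metrics by their correlation with human judgments and linearly combining a small subset into an ensemble metric. The data, metric names, and least-squares fit below are illustrative assumptions, not the authors' implementation, which performs a global analysis across all submitted generators.

```python
# Minimal sketch of correlation-based metric ranking and a linear metric
# ensemble, in the spirit of a Billboard. All numbers below are hypothetical.
import numpy as np
from scipy.stats import pearsonr

# Hypothetical system-level scores: one value per submitted generator.
human = np.array([0.62, 0.55, 0.71, 0.48, 0.66])            # human judgments
metrics = {                                                  # automatic metrics
    "BLEU":      np.array([0.31, 0.29, 0.35, 0.27, 0.33]),
    "ROUGE-L":   np.array([0.45, 0.40, 0.50, 0.38, 0.47]),
    "BERTScore": np.array([0.88, 0.85, 0.91, 0.83, 0.89]),
}

# Rank metrics by correlation with human judgments (higher is better).
ranking = sorted(metrics, key=lambda m: pearsonr(metrics[m], human)[0],
                 reverse=True)
print("metric ranking:", ranking)

# Linearly combine the top-k metrics via least squares against human scores.
k = 2
X = np.column_stack([metrics[m] for m in ranking[:k]] + [np.ones(len(human))])
weights, *_ = np.linalg.lstsq(X, human, rcond=None)
ensemble = X @ weights
print("ensemble correlation with humans:", pearsonr(ensemble, human)[0])
```

With only a handful of generators the least-squares fit is of course unstable; the point is only the shape of the computation: metrics compete on correlation with human judgments, and a few of them are selected and combined linearly.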
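The overrating finding rests on a mixed-effects model analysis. The fragment below sketches one plausible form of such an analysis under stated assumptions (a hypothetical long-format table of segment-level metric scores, a fixed effect for machine vs. human output, and a random intercept per source segment, fit with statsmodels); the paper's exact model specification may differ.

```python
# Hedged sketch of a mixed-effects analysis of machine vs. human generation,
# on a hypothetical long-format table of segment-level metric scores.
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: each source segment has one machine output and one
# human-written output, both scored by the same automatic metric.
df = pd.DataFrame({
    "segment":    ["s1", "s1", "s2", "s2", "s3", "s3",
                   "s4", "s4", "s5", "s5", "s6", "s6"],
    "is_machine": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
    "score":      [0.74, 0.70, 0.68, 0.66, 0.81, 0.77,
                   0.59, 0.60, 0.72, 0.69, 0.65, 0.61],
})

# Fixed effect for machine vs. human output, random intercept per segment.
model = smf.mixedlm("score ~ is_machine", data=df, groups=df["segment"])
result = model.fit()
print(result.summary())
```

A positive coefficient on the machine indicator would mean the metric systematically scores machine outputs above human-written ones, i.e., that it overrates machine generation in the sense discussed above; with toy data this small the fit is only illustrative.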
Related papers
- EvalCrafter: Benchmarking and Evaluating Large Video Generation Models [70.19437817951673]
We argue that it is hard to judge large conditional generative models with simple metrics, since these models are often trained on very large datasets and have multi-aspect abilities.
Our approach involves generating a diverse and comprehensive list of 700 prompts for text-to-video generation.
Then, we evaluate state-of-the-art video generative models on our carefully designed benchmark in terms of visual quality, content quality, motion quality, and text-video alignment, using 17 well-selected objective metrics.
arXiv Detail & Related papers (2023-10-17T17:50:46Z)
- BLEURT Has Universal Translations: An Analysis of Automatic Metrics by Minimum Risk Training [64.37683359609308]
In this study, we analyze various mainstream and cutting-edge automatic metrics from the perspective of their guidance for training machine translation systems.
We find that certain metrics exhibit robustness defects, such as the presence of universal adversarial translations in BLEURT and BARTScore.
In-depth analysis suggests two main causes of these robustness deficits: distribution biases in the training datasets, and the tendency of the metric paradigm.
arXiv Detail & Related papers (2023-07-06T16:59:30Z)
- On the Limitations of Reference-Free Evaluations of Generated Text [64.81682222169113]
We show that reference-free metrics are inherently biased and limited in their ability to evaluate generated text.
We argue that they should not be used to measure progress on tasks like machine translation or summarization.
arXiv Detail & Related papers (2022-10-22T22:12:06Z)
- The Glass Ceiling of Automatic Evaluation in Natural Language Generation [60.59732704936083]
We take a step back and analyze recent progress by comparing the body of existing automatic metrics and human metrics.
Our extensive statistical analysis reveals surprising findings: automatic metrics -- old and new -- are much more similar to each other than to humans.
arXiv Detail & Related papers (2022-08-31T01:13:46Z)
- InfoLM: A New Metric to Evaluate Summarization & Data2Text Generation [27.129551973093008]
InfoLM is a family of untrained metrics that can be viewed as string-based metrics.
This family of metrics also makes use of information measures, allowing InfoLM to be adapted to various evaluation criteria.
arXiv Detail & Related papers (2021-12-02T20:09:29Z)
- BEAMetrics: A Benchmark for Language Generation Evaluation Evaluation [16.81712151903078]
Natural language processing (NLP) systems are increasingly trained to generate open-ended text.
Different metrics have different strengths and biases, and reflect human intuitions better on some tasks than others.
Here, we describe the Benchmark to Evaluate Automatic Metrics (BEAMetrics) to make research into new metrics itself easier to evaluate.
arXiv Detail & Related papers (2021-10-18T10:03:19Z)
- To Ship or Not to Ship: An Extensive Evaluation of Automatic Metrics for Machine Translation [5.972205906525993]
We investigate which metrics have the highest accuracy in producing system-level quality rankings for pairs of systems.
We show that the sole use of BLEU negatively affected the past development of improved models.
arXiv Detail & Related papers (2021-07-22T17:22:22Z)
- GENIE: A Leaderboard for Human-in-the-Loop Evaluation of Text Generation [83.10599735938618]
Leaderboards have eased model development for many NLP datasets by standardizing their evaluation and delegating it to an independent external repository.
This work introduces GENIE, a human evaluation leaderboard that brings the ease of leaderboards to text generation tasks.
arXiv Detail & Related papers (2021-01-17T00:40:47Z)