The GEM Benchmark: Natural Language Generation, its Evaluation and
Metrics
- URL: http://arxiv.org/abs/2102.01672v2
- Date: Wed, 3 Feb 2021 18:09:36 GMT
- Title: The GEM Benchmark: Natural Language Generation, its Evaluation and
Metrics
- Authors: Sebastian Gehrmann, Tosin Adewumi, Karmanya Aggarwal, Pawan Sasanka
Ammanamanchi, Aremu Anuoluwapo, Antoine Bosselut, Khyathi Raghavi Chandu,
Miruna Clinciu, Dipanjan Das, Kaustubh D. Dhole, Wanyu Du, Esin Durmus,
Ondřej Dušek, Chris Emezue, Varun Gangal, Cristina Garbacea,
Tatsunori Hashimoto, Yufang Hou, Yacine Jernite, Harsh Jhamtani, Yangfeng Ji,
Shailza Jolly, Dhruv Kumar, Faisal Ladhak, Aman Madaan, Mounica Maddela,
Khyati Mahajan, Saad Mahamood, Bodhisattwa Prasad Majumder, Pedro Henrique
Martins, Angelina McMillan-Major, Simon Mille, Emiel van Miltenburg, Moin
Nadeem, Shashi Narayan, Vitaly Nikolaev, Rubungo Andre Niyongabo, Salomey
Osei, Ankur Parikh, Laura Perez-Beltrachini, Niranjan Ramesh Rao, Vikas
Raunak, Juan Diego Rodriguez, Sashank Santhanam, João Sedoc, Thibault
Sellam, Samira Shaikh, Anastasia Shimorina, Marco Antonio Sobrevilla
Cabezudo, Hendrik Strobelt, Nishant Subramani, Wei Xu, Diyi Yang, Akhila
Yerukola, Jiawei Zhou
- Abstract summary: We introduce GEM, a living benchmark for natural language Generation (NLG), its Evaluation, and Metrics.
Regular updates to the benchmark will help NLG research become more multilingual and evolve the challenge alongside models.
- Score: 66.96150429230035
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce GEM, a living benchmark for natural language Generation (NLG),
its Evaluation, and Metrics. Measuring progress in NLG relies on a constantly
evolving ecosystem of automated metrics, datasets, and human evaluation
standards. However, due to this moving target, new models often still evaluate
on divergent anglo-centric corpora with well-established, but flawed, metrics.
This disconnect makes it challenging to identify the limitations of current
models and opportunities for progress. Addressing this limitation, GEM provides
an environment in which models can easily be applied to a wide set of corpora
and evaluation strategies can be tested. Regular updates to the benchmark will
help NLG research become more multilingual and evolve the challenge alongside
models.
This paper serves as the description of the initial release for which we are
organizing a shared task at our ACL 2021 Workshop and to which we invite the
entire NLG community to participate.
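As a concrete illustration of the "apply a model to many corpora, then evaluate" workflow described above, here is a minimal sketch that loads one GEM task from the Hugging Face Hub and scores a trivial copy baseline with corpus BLEU. The dataset identifier ("gem" with config "common_gen"), the field names, and the use of sacrebleu are assumptions based on the GEM data cards for the initial release; this is not the official GEM evaluation suite and details may differ in later versions.

```python
# Minimal sketch, not the official GEM evaluation suite.
# Assumptions: the Hub identifier "gem" with config "common_gen" and the
# fields "concepts" / "target" follow the GEM data cards for the initial
# release and may change in later versions.
from datasets import load_dataset
import sacrebleu

data = load_dataset("gem", "common_gen", split="validation")

# Trivial copy baseline: echo the input concept list as the "generated" text.
hypotheses = [" ".join(example["concepts"]) for example in data]
references = [example["target"] for example in data]

# Corpus-level BLEU against one reference per example.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"Copy-baseline BLEU on common_gen validation: {bleu.score:.2f}")
```

Under these assumptions, switching to another GEM task only requires swapping the config name, which is the point of a shared data-loading and evaluation environment.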
Related papers
- Benchmarking LLMs' Judgments with No Gold Standard [8.517244114791913]
We introduce GEM (Generative Estimator for Mutual Information), an evaluation metric for assessing language generation by Large Language Models (LLMs).
In experiments on a human-annotated dataset, GEM demonstrates competitive correlations with human scores compared to the state-of-the-art GPT-4o Examiner.
We also present GRE-bench, which evaluates LLMs based on how well they can generate high-quality peer reviews for academic research papers.
arXiv Detail & Related papers (2024-11-11T16:58:36Z)
- BENCHAGENTS: Automated Benchmark Creation with Agent Interaction [16.4783894348333]
We introduce BENCHAGENTS, a framework that methodically leverages large language models (LLMs) to automate benchmark creation for complex capabilities.
We use BENCHAGENTS to create benchmarks to evaluate capabilities related to planning and constraint satisfaction during text generation.
We then use these benchmarks to study seven state-of-the-art models and extract new insights on common failure modes and model differences.
arXiv Detail & Related papers (2024-10-29T22:56:18Z)
- Benchmarks as Microscopes: A Call for Model Metrology [76.64402390208576]
Modern language models (LMs) pose a new challenge in capability assessment.
To be confident in our metrics, we need a new discipline of model metrology.
arXiv Detail & Related papers (2024-07-22T17:52:12Z)
- Themis: A Reference-free NLG Evaluation Language Model with Flexibility and Interpretability [39.12792986841385]
In this paper, we construct a large-scale NLG evaluation corpus NLG-Eval with annotations from both human and GPT-4.
We also propose Themis, an LLM dedicated to NLG evaluation, trained with our multi-perspective consistency verification and rating-oriented preference alignment methods.
Themis exhibits superior evaluation performance on various NLG tasks, simultaneously generalizing well to unseen tasks and surpassing other evaluation models, including GPT-4.
arXiv Detail & Related papers (2024-06-26T14:04:29Z)
- The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models [94.31327813151208]
BiGGen Bench is a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks.
A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation.
arXiv Detail & Related papers (2024-06-09T12:30:30Z)
- G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment [64.01972723692587]
We present G-Eval, a framework of using large language models with chain-of-thoughts (CoT) and a form-filling paradigm to assess the quality of NLG outputs.
We show that G-Eval with GPT-4 as the backbone model achieves a Spearman correlation of 0.514 with human judgments on the summarization task, outperforming all previous methods by a large margin (a minimal sketch of this kind of metric-human correlation check appears at the end of this page).
arXiv Detail & Related papers (2023-03-29T12:46:54Z)
- GEMv2: Multilingual NLG Benchmarking in a Single Line of Code [161.1761414080574]
GEMv2, the new version of the Generation, Evaluation, and Metrics benchmark, introduces a modular infrastructure for dataset, model, and metric developers.
GEMv2 supports 40 documented datasets in 51 languages.
Models for all datasets can be evaluated online and our interactive data card creation and rendering tools make it easier to add new datasets to the living benchmark.
arXiv Detail & Related papers (2022-06-22T17:52:30Z)
- GLGE: A New General Language Generation Evaluation Benchmark [139.25515221280767]
General Language Generation Evaluation (GLGE) is a new multi-task benchmark for evaluating the generalization capabilities of NLG models.
To encourage research on pretraining and transfer learning on NLG models, we make GLGE publicly available and build a leaderboard with strong baselines.
arXiv Detail & Related papers (2020-11-24T06:59:45Z)
- A Survey of Evaluation Metrics Used for NLG Systems [19.20118684502313]
The success of Deep Learning has created a surge of interest in a wide range of Natural Language Generation (NLG) tasks.
Unlike classification tasks, automatically evaluating NLG systems is itself a major challenge.
The expanding number of NLG models and the shortcomings of current metrics have led to a rapid surge in the number of evaluation metrics proposed since 2014.
arXiv Detail & Related papers (2020-08-27T09:25:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
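Several entries above (G-Eval, Themis, and the mutual-information GEM estimator) judge an automatic metric by how strongly its scores correlate with human ratings of the same outputs. The sketch below shows that meta-evaluation step with Spearman correlation; the score lists are illustrative placeholders, not data from any cited paper.

```python
# Minimal meta-evaluation sketch: correlate an automatic metric's scores with
# human judgments of the same outputs. Placeholder numbers, not real data.
from scipy.stats import spearmanr

human_ratings = [4.0, 2.5, 3.0, 5.0, 1.5, 3.5]        # human quality judgments
metric_scores = [0.71, 0.42, 0.55, 0.88, 0.30, 0.51]  # automatic metric outputs

rho, p_value = spearmanr(human_ratings, metric_scores)
print(f"Spearman correlation with human judgments: {rho:.3f} (p = {p_value:.3f})")
```

A higher rank correlation means the metric orders outputs more like a human would, which is the sense in which results such as G-Eval's reported 0.514 on summarization are read.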