GEMv2: Multilingual NLG Benchmarking in a Single Line of Code
- URL: http://arxiv.org/abs/2206.11249v3
- Date: Fri, 24 Jun 2022 12:48:24 GMT
- Title: GEMv2: Multilingual NLG Benchmarking in a Single Line of Code
- Authors: Sebastian Gehrmann, Abhik Bhattacharjee, Abinaya Mahendiran, Alex
Wang, Alexandros Papangelis, Aman Madaan, Angelina McMillan-Major, Anna
Shvets, Ashish Upadhyay, Bingsheng Yao, Bryan Wilie, Chandra Bhagavatula,
Chaobin You, Craig Thomson, Cristina Garbacea, Dakuo Wang, Daniel Deutsch,
Deyi Xiong, Di Jin, Dimitra Gkatzia, Dragomir Radev, Elizabeth Clark, Esin
Durmus, Faisal Ladhak, Filip Ginter, Genta Indra Winata, Hendrik Strobelt,
Hiroaki Hayashi, Jekaterina Novikova, Jenna Kanerva, Jenny Chim, Jiawei Zhou,
Jordan Clive, Joshua Maynez, João Sedoc, Juraj Juraska, Kaustubh Dhole,
Khyathi Raghavi Chandu, Laura Perez-Beltrachini, Leonardo F. R. Ribeiro,
Lewis Tunstall, Li Zhang, Mahima Pushkarna, Mathias Creutz, Michael White,
Mihir Sanjay Kale, Moussa Kamal Eddine, Nico Daheim, Nishant Subramani,
Ondrej Dusek, Paul Pu Liang, Pawan Sasanka Ammanamanchi, Qi Zhu, Ratish
Puduppully, Reno Kriz, Rifat Shahriyar, Ronald Cardenas, Saad Mahamood,
Salomey Osei, Samuel Cahyawijaya, Sanja Štajner, Sebastien Montella,
Shailza, Shailza Jolly, Simon Mille, Tahmid Hasan, Tianhao Shen, Tosin
Adewumi, Vikas Raunak, Vipul Raheja, Vitaly Nikolaev, Vivian Tsai, Yacine
Jernite, Ying Xu, Yisi Sang, Yixin Liu, Yufang Hou
- Abstract summary: GEMv2, the new version of the Generation, Evaluation, and Metrics Benchmark, introduces a modular infrastructure for dataset, model, and metric developers.
GEMv2 supports 40 documented datasets in 51 languages.
Models for all datasets can be evaluated online and our interactive data card creation and rendering tools make it easier to add new datasets to the living benchmark.
- Score: 161.1761414080574
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Evaluation in machine learning is usually informed by past choices, for
example which datasets or metrics to use. This standardization enables the
comparison on equal footing using leaderboards, but the evaluation choices
become sub-optimal as better alternatives arise. This problem is especially
pertinent in natural language generation which requires ever-improving suites
of datasets, metrics, and human evaluation to make definitive claims. To make
following best model evaluation practices easier, we introduce GEMv2. The new
version of the Generation, Evaluation, and Metrics Benchmark introduces a
modular infrastructure for dataset, model, and metric developers to benefit
from each other's work. GEMv2 supports 40 documented datasets in 51 languages.
Models for all datasets can be evaluated online and our interactive data card
creation and rendering tools make it easier to add new datasets to the living
benchmark.
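As an illustration of the single-line workflow the title and abstract refer to: GEM datasets are distributed through the Hugging Face Hub (the GEM namespace), so a documented dataset can be pulled with one load_dataset call. The sketch below is a minimal example under that assumption; the dataset identifier GEM/web_nlg_en, the split name, and the field names are illustrative choices, not a prescribed GEMv2 API.

```python
# Minimal sketch of loading a GEMv2 dataset with the Hugging Face `datasets`
# library. The identifier "GEM/web_nlg_en", the split name, and the field
# names below are illustrative assumptions.
from datasets import load_dataset

# The "single line": pull one of the documented GEM datasets from the Hub.
web_nlg = load_dataset("GEM/web_nlg_en", split="validation")

# Inspect a few examples; GEM datasets typically expose an id, an input,
# and one or more reference targets (exact keys vary per dataset).
for example in web_nlg.select(range(3)):
    print(example.get("gem_id"), example.get("target"))
```

Metric computation, online evaluation, and data-card creation are handled by the companion tooling described in the abstract; those interfaces are not sketched here.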
Related papers
- MELO: An Evaluation Benchmark for Multilingual Entity Linking of Occupations [0.5528844566370006]
We present a new collection of 48 datasets for evaluating the linking of entity mentions in 21 languages to the ESCO Occupations taxonomy.
MELO was built using high-quality, pre-existent human annotations.
arXiv Detail & Related papers (2024-10-10T19:14:54Z)
- Do Text-to-Vis Benchmarks Test Real Use of Visualisations? [11.442971909006657]
This paper investigates whether benchmarks reflect real-world use through an empirical study comparing benchmark datasets with code from public repositories.
Our findings reveal a substantial gap, with evaluations not testing the same distribution of chart types, attributes, and actions as real-world examples.
One dataset is representative, but requires extensive modification to become a practical end-to-end benchmark.
This shows that new benchmarks are needed to support the development of systems that truly address users' visualisation needs.
arXiv Detail & Related papers (2024-07-29T06:13:28Z)
- MLLM-DataEngine: An Iterative Refinement Approach for MLLM [62.30753425449056]
We propose a novel closed-loop system that bridges data generation, model training, and evaluation.
Within each loop, the MLLM-DataEngine first analyzes the weaknesses of the model based on the evaluation results.
For targeted data generation, we propose an Adaptive Bad-case Sampling module, which adjusts the ratio of different types of data.
For quality, we resort to GPT-4 to generate high-quality data for each given data type.
arXiv Detail & Related papers (2023-08-25T01:41:04Z)
- Instruction Mining: Instruction Data Selection for Tuning Large Language Models [18.378654454336136]
InstructMining is designed for automatically selecting premium instruction-following data for finetuning large language models.
We show that InstructMining achieves state-of-the-art performance on two of the most popular benchmarks: LLM-as-a-judge and the Hugging Face OpenLLM leaderboard.
arXiv Detail & Related papers (2023-07-12T16:37:31Z)
- T5Score: Discriminative Fine-tuning of Generative Evaluation Metrics [94.69907794006826]
We present a framework that combines the best of both worlds, using both supervised and unsupervised signals from whatever data we have available.
We operationalize this idea by training T5Score, a metric that uses these training signals with mT5 as the backbone.
T5Score achieves the best performance on all datasets against existing top-scoring metrics at the segment level.
arXiv Detail & Related papers (2022-12-12T06:29:04Z)
- This is the way: designing and compiling LEPISZCZE, a comprehensive NLP benchmark for Polish [5.8090623549313944]
We introduce LEPISZCZE, a new, comprehensive benchmark for Polish NLP.
We use five datasets from the Polish benchmark and add eight novel datasets.
We provide insights and experiences learned while creating the benchmark for Polish as the blueprint to design similar benchmarks for other low-resourced languages.
arXiv Detail & Related papers (2022-11-23T16:51:09Z)
- Multi-lingual Evaluation of Code Generation Models [82.7357812992118]
We present new benchmarks for evaluating code generation models: MBXP, Multilingual HumanEval, and MathQA-X.
These datasets cover over 10 programming languages.
We are able to assess the performance of code generation models in a multi-lingual fashion.
arXiv Detail & Related papers (2022-10-26T17:17:06Z)
- ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models [102.63817106363597]
We build ELEVATER, the first benchmark to compare and evaluate pre-trained language-augmented visual models.
It consists of 20 image classification datasets and 35 object detection datasets, each of which is augmented with external knowledge.
We will release our toolkit and evaluation platforms for the research community.
arXiv Detail & Related papers (2022-04-19T10:23:42Z)
- X-FACT: A New Benchmark Dataset for Multilingual Fact Checking [21.2633064526968]
We introduce X-FACT: the largest publicly available multilingual dataset for factual verification of naturally existing real-world claims.
The dataset contains short statements in 25 languages and is labeled for veracity by expert fact-checkers.
arXiv Detail & Related papers (2021-06-17T05:09:54Z)
- e-ViL: A Dataset and Benchmark for Natural Language Explanations in Vision-Language Tasks [52.918087305406296]
We introduce e-ViL, a benchmark for evaluating explainable vision-language tasks.
We also introduce e-SNLI-VE, the largest existing dataset with natural language explanations (NLEs).
We propose a new model that combines UNITER, which learns joint embeddings of images and text, and GPT-2, a pre-trained language model.
arXiv Detail & Related papers (2021-05-08T18:46:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.