SEAHORSE: A Multilingual, Multifaceted Dataset for Summarization
Evaluation
- URL: http://arxiv.org/abs/2305.13194v2
- Date: Wed, 1 Nov 2023 22:29:53 GMT
- Title: SEAHORSE: A Multilingual, Multifaceted Dataset for Summarization
Evaluation
- Authors: Elizabeth Clark, Shruti Rijhwani, Sebastian Gehrmann, Joshua Maynez,
Roee Aharoni, Vitaly Nikolaev, Thibault Sellam, Aditya Siddhant, Dipanjan
Das, Ankur P. Parikh
- Abstract summary: We introduce SEAHORSE, a dataset for multilingual, multifaceted summarization evaluation.
SEAHORSE consists of 96K summaries with human ratings along 6 dimensions of text quality.
We show that metrics trained with SEAHORSE achieve strong performance on the out-of-domain meta-evaluation benchmarks TRUE and mFACE.
- Score: 52.186343500576214
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reliable automatic evaluation of summarization systems is challenging due to
the multifaceted and subjective nature of the task. This is especially the case
for languages other than English, where human evaluations are scarce. In this
work, we introduce SEAHORSE, a dataset for multilingual, multifaceted
summarization evaluation. SEAHORSE consists of 96K summaries with human ratings
along 6 dimensions of text quality: comprehensibility, repetition, grammar,
attribution, main ideas, and conciseness, covering 6 languages, 9 systems and 4
datasets. As a result of its size and scope, SEAHORSE can serve both as a
benchmark to evaluate learnt metrics and as a large-scale resource for training
such metrics. We show that metrics trained with SEAHORSE achieve
strong performance on the out-of-domain meta-evaluation benchmarks TRUE
(Honovich et al., 2022) and mFACE (Aharoni et al., 2022). We make the SEAHORSE
dataset and metrics publicly available for future research on multilingual and
multifaceted summarization evaluation.
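To make the dataset's structure concrete, here is a minimal Python sketch of how SEAHORSE-style records (a system summary paired with human ratings along the six quality dimensions) might be loaded and aggregated per summarization system. The file name, field names, and 0/1 rating encoding are illustrative assumptions, not the released schema.

```python
# Minimal sketch of working with SEAHORSE-style annotations.
# File layout, field names, and the 0/1 rating encoding are
# illustrative assumptions; consult the released dataset for the
# actual schema.
import json
from collections import defaultdict

DIMENSIONS = [
    "comprehensibility", "repetition", "grammar",
    "attribution", "main_ideas", "conciseness",
]

def load_records(path):
    """Read one JSON object per line: {summary, language, system, ratings}."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

def mean_ratings_by_system(records):
    """Average each quality dimension per summarization system."""
    totals = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for rec in records:
        counts[rec["system"]] += 1
        for dim in DIMENSIONS:
            totals[rec["system"]][dim] += rec["ratings"][dim]  # e.g. 0/1 labels
    return {
        system: {dim: totals[system][dim] / n for dim in DIMENSIONS}
        for system, n in counts.items()
    }

if __name__ == "__main__":
    records = load_records("seahorse_annotations.jsonl")  # hypothetical file
    for system, scores in mean_ratings_by_system(records).items():
        print(system, scores)
```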
Related papers
- Finding Replicable Human Evaluations via Stable Ranking Probability [28.87806354986128]
We use machine translation and its state-of-the-art human evaluation framework, MQM, as a case study to understand how to set up reliable human evaluations.
Our study on two language pairs provides concrete recommendations for designing replicable human evaluation studies.
arXiv Detail & Related papers (2024-04-01T20:50:13Z)
- Breaking Language Barriers in Multilingual Mathematical Reasoning: Insights and Observations [59.056367787688146]
This paper pioneers the exploration and training of powerful Multilingual Math Reasoning (xMR) LLMs.
By utilizing translation, we construct the first multilingual math reasoning instruction dataset, MGSM8KInstruct, encompassing ten distinct languages.
arXiv Detail & Related papers (2023-10-31T08:09:20Z)
- XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages [105.54207724678767]
Data scarcity is a crucial issue for the development of highly multilingual NLP systems.
We propose XTREME-UP, a benchmark defined by its focus on the scarce-data scenario rather than zero-shot.
XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies.
arXiv Detail & Related papers (2023-05-19T18:00:03Z)
- Large Language Models are Diverse Role-Players for Summarization Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most automatic evaluation methods, such as BLEU/ROUGE, may not be able to adequately capture these dimensions.
We propose a new LLM-based framework that evaluates generated text against reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z)
- LANS: Large-scale Arabic News Summarization Corpus [20.835296945483275]
We build LANS, a large-scale and diverse dataset for the Arabic text summarization task.
LANS offers 8.4 million articles and their summaries, extracted from newspaper websites' metadata between 1999 and 2019.
arXiv Detail & Related papers (2022-10-24T20:54:01Z)
- FRMT: A Benchmark for Few-Shot Region-Aware Machine Translation [64.9546787488337]
We present FRMT, a new dataset and evaluation benchmark for Few-shot Region-aware Machine Translation.
The dataset consists of professional translations from English into two regional variants each of Portuguese and Mandarin Chinese.
arXiv Detail & Related papers (2022-10-01T05:02:04Z)
- BEAMetrics: A Benchmark for Language Generation Evaluation Evaluation [16.81712151903078]
Natural language processing (NLP) systems are increasingly trained to generate open-ended text.
Different metrics have different strengths and biases, and reflect human intuitions better on some tasks than others.
Here, we describe the Benchmark to Evaluate Automatic Metrics (BEAMetrics) to make research into new metrics itself easier to evaluate.
arXiv Detail & Related papers (2021-10-18T10:03:19Z)
- Does Summary Evaluation Survive Translation to Other Languages? [0.0]
We translate an existing English summarization dataset, SummEval, into four different languages.
We analyze the scores from the automatic evaluation metrics in translated languages, as well as their correlation with human annotations in the source language.
arXiv Detail & Related papers (2021-09-16T17:35:01Z)
- GRUEN for Evaluating Linguistic Quality of Generated Text [17.234442722611803]
We propose GRUEN for evaluating Grammaticality, non-Redundancy, focUs, structure and coherENce of generated text.
GRUEN utilizes a BERT-based model and a class of syntactic, semantic, and contextual features to examine the system output.
arXiv Detail & Related papers (2020-10-06T05:59:25Z)
- XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization [128.37244072182506]
XTREME (Cross-lingual TRansfer Evaluation of Multilingual Encoders) is a benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks.
We demonstrate that while models tested on English reach human performance on many tasks, there is still a sizable gap in the performance of cross-lingually transferred models.
arXiv Detail & Related papers (2020-03-24T19:09:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.