SEAHORSE: A Multilingual, Multifaceted Dataset for Summarization Evaluation
- URL: http://arxiv.org/abs/2305.13194v2
- Date: Wed, 1 Nov 2023 22:29:53 GMT
- Title: SEAHORSE: A Multilingual, Multifaceted Dataset for Summarization Evaluation
- Authors: Elizabeth Clark, Shruti Rijhwani, Sebastian Gehrmann, Joshua Maynez,
Roee Aharoni, Vitaly Nikolaev, Thibault Sellam, Aditya Siddhant, Dipanjan
Das, Ankur P. Parikh
- Abstract summary: We introduce SEAHORSE, a dataset for multilingual, multifaceted summarization evaluation.
SEAHORSE consists of 96K summaries with human ratings along 6 dimensions of text quality.
We show that metrics trained with SEAHORSE achieve strong performance on the out-of-domain meta-evaluation benchmarks TRUE and mFACE.
- Score: 52.186343500576214
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reliable automatic evaluation of summarization systems is challenging due to
the multifaceted and subjective nature of the task. This is especially the case
for languages other than English, where human evaluations are scarce. In this
work, we introduce SEAHORSE, a dataset for multilingual, multifaceted
summarization evaluation. SEAHORSE consists of 96K summaries with human ratings
along 6 dimensions of text quality: comprehensibility, repetition, grammar,
attribution, main ideas, and conciseness, covering 6 languages, 9 systems and 4
datasets. As a result of its size and scope, SEAHORSE can serve both as a
benchmark to evaluate learnt metrics, as well as a large-scale resource for
training such metrics. We show that metrics trained with SEAHORSE achieve
strong performance on the out-of-domain meta-evaluation benchmarks TRUE
(Honovich et al., 2022) and mFACE (Aharoni et al., 2022). We make the SEAHORSE
dataset and metrics publicly available for future research on multilingual and
multifaceted summarization evaluation.
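To make the abstract's description concrete, the sketch below models one human-rating record with the six quality dimensions listed above. The field names and types are illustrative assumptions for this note, not the schema of the released SEAHORSE files.

```python
from dataclasses import dataclass

# Illustrative sketch of one SEAHORSE-style rating record.
# Field names and types are assumptions; the released dataset's
# actual schema may differ.
@dataclass
class SummaryRating:
    language: str        # one of the 6 covered languages
    system: str          # one of the 9 summarization systems
    dataset: str         # one of the 4 underlying datasets
    summary: str
    # The six quality dimensions from the paper, as yes/no judgments.
    comprehensible: bool
    no_repetition: bool
    good_grammar: bool
    attribution: bool
    main_ideas: bool
    concise: bool

def positive_rate(ratings: list[SummaryRating], dimension: str) -> float:
    """Fraction of ratings where the given dimension was judged positively."""
    if not ratings:
        return 0.0
    return sum(getattr(r, dimension) for r in ratings) / len(ratings)
```

For example, `positive_rate(ratings, "attribution")` would give the share of summaries judged fully attributable to their source documents, which is the kind of signal a learnt metric trained on SEAHORSE is meant to predict.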
Related papers
- MMTEB: Massive Multilingual Text Embedding Benchmark [85.18187649328792]
We introduce the Massive Multilingual Text Embedding Benchmark (MMTEB).
MMTEB covers over 500 quality-controlled evaluation tasks across 250+ languages.
We develop several highly multilingual benchmarks, which we use to evaluate a representative set of models.
arXiv Detail & Related papers (2025-02-19T10:13:43Z) - FaMTEB: Massive Text Embedding Benchmark in Persian Language [9.204800002382042]
This paper introduces a comprehensive benchmark for Persian (Farsi) text embeddings built upon the Massive Text Embedding Benchmark (MTEB)
Our benchmark includes 63 datasets spanning seven different tasks.
We evaluate the performance of several Persian and multilingual embedding models in a range of tasks.
arXiv Detail & Related papers (2025-02-17T09:05:21Z) - Finding Replicable Human Evaluations via Stable Ranking Probability [28.87806354986128]
We use machine translation and its state-of-the-art human evaluation framework, MQM, as a case study to understand how to set up reliable human evaluations.
Our study on two language pairs provides concrete recommendations for designing replicable human evaluation studies.
arXiv Detail & Related papers (2024-04-01T20:50:13Z) - Breaking Language Barriers in Multilingual Mathematical Reasoning: Insights and Observations [59.056367787688146]
This paper pioneers the exploration and training of powerful Multilingual Math Reasoning (xMR) LLMs.
By utilizing translation, we construct the first multilingual math reasoning instruction dataset, MGSM8KInstruct, encompassing ten distinct languages.
arXiv Detail & Related papers (2023-10-31T08:09:20Z) - XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented
Languages [105.54207724678767]
Data scarcity is a crucial issue for the development of highly multilingual NLP systems.
We propose XTREME-UP, a benchmark defined by its focus on the scarce-data scenario rather than zero-shot.
XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies.
arXiv Detail & Related papers (2023-05-19T18:00:03Z) - Large Language Models are Diverse Role-Players for Summarization
Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most automatic evaluation methods, such as BLEU/ROUGE, may not be able to adequately capture these dimensions.
We propose a new LLM-based evaluation framework that compares generated text and reference text along both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z) - LANS: Large-scale Arabic News Summarization Corpus [20.835296945483275]
We build LANS, a large-scale and diverse dataset for the Arabic text summarization task.
LANS offers 8.4 million articles and their summaries, extracted from the metadata of newspaper websites between 1999 and 2019.
arXiv Detail & Related papers (2022-10-24T20:54:01Z) - FRMT: A Benchmark for Few-Shot Region-Aware Machine Translation [64.9546787488337]
We present FRMT, a new dataset and evaluation benchmark for Few-shot Region-aware Machine Translation.
The dataset consists of professional translations from English into two regional variants each of Portuguese and Mandarin Chinese.
arXiv Detail & Related papers (2022-10-01T05:02:04Z) - BEAMetrics: A Benchmark for Language Generation Evaluation Evaluation [16.81712151903078]
Natural language processing (NLP) systems are increasingly trained to generate open-ended text.
Different metrics have different strengths and biases, and reflect human intuitions better on some tasks than others.
Here, we describe the Benchmark to Evaluate Automatic Metrics (BEAMetrics) to make research into new metrics itself easier to evaluate.
arXiv Detail & Related papers (2021-10-18T10:03:19Z) - Does Summary Evaluation Survive Translation to Other Languages? [0.0]
We translate an existing English summarization dataset, SummEval, into four different languages.
We analyze the scores from the automatic evaluation metrics in the translated languages, as well as their correlation with human annotations in the source language (see the sketch after this list).
arXiv Detail & Related papers (2021-09-16T17:35:01Z)