Evaluating the Efficacy of Summarization Evaluation across Languages
- URL: http://arxiv.org/abs/2106.01478v1
- Date: Wed, 2 Jun 2021 21:28:01 GMT
- Title: Evaluating the Efficacy of Summarization Evaluation across Languages
- Authors: Fajri Koto and Jey Han Lau and Timothy Baldwin
- Abstract summary: We take a summarization corpus for eight different languages, and manually annotate generated summaries for focus (precision) and coverage (recall).
We find that using multilingual BERT within BERTScore performs well across all languages, at a level above that for English.
- Score: 33.46519116869276
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While automatic summarization evaluation methods developed for English are routinely applied to other languages, this is the first attempt to systematically quantify their panlinguistic efficacy. We take a summarization corpus for eight different languages, and manually annotate generated summaries for focus (precision) and coverage (recall). Based on this, we evaluate 19 summarization evaluation metrics, and find that using multilingual BERT within BERTScore performs well across all languages, at a level above that for English.
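The headline finding concerns BERTScore computed over multilingual BERT. As a rough illustration of that setup, the sketch below scores invented system summaries against invented references using the open-source bert_score package, with bert-base-multilingual-cased forced as the underlying encoder; it is not the paper's exact configuration (layer choice, baselines and preprocessing are not reproduced).

```python
# Sketch: BERTScore over multilingual BERT (mBERT), using the bert_score package.
# The candidate and reference strings are invented Indonesian examples.
from bert_score import score

candidates = ["Pemerintah mengumumkan kebijakan ekonomi baru hari ini."]
references = ["Kebijakan ekonomi baru diumumkan oleh pemerintah hari ini."]

# Force multilingual BERT instead of a language-specific default model.
P, R, F1 = score(candidates, references, model_type="bert-base-multilingual-cased")

print(f"precision={P.mean().item():.3f} "
      f"recall={R.mean().item():.3f} "
      f"F1={F1.mean().item():.3f}")
```

BERTScore's precision and recall components line up loosely with the focus and coverage dimensions annotated in the study, which is why both are reported alongside F1.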
Related papers
- Cross-Lingual Auto Evaluation for Assessing Multilingual LLMs [36.30321941154582]
Hercule is a cross-lingual evaluation model that learns to assign scores to responses based on easily available reference answers in English.
This study is the first comprehensive examination of cross-lingual evaluation using LLMs, presenting a scalable and effective approach for multilingual assessment.
arXiv Detail & Related papers (2024-10-17T09:45:32Z)
- Evaluating the IWSLT2023 Speech Translation Tasks: Human Annotations, Automatic Metrics, and Segmentation [50.60733773088296]
We conduct a comprehensive human evaluation of the results of several shared tasks from the last International Workshop on Spoken Language Translation (IWSLT 2023)
We propose an effective evaluation strategy based on automatic resegmentation and direct assessment with segment context.
Our analysis revealed that: 1) the proposed evaluation strategy is robust and its scores correlate well with other types of human judgement; 2) automatic metrics are usually, but not always, well correlated with direct assessment scores; and 3) COMET is a slightly stronger automatic metric than chrF.
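Findings 2) and 3), like the focus/coverage meta-evaluation in the main paper, come down to correlating automatic metric scores with human judgements. A minimal sketch of that computation, using invented per-segment scores in place of real chrF, COMET and direct-assessment outputs:

```python
# Sketch: correlating automatic metric scores with human judgements.
# All score lists are invented placeholders for per-segment values.
from scipy.stats import pearsonr, spearmanr

human_da = [78.0, 55.5, 91.0, 62.5, 70.0, 84.5]   # human direct-assessment ratings
metrics = {
    "chrF":  [0.61, 0.42, 0.77, 0.50, 0.58, 0.69],
    "COMET": [0.74, 0.48, 0.88, 0.57, 0.66, 0.81],
}

for name, scores in metrics.items():
    r, _ = pearsonr(scores, human_da)       # linear correlation
    rho, _ = spearmanr(scores, human_da)    # rank correlation
    print(f"{name}: Pearson r={r:.3f}, Spearman rho={rho:.3f}")
```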
arXiv Detail & Related papers (2024-06-06T09:18:42Z)
- SEAHORSE: A Multilingual, Multifaceted Dataset for Summarization Evaluation [52.186343500576214]
We introduce SEAHORSE, a dataset for multilingual, multifaceted summarization evaluation.
SEAHORSE consists of 96K summaries with human ratings along 6 dimensions of text quality.
We show that metrics trained with SEAHORSE achieve strong performance on the out-of-domain meta-evaluation benchmarks TRUE and mFACE.
arXiv Detail & Related papers (2023-05-22T16:25:07Z)
- Monolingual and Cross-Lingual Acceptability Judgments with the Italian CoLA corpus [2.418273287232718]
We describe the ItaCoLA corpus, containing almost 10,000 sentences with acceptability judgments.
We also present the first cross-lingual experiments, aimed at assessing whether multilingual transformer-based approaches can benefit from using sentences in two languages during fine-tuning.
arXiv Detail & Related papers (2021-09-24T16:18:53Z)
- Does Summary Evaluation Survive Translation to Other Languages? [0.0]
We translate SummEval, an existing English summarization dataset, into four different languages.
We analyze the scores from the automatic evaluation metrics in translated languages, as well as their correlation with human annotations in the source language.
arXiv Detail & Related papers (2021-09-16T17:35:01Z)
- Improving Cross-Lingual Reading Comprehension with Self-Training [62.73937175625953]
Current state-of-the-art models even surpass human performance on several benchmarks.
Previous works have revealed the abilities of pre-trained multilingual models for zero-shot cross-lingual reading comprehension.
This paper further utilizes unlabeled data to improve performance.
arXiv Detail & Related papers (2021-05-08T08:04:30Z)
- WikiLingua: A New Benchmark Dataset for Cross-Lingual Abstractive Summarization [41.578594261746055]
We introduce WikiLingua, a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems.
We extract article and summary pairs in 18 languages from WikiHow, a high-quality, collaborative resource of how-to guides on a diverse set of topics written by human authors.
We create gold-standard article-summary alignments across languages by aligning the images that are used to describe each how-to step in an article.
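The image-based alignment can be pictured as matching how-to steps across languages by the image each step is illustrated with. A minimal sketch under that reading, with invented dictionaries mapping image identifiers to step text (not the authors' actual pipeline or data format):

```python
# Sketch: aligning how-to steps across languages via shared step images.
# The dictionaries below are invented; keys are image identifiers, values are step texts.
english_steps = {
    "img_001.jpg": "Chop the onions finely.",
    "img_002.jpg": "Heat the oil in a pan.",
}
spanish_steps = {
    "img_001.jpg": "Pica las cebollas finamente.",
    "img_003.jpg": "Sirve caliente.",
}

# Steps whose images appear in both languages are treated as aligned.
aligned = [
    (english_steps[img], spanish_steps[img])
    for img in english_steps.keys() & spanish_steps.keys()
]
print(aligned)  # [('Chop the onions finely.', 'Pica las cebollas finamente.')]
```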
arXiv Detail & Related papers (2020-10-07T00:28:05Z)
- XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization [128.37244072182506]
XTREME, the Cross-lingual TRansfer Evaluation of Multilinguals benchmark, evaluates the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks.
We demonstrate that while models tested on English reach human performance on many tasks, there is still a sizable gap in the performance of cross-lingually transferred models.
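The "sizable gap" is usually summarized as a cross-lingual transfer gap: English performance minus the average performance on the other languages after zero-shot transfer. A minimal sketch with invented accuracy numbers:

```python
# Sketch: computing a cross-lingual transfer gap from per-language scores.
# The accuracy values are invented placeholders for a model fine-tuned on English
# and evaluated zero-shot on other languages.
accuracy = {"en": 0.85, "de": 0.78, "ru": 0.74, "sw": 0.61, "th": 0.58}

target_langs = [lang for lang in accuracy if lang != "en"]
avg_target = sum(accuracy[lang] for lang in target_langs) / len(target_langs)
transfer_gap = accuracy["en"] - avg_target

print(f"average target-language accuracy: {avg_target:.3f}")
print(f"cross-lingual transfer gap: {transfer_gap:.3f}")
```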
arXiv Detail & Related papers (2020-03-24T19:09:37Z)
- Automatic Discourse Segmentation: an evaluation in French [65.00134288222509]
We describe several discourse segmentation methods as well as a preliminary evaluation of the segmentation quality.
We have developed three models based solely on resources simultaneously available in several languages: marker lists and statistical POS labeling.
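A marker-list method of this kind can be pictured as cutting a sentence into segments whenever a known discourse marker appears. A minimal sketch with an invented French marker list; the paper's actual resources and POS-based models are not reproduced here.

```python
# Sketch: naive discourse segmentation driven by a marker list.
# The marker list is an invented subset; a real system would combine it with POS tags.
import re

MARKERS = ["mais", "parce que", "cependant", "donc", "alors que"]
# Build one alternation pattern; longer alternatives first so multiword markers match whole.
pattern = re.compile(
    r"\b(" + "|".join(sorted(map(re.escape, MARKERS), key=len, reverse=True)) + r")\b",
    flags=re.IGNORECASE,
)

def segment(sentence: str) -> list[str]:
    """Split a sentence into elementary discourse segments at marker boundaries."""
    parts = pattern.split(sentence)
    segments, current = [], ""
    for part in parts:
        if pattern.fullmatch(part):          # a marker opens a new segment
            if current.strip():
                segments.append(current.strip())
            current = part
        else:
            current += part
    if current.strip():
        segments.append(current.strip())
    return segments

print(segment("Il pleuvait donc nous sommes restés, mais la soirée fut agréable."))
# ['Il pleuvait', 'donc nous sommes restés,', 'mais la soirée fut agréable.']
```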
arXiv Detail & Related papers (2020-02-10T21:35:39Z)