Does Summary Evaluation Survive Translation to Other Languages?
- URL: http://arxiv.org/abs/2109.08129v1
- Date: Thu, 16 Sep 2021 17:35:01 GMT
- Title: Does Summary Evaluation Survive Translation to Other Languages?
- Authors: Neslihan Iskender, Oleg Vasilyev, Tim Polzehl, John Bohannon,
Sebastian Möller
- Abstract summary: We translate an existing English summarization dataset, the SummEval dataset, into four different languages.
We analyze the scores from the automatic evaluation metrics in translated languages, as well as their correlation with human annotations in the source language.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The creation of a large summarization quality dataset is a considerable,
expensive, and time-consuming effort, requiring careful planning and setup. It
includes producing human-written and machine-generated summaries and evaluating
those summaries, both by humans (preferably linguistic experts) and by automatic
evaluation tools. If such an effort is made in one language, it would be
beneficial to be able to reuse it in other languages. To investigate how far we
can trust the translation of such a dataset without repeating the human
annotations in another language, we translated an existing English summarization
dataset, the SummEval dataset, into four different languages and analyzed the
scores from the automatic evaluation metrics in the translated languages, as
well as their correlation with the human annotations in the source language. Our
results reveal that although translation changes the absolute values of the
automatic scores, the scores keep the same rank order and approximately the same
correlations with human annotations.
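The abstract's central claim can be illustrated with a rank correlation check. The sketch below is not the paper's code; the scores are invented for demonstration, and a plain Kendall tau is used to show that translation can shift absolute metric values while leaving the system ranking intact.

```python
# Minimal sketch of the rank-order check the abstract describes:
# translation shifts absolute metric scores, but the ranking of
# systems should stay the same. All score values are hypothetical.

def kendall_tau(a, b):
    """Kendall rank correlation between two equal-length score lists."""
    n = len(a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            # Pairs ordered the same way in both lists are concordant.
            s = (a[i] - a[j]) * (b[i] - b[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical ROUGE-like scores for five systems, in English and
# after translation: absolute values drop, but the ordering holds.
english    = [0.42, 0.35, 0.51, 0.28, 0.47]
translated = [0.38, 0.30, 0.45, 0.25, 0.41]

print(kendall_tau(english, translated))  # 1.0 when rank order is preserved
```

A tau of 1.0 means every pair of systems is ranked identically in both languages; values near 1.0 would indicate the ranking survives translation with minor perturbations.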
Related papers
- Exploring the Correlation between Human and Machine Evaluation of Simultaneous Speech Translation [0.9576327614980397]
This study aims to assess the reliability of automatic metrics in evaluating simultaneous interpretations by analyzing their correlation with human evaluations.
As a benchmark we use human assessments performed by language experts, and evaluate how well sentence embeddings and Large Language Models correlate with them.
The results suggest that GPT models, particularly GPT-3.5 with direct prompting, demonstrate the strongest correlation with human judgment in terms of semantic similarity between source and target texts.
arXiv Detail & Related papers (2024-06-14T14:47:19Z) - Evaluating the IWSLT2023 Speech Translation Tasks: Human Annotations, Automatic Metrics, and Segmentation [50.60733773088296]
We conduct a comprehensive human evaluation of the results of several shared tasks from the last International Workshop on Spoken Language Translation (IWSLT 2023)
We propose an effective evaluation strategy based on automatic resegmentation and direct assessment with segment context.
Our analysis revealed that: 1) the proposed evaluation strategy is robust and its scores correlate well with other types of human judgement; 2) automatic metrics are usually, but not always, well correlated with direct assessment scores; and 3) COMET is a slightly stronger automatic metric than chrF.
arXiv Detail & Related papers (2024-06-06T09:18:42Z) - Iterative Translation Refinement with Large Language Models [25.90607157524168]
We propose iteratively prompting a large language model to self-correct a translation.
We also discuss the challenges in evaluation and relation to human performance and translationese.
arXiv Detail & Related papers (2023-06-06T16:51:03Z) - SEAHORSE: A Multilingual, Multifaceted Dataset for Summarization
Evaluation [52.186343500576214]
We introduce SEAHORSE, a dataset for multilingual, multifaceted summarization evaluation.
SEAHORSE consists of 96K summaries with human ratings along 6 dimensions of text quality.
We show that metrics trained with SEAHORSE achieve strong performance on the out-of-domain meta-evaluation benchmarks TRUE and mFACE.
arXiv Detail & Related papers (2023-05-22T16:25:07Z) - mFACE: Multilingual Summarization with Factual Consistency Evaluation [79.60172087719356]
Abstractive summarization has enjoyed renewed interest in recent years, thanks to pre-trained language models and the availability of large-scale datasets.
Despite promising results, current models still suffer from generating factually inconsistent summaries.
We leverage factual consistency evaluation models to improve multilingual summarization.
arXiv Detail & Related papers (2022-12-20T19:52:41Z) - Neural Label Search for Zero-Shot Multi-Lingual Extractive Summarization [80.94424037751243]
In zero-shot multilingual extractive text summarization, a model is typically trained on an English dataset and then applied to summarization datasets in other languages.
We propose NLS (Neural Label Search for Summarization), which jointly learns hierarchical weights for different sets of labels together with our summarization model.
We conduct multilingual zero-shot summarization experiments on MLSUM and WikiLingua datasets, and we achieve state-of-the-art results using both human and automatic evaluations.
arXiv Detail & Related papers (2022-04-28T14:02:16Z) - Backretrieval: An Image-Pivoted Evaluation Metric for Cross-Lingual Text
Representations Without Parallel Corpora [19.02834713111249]
Backretrieval is shown to correlate with ground truth metrics on annotated datasets.
Our experiments conclude with a case study on a recipe dataset without parallel cross-lingual data.
arXiv Detail & Related papers (2021-05-11T12:14:24Z) - Improving Cross-Lingual Reading Comprehension with Self-Training [62.73937175625953]
Current state-of-the-art models even surpass human performance on several benchmarks.
Previous works have revealed the abilities of pre-trained multilingual models for zero-shot cross-lingual reading comprehension.
This paper further utilizes unlabeled data to improve performance.
arXiv Detail & Related papers (2021-05-08T08:04:30Z) - Cross-lingual Approach to Abstractive Summarization [0.0]
Cross-lingual model transfers are successfully applied in low-resource languages.
We used a pretrained English summarization model based on deep neural networks and sequence-to-sequence architecture.
We developed several models with different proportions of target language data for fine-tuning.
arXiv Detail & Related papers (2020-12-08T09:30:38Z) - Curious Case of Language Generation Evaluation Metrics: A Cautionary
Tale [52.663117551150954]
A few popular metrics remain the de facto standard for evaluating tasks such as image captioning and machine translation.
This is partly due to ease of use, and partly because researchers expect to see them and know how to interpret them.
In this paper, we urge the community to consider more carefully how they automatically evaluate their models.
arXiv Detail & Related papers (2020-10-26T13:57:20Z) - WikiLingua: A New Benchmark Dataset for Cross-Lingual Abstractive
Summarization [41.578594261746055]
We introduce WikiLingua, a large-scale, multilingual dataset for the evaluation of crosslingual abstractive summarization systems.
We extract article and summary pairs in 18 languages from WikiHow, a high quality, collaborative resource of how-to guides on a diverse set of topics written by human authors.
We create gold-standard article-summary alignments across languages by aligning the images that are used to describe each how-to step in an article.
arXiv Detail & Related papers (2020-10-07T00:28:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.