Understanding Translationese in Cross-Lingual Summarization
- URL: http://arxiv.org/abs/2212.07220v2
- Date: Tue, 10 Oct 2023 03:00:35 GMT
- Title: Understanding Translationese in Cross-Lingual Summarization
- Authors: Jiaan Wang, Fandong Meng, Yunlong Liang, Tingyi Zhang, Jiarong Xu,
Zhixu Li, Jie Zhou
- Abstract summary: Cross-lingual summarization (CLS) aims at generating a concise summary in a different target language.
To collect large-scale CLS data, existing datasets typically involve translation in their creation.
In this paper, we first confirm that different approaches to constructing CLS datasets lead to different degrees of translationese.
- Score: 106.69566000567598
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Given a document in a source language, cross-lingual summarization (CLS) aims
at generating a concise summary in a different target language. Unlike
monolingual summarization (MS), naturally occurring source-language documents
paired with target-language summaries are rare. To collect large-scale CLS
data, existing datasets typically involve translation in their creation.
However, the translated text is distinguished from the text originally written
in that language, i.e., translationese. In this paper, we first confirm that
different approaches to constructing CLS datasets will lead to different
degrees of translationese. Then we systematically investigate how
translationese affects CLS model evaluation and performance when it appears in
source documents or target summaries. In detail, we find that (1) the
translationese in documents or summaries of test sets might lead to the
discrepancy between human judgment and automatic evaluation; (2) the
translationese in training sets would harm model performance in real-world
applications; (3) though machine-translated documents involve translationese,
they are very useful for building CLS systems on low-resource languages under
specific training strategies. Lastly, we give suggestions for future CLS
research, including dataset and model development. We hope our work makes
researchers aware of the phenomenon of translationese in CLS and encourages
them to take it into account in the future.
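To make the dataset-construction point and finding (3) concrete, the sketch below illustrates, purely as an assumption-laden example (not a procedure from the paper), the two common translation-based routes for building CLS pairs from monolingual summarization data and where translationese ends up in each case; `translate` is a hypothetical stand-in for any machine translation system.
```python
# Minimal sketch of translation-based CLS dataset construction (illustrative only).
# `translate` is a hypothetical MT hook; plug in any machine translation system.
from typing import Callable, List, Tuple

Translate = Callable[[str, str, str], str]  # (text, src_lang, tgt_lang) -> translated text


def build_cls_pairs_by_summary_translation(
    mono_pairs: List[Tuple[str, str]],  # (document, summary) in the source language
    translate: Translate,
    src: str = "en",
    tgt: str = "zh",
) -> List[Tuple[str, str]]:
    """Translate the summaries: translationese appears on the TARGET side."""
    return [(doc, translate(summ, src, tgt)) for doc, summ in mono_pairs]


def build_cls_pairs_by_document_translation(
    mono_pairs: List[Tuple[str, str]],  # (document, summary) in the target language
    translate: Translate,
    src: str = "en",
    tgt: str = "zh",
) -> List[Tuple[str, str]]:
    """Translate the documents back into the source language: translationese
    appears on the SOURCE side."""
    return [(translate(doc, tgt, src), summ) for doc, summ in mono_pairs]
```
Under the first route the target summaries carry translationese; under the second the source documents do, which is the setting finding (3) says can still be useful for low-resource CLS under suitable training strategies.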
Related papers
- Think Carefully and Check Again! Meta-Generation Unlocking LLMs for Low-Resource Cross-Lingual Summarization [108.6908427615402]
Cross-lingual summarization (CLS) aims to generate a summary of the source text in a different target language.
Currently, instruction-tuned large language models (LLMs) excel at various English tasks.
Recent studies have shown that LLMs' performance on CLS tasks remains unsatisfactory even in few-shot settings.
arXiv Detail & Related papers (2024-10-26T00:39:44Z)
- ConVerSum: A Contrastive Learning-based Approach for Data-Scarce Solution of Cross-Lingual Summarization Beyond Direct Equivalents [4.029675201787349]
Cross-lingual summarization is a sophisticated branch in Natural Language Processing.
No feasible solution for CLS currently exists when high-quality CLS data are unavailable.
We propose a novel data-efficient approach, ConVerSum, for CLS leveraging the power of contrastive learning.
arXiv Detail & Related papers (2024-08-17T19:03:53Z)
- Leveraging Entailment Judgements in Cross-Lingual Summarisation [3.771795120498178]
Cross-Lingual Summarisation (CLS) datasets are prone to including document-summary pairs where the reference summary is unfaithful to the corresponding document.
This low data quality misleads model learning and obscures evaluation results.
We propose using off-the-shelf cross-lingual Natural Language Inference (X-NLI) to evaluate the faithfulness of reference and model-generated summaries; a minimal scoring sketch follows below.
arXiv Detail & Related papers (2024-08-01T16:18:09Z)
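As a rough illustration of the X-NLI idea above, the sketch below scores a summary's faithfulness as the entailment probability that an off-the-shelf cross-lingual NLI model assigns to the (document, summary) pair. The checkpoint name and the pair-level scoring are assumptions for illustration, not the authors' exact setup; long documents would typically be split into sentences first.
```python
# Sketch: faithfulness as cross-lingual entailment probability (illustrative).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed off-the-shelf X-NLI checkpoint; any cross-lingual NLI model would do.
MODEL_NAME = "joeddav/xlm-roberta-large-xnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

# Look up the "entailment" class index from the model config rather than hard-coding it.
ENTAIL_ID = {v.lower(): k for k, v in model.config.id2label.items()}["entailment"]


def faithfulness_score(document: str, summary: str) -> float:
    """Probability that the (possibly other-language) summary is entailed by the document."""
    inputs = tokenizer(document, summary, truncation=True, return_tensors="pt")
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
    return probs[ENTAIL_ID].item()
```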
- Do We Need Language-Specific Fact-Checking Models? The Case of Chinese [15.619421104102516]
This paper investigates the potential benefits of language-specific fact-checking models, focusing on the case of Chinese.
We first demonstrate the limitations of translation-based methods and multilingual large language models, highlighting the need for language-specific systems.
We propose a Chinese fact-checking system that can better retrieve evidence from a document by incorporating context information.
arXiv Detail & Related papers (2024-01-27T20:26:03Z)
- CroCoSum: A Benchmark Dataset for Cross-Lingual Code-Switched Summarization [25.182666420286132]
Given the rarity of naturally occurring CLS resources, the majority of datasets are forced to rely on translation.
This restricts our ability to observe naturally occurring CLS pairs that capture organic diction, including instances of code-switching.
We introduce CroCoSum, a dataset of cross-lingual code-switched summarization of technology news.
arXiv Detail & Related papers (2023-03-07T17:52:51Z)
- A Variational Hierarchical Model for Neural Cross-Lingual Summarization [85.44969140204026]
Cross-lingual summarization (CLS) aims to convert a document in one language into a summary in another.
Existing studies on CLS mainly focus on utilizing pipeline methods or jointly training an end-to-end model.
We propose a hierarchical model for the CLS task, based on the conditional variational auto-encoder.
arXiv Detail & Related papers (2022-03-08T02:46:11Z)
- On Cross-Lingual Retrieval with Multilingual Text Encoders [51.60862829942932]
We study the suitability of state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks.
We benchmark their performance in unsupervised ad-hoc sentence- and document-level CLIR experiments.
We also evaluate multilingual encoders fine-tuned in a supervised fashion (i.e., learning to rank) on English relevance data in a series of zero-shot language and domain transfer CLIR experiments; a minimal retrieval sketch follows below.
arXiv Detail & Related papers (2021-12-21T08:10:27Z)
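To make the unsupervised CLIR setup above concrete, here is a minimal retrieval sketch using mean-pooled embeddings from a multilingual encoder ranked by cosine similarity. The checkpoint name is an assumption, and this is an illustrative baseline rather than the paper's evaluation pipeline.
```python
# Sketch: unsupervised cross-lingual sentence retrieval with a multilingual encoder.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "xlm-roberta-base"  # assumed multilingual encoder checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)
encoder.eval()


def embed(texts):
    """Mean-pool the last hidden states into one normalized vector per text."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state          # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)              # (B, T, 1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)     # masked mean pooling
    return torch.nn.functional.normalize(pooled, dim=-1)


def rank(query: str, candidates: list) -> list:
    """Return candidate indices sorted by cosine similarity to the query."""
    sims = embed([query]) @ embed(candidates).T               # (1, N)
    return sims[0].argsort(descending=True).tolist()
```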
- Cross-lingual Machine Reading Comprehension with Language Branch Knowledge Distillation [105.41167108465085]
Cross-lingual Machine Reading Comprehension (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-resource languages.
We propose a novel augmentation approach named Language Branch Machine Reading Comprehension (LBMRC).
LBMRC trains multiple machine reading comprehension (MRC) models, each proficient in an individual language.
We devise a multilingual distillation approach to amalgamate knowledge from multiple language branch models into a single model for all target languages; a generic distillation sketch follows below.
arXiv Detail & Related papers (2020-10-27T13:12:17Z)
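As a generic sketch of the multilingual distillation step described above (not the paper's exact objective), the snippet below trains a single student to match the output distributions of several language-branch teachers, picking the teacher by each example's language.
```python
# Sketch: distilling several language-branch teachers into one multilingual student.
import torch
import torch.nn.functional as F


def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """Soft-label KL divergence between teacher and student distributions."""
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)


def train_step(student, teachers_by_lang, batch, optimizer):
    """One step: each example is supervised by the teacher for its language branch."""
    optimizer.zero_grad()
    loss = 0.0
    for lang, examples in batch.items():                  # batch grouped by language
        with torch.no_grad():
            teacher_logits = teachers_by_lang[lang](examples)
        student_logits = student(examples)
        loss = loss + distillation_loss(student_logits, teacher_logits)
    loss.backward()
    optimizer.step()
    return float(loss)
```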
- Mixed-Lingual Pre-training for Cross-lingual Summarization [54.4823498438831]
Cross-lingual Summarization aims at producing a summary in the target language for an article in the source language.
We propose a solution based on mixed-lingual pre-training that leverages both cross-lingual tasks, such as translation, and monolingual tasks, such as masked language modeling; a data-mixing sketch follows below.
Our model achieves improvements of 2.82 (English to Chinese) and 1.15 (Chinese to English) ROUGE-1 points over state-of-the-art results.
arXiv Detail & Related papers (2020-10-18T00:21:53Z)
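The mixed-lingual pre-training idea above can be illustrated with a small data-mixing sketch: translation pairs and masked-language-model examples are cast into one text-to-text format and interleaved into a single pre-training stream. The task prefixes and sampling ratio are assumptions, not the paper's recipe.
```python
# Sketch: interleaving cross-lingual (translation) and monolingual (masked LM)
# examples into one text-to-text pre-training stream (illustrative only).
import random


def make_translation_example(src_text: str, tgt_text: str) -> dict:
    return {"input": f"translate: {src_text}", "target": tgt_text}


def make_masked_lm_example(text: str, mask_token: str = "<mask>", ratio: float = 0.15) -> dict:
    """Randomly mask a fraction of tokens; the target is the masked-out tokens."""
    masked, targets = [], []
    for tok in text.split():
        if random.random() < ratio:
            masked.append(mask_token)
            targets.append(tok)
        else:
            masked.append(tok)
    return {"input": "fill: " + " ".join(masked), "target": " ".join(targets)}


def mixed_stream(translation_pairs, mono_texts, p_translation: float = 0.5):
    """Yield a pre-training stream that mixes both task types at a fixed ratio."""
    while True:
        if random.random() < p_translation and translation_pairs:
            src, tgt = random.choice(translation_pairs)
            yield make_translation_example(src, tgt)
        elif mono_texts:
            yield make_masked_lm_example(random.choice(mono_texts))
```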