Advancing Multilingual Pre-training: TRIP Triangular Document-level
Pre-training for Multilingual Language Models
- URL: http://arxiv.org/abs/2212.07752v2
- Date: Sat, 13 May 2023 08:45:14 GMT
- Title: Advancing Multilingual Pre-training: TRIP Triangular Document-level
Pre-training for Multilingual Language Models
- Authors: Hongyuan Lu, Haoyang Huang, Shuming Ma, Dongdong Zhang, Wai Lam, Furu
Wei
- Abstract summary: We present Triangular Document-level Pre-training (TRIP), which is the first in the field to accelerate the conventional monolingual and bilingual objectives into a trilingual objective with a novel method called Grafting.
TRIP achieves several strong state-of-the-art (SOTA) scores on three multilingual document-level machine translation benchmarks and one cross-lingual abstractive summarization benchmark, including consistent improvements of up to 3.11 d-BLEU points and 8.9 ROUGE-L points.
- Score: 107.83158521848372
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the success of multilingual sequence-to-sequence pre-training, most
existing approaches rely on document-level monolingual corpora in many different
languages, sentence-level bilingual corpora, and sometimes synthetic document-level
bilingual corpora. (In this paper, 'bilingual corpora' denotes parallel corpora with
'bilingual translation pairs' in many different language pairs, each consisting of two
sentences/documents with the same meaning written in different languages; 'trilingual
corpora' denotes parallel corpora with 'trilingual translation pairs' in many different
language combinations, each consisting of three sentences/documents.) This hampers
performance on cross-lingual document-level tasks such as document-level translation.
We therefore propose to mine and leverage document-level trilingual parallel corpora
to improve sequence-to-sequence multilingual pre-training. We present Triangular
Document-level Pre-training (TRIP), which is the first in the field to accelerate the
conventional monolingual and bilingual objectives into a trilingual objective with a
novel method called Grafting. Experiments show that TRIP achieves several strong
state-of-the-art (SOTA) scores on three multilingual document-level machine
translation benchmarks and one cross-lingual abstractive summarization benchmark,
including consistent improvements of up to 3.11 d-BLEU points and 8.9 ROUGE-L points.
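
The abstract does not spell out how Grafting turns monolingual and bilingual data into a trilingual objective, so the following Python sketch is only one plausible reading: given a mined document triple (the same document in three languages), splice the first half of the document in language A onto the second half in language B, and train the model to generate the full document in language C. The DocTriple type, the half-and-half cut, and the language tags are illustrative assumptions, not the paper's actual procedure.

```python
# Hypothetical sketch of a Grafting-style trilingual training instance.
# The splice point, data types, and language tags are assumptions made
# for illustration; the paper defines the real procedure.
from dataclasses import dataclass

@dataclass
class DocTriple:
    """The same document in three languages (a 'trilingual translation pair')."""
    src_a: list  # sentences in language A
    src_b: list  # sentence-aligned translation in language B
    tgt_c: list  # sentence-aligned translation in language C

def graft_instance(triple, tag_a, tag_b, tag_c):
    """Splice the first half of the A document onto the second half of the
    B document; the target is the full C document."""
    cut = len(triple.src_a) // 2
    grafted = triple.src_a[:cut] + triple.src_b[cut:]
    source = f"<{tag_a}> <{tag_b}> " + " ".join(grafted)
    target = f"<{tag_c}> " + " ".join(triple.tgt_c)
    return source, target

# Toy two-sentence document in German, French, and English.
triple = DocTriple(
    src_a=["Guten Morgen.", "Wie geht es dir?"],
    src_b=["Bonjour.", "Comment vas-tu ?"],
    tgt_c=["Good morning.", "How are you?"],
)
src, tgt = graft_instance(triple, "de", "fr", "en")
print(src)  # <de> <fr> Guten Morgen. Comment vas-tu ?
print(tgt)  # <en> Good morning. How are you?
```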
Related papers
- T3L: Translate-and-Test Transfer Learning for Cross-Lingual Text
Classification [50.675552118811]
Cross-lingual text classification is typically built on large-scale multilingual language models (LMs) pretrained on a variety of languages of interest.
We propose revisiting the classic "translate-and-test" pipeline to neatly separate the translation and classification stages.
arXiv Detail & Related papers (2023-06-08T07:33:22Z)
- Bridging Cross-Lingual Gaps During Leveraging the Multilingual
Sequence-to-Sequence Pretraining for Text Generation [80.16548523140025]
We extend the vanilla pretrain-finetune pipeline with an extra code-switching restore task to bridge the gap between the pretrain and finetune stages.
Our approach could narrow the cross-lingual sentence representation distance and improve low-frequency word translation with trivial computational cost.
arXiv Detail & Related papers (2022-04-16T16:08:38Z)
- Multilingual Pre-training with Language and Task Adaptation for
Multilingual Text Style Transfer [14.799109368073548]
We exploit the pre-trained seq2seq model mBART for multilingual text style transfer.
Using machine-translated data as well as gold-aligned English sentences yields state-of-the-art results.
arXiv Detail & Related papers (2022-03-16T11:27:48Z)
- Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language.
The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German.
We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and articles' bodies from language-aligned Wikipedia titles.
arXiv Detail & Related papers (2022-02-19T11:55:40Z)
- Syntax-augmented Multilingual BERT for Cross-lingual Transfer [37.99210035238424]
This work shows that explicitly providing language syntax and training mBERT helps cross-lingual transfer.
Experiment results show that syntax-augmented mBERT improves cross-lingual transfer on popular benchmarks.
arXiv Detail & Related papers (2021-06-03T21:12:50Z)
- Are Multilingual Models Effective in Code-Switching? [57.78477547424949]
We study the effectiveness of multilingual language models to understand their capability and adaptability to the mixed-language setting.
Our findings suggest that pre-trained multilingual models do not necessarily guarantee high-quality representations on code-switching.
arXiv Detail & Related papers (2021-03-24T16:20:02Z)
- Scalable Cross-lingual Document Similarity through Language-specific
Concept Hierarchies [0.0]
This paper presents an unsupervised document similarity algorithm that does not require parallel or comparable corpora.
The algorithm annotates topics automatically created from documents in a single language with cross-lingual labels.
Experiments performed on the English, Spanish and French editions of the JRC-Acquis corpus reveal promising results on classifying and sorting documents by similar content.
arXiv Detail & Related papers (2020-12-15T10:42:40Z)
- FILTER: An Enhanced Fusion Method for Cross-lingual Language
Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
We further propose an additional KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language (see the sketch below).
arXiv Detail & Related papers (2020-09-10T22:42:15Z)
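
The FILTER entry above ends with a KL-divergence self-teaching loss over auto-generated soft pseudo-labels. A minimal PyTorch sketch of such a loss, assuming a classification head and treating the pseudo-labels as fixed teacher outputs; the function name, temperature knob, and shapes are illustrative assumptions, not FILTER's actual code.

```python
# Hedged sketch of a KL-divergence self-teaching loss: the student is
# trained on target-language text against soft pseudo-labels that were
# auto-generated for its source-language translation.
import torch
import torch.nn.functional as F

def self_teaching_loss(student_logits: torch.Tensor,
                       teacher_logits: torch.Tensor,
                       temperature: float = 1.0) -> torch.Tensor:
    """KL(soft pseudo-labels || student) averaged over the batch."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    # Pseudo-labels are treated as constants: no gradient to the teacher pass.
    q_teacher = F.softmax(teacher_logits.detach() / temperature, dim=-1)
    return F.kl_div(log_p_student, q_teacher, reduction="batchmean")

# Toy usage: 4 examples, 3 classes.
student = torch.randn(4, 3, requires_grad=True)
teacher = torch.randn(4, 3)
loss = self_teaching_loss(student, teacher)
loss.backward()
```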