Bridging Cross-Lingual Gaps During Leveraging the Multilingual
Sequence-to-Sequence Pretraining for Text Generation
- URL: http://arxiv.org/abs/2204.07834v1
- Date: Sat, 16 Apr 2022 16:08:38 GMT
- Title: Bridging Cross-Lingual Gaps During Leveraging the Multilingual
Sequence-to-Sequence Pretraining for Text Generation
- Authors: Changtong Zan, Liang Ding, Li Shen, Yu Cao, Weifeng Liu, Dacheng Tao
- Abstract summary: We extend the vanilla pretrain-finetune pipeline with an extra code-switching restore task to bridge the gap between the pretrain and finetune stages.
Our approach could narrow the cross-lingual sentence representation distance and improve low-frequency word translation with trivial computational cost.
- Score: 80.16548523140025
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: For multilingual sequence-to-sequence pretrained language models
(multilingual Seq2Seq PLMs), e.g. mBART, the self-supervised pretraining task
is trained on a wide range of monolingual languages, e.g. 25 languages from
CommonCrawl, while the downstream cross-lingual tasks generally progress on a
bilingual language subset, e.g. English-German. This creates a cross-lingual
data discrepancy, namely \textit{domain discrepancy}, and a cross-lingual
learning objective discrepancy, namely \textit{task discrepancy}, between the
pretrain and finetune stages. To bridge these cross-lingual domain and task
gaps, we extend the vanilla pretrain-finetune pipeline with an extra
code-switching restore task. Specifically, the first stage employs the
self-supervised code-switching restore task as a pretext task, allowing the
multilingual Seq2Seq PLM to acquire some in-domain alignment information. In
the second stage, we fine-tune the model on labeled data as usual. Experiments
on a variety of cross-lingual NLG tasks, including 12 bilingual translation
tasks, 36 zero-shot translation tasks, and cross-lingual summarization tasks,
show that our model consistently outperforms the strong baseline mBART.
Comprehensive analyses indicate that our approach could narrow the
cross-lingual sentence representation distance and improve low-frequency word
translation with trivial computational cost.
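The abstract describes the code-switching restore task only at a high level. The following is a minimal sketch of one plausible instantiation, assuming dictionary-based word substitution and the HuggingFace checkpoint facebook/mbart-large-cc25; the bilingual dictionary, replacement ratio, and training setup are illustrative assumptions, not details taken from the paper.
```python
# Sketch: first-stage code-switching restore objective (assumed construction).
import random
from transformers import MBartForConditionalGeneration, MBartTokenizer

def code_switch(sentence: str, bilingual_dict: dict, ratio: float = 0.3) -> str:
    """Replace a fraction of source words with their dictionary translations."""
    out = []
    for tok in sentence.split():
        if tok.lower() in bilingual_dict and random.random() < ratio:
            out.append(bilingual_dict[tok.lower()])
        else:
            out.append(tok)
    return " ".join(out)

tokenizer = MBartTokenizer.from_pretrained(
    "facebook/mbart-large-cc25", src_lang="en_XX", tgt_lang="en_XX")
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")

bilingual_dict = {"house": "Haus", "cat": "Katze"}  # toy en->de dictionary
original = "the cat sat near the house"
corrupted = code_switch(original, bilingual_dict)

# The model is trained to restore the original sentence from its
# code-switched version with the standard seq2seq cross-entropy loss.
batch = tokenizer(corrupted, text_target=original, return_tensors="pt")
loss = model(**batch).loss
loss.backward()
```
In the second stage, the same model would simply be fine-tuned on the labeled downstream data as usual.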
Related papers
- CUNI Submission to MRL 2023 Shared Task on Multi-lingual Multi-task
Information Retrieval [5.97515243922116]
We present the Charles University system for the MRL2023 Shared Task on Multi-lingual Multi-task Information Retrieval.
The goal of the shared task was to develop systems for named entity recognition and question answering in several under-represented languages.
Our solutions to both subtasks rely on the translate-test approach.
arXiv Detail & Related papers (2023-10-25T10:22:49Z)
- Efficiently Aligned Cross-Lingual Transfer Learning for Conversational
Tasks using Prompt-Tuning [98.60739735409243]
Cross-lingual transfer of language models trained on high-resource languages like English has been widely studied for many NLP tasks.
We introduce XSGD, a parallel and large-scale multilingual conversation dataset, for cross-lingual alignment pretraining.
To facilitate aligned cross-lingual representations, we develop an efficient prompt-tuning-based method for learning alignment prompts.
arXiv Detail & Related papers (2023-04-03T18:46:01Z)
- Bridging the Gap between Language Models and Cross-Lingual Sequence
Labeling [101.74165219364264]
Large-scale cross-lingual pre-trained language models (xPLMs) have shown effectiveness in cross-lingual sequence labeling tasks.
Despite this success, we make the empirical observation that there is a training objective gap between the pre-training and fine-tuning stages.
In this paper, we first design a pre-training task tailored for xSL, named Cross-lingual Language Informative Span Masking (CLISM), to eliminate the objective gap.
Second, we present ContrAstive-Consistency Regularization (CACR), which utilizes contrastive learning to encourage consistency between the representations of parallel input sequences.
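As a rough illustration of this contrastive-consistency idea, the sketch below treats parallel source/target sentence embeddings as positive pairs in an InfoNCE-style loss; the encoder outputs, temperature, and in-batch negatives are illustrative assumptions rather than CACR's exact formulation.
```python
# Sketch: contrastive loss over embeddings of parallel sentence pairs.
import torch
import torch.nn.functional as F

def contrastive_consistency_loss(src_repr: torch.Tensor,
                                 tgt_repr: torch.Tensor,
                                 temperature: float = 0.05) -> torch.Tensor:
    """src_repr, tgt_repr: (batch, hidden) embeddings of parallel pairs."""
    src = F.normalize(src_repr, dim=-1)
    tgt = F.normalize(tgt_repr, dim=-1)
    logits = src @ tgt.t() / temperature  # (batch, batch) similarity matrix
    labels = torch.arange(src.size(0), device=src.device)
    # Each source sentence should be closest to its own translation.
    return F.cross_entropy(logits, labels)

loss = contrastive_consistency_loss(torch.randn(16, 768), torch.randn(16, 768))
```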
arXiv Detail & Related papers (2022-04-11T15:55:20Z)
- MCL@IITK at SemEval-2021 Task 2: Multilingual and Cross-lingual
Word-in-Context Disambiguation using Augmented Data, Signals, and
Transformers [1.869621561196521]
We present our approach for solving SemEval 2021 Task 2: Multilingual and Cross-lingual Word-in-Context Disambiguation (MCL-WiC).
The goal is to detect whether a given word common to both sentences evokes the same meaning.
We submit systems for both the settings - Multilingual and Cross-Lingual.
arXiv Detail & Related papers (2021-04-04T08:49:28Z)
- VECO: Variable and Flexible Cross-lingual Pre-training for Language
Understanding and Generation [77.82373082024934]
We plug a cross-attention module into the Transformer encoder to explicitly build the interdependence between languages.
This effectively avoids the degeneration of predicting masked words conditioned only on the context in their own language.
The proposed cross-lingual model delivers new state-of-the-art results on various cross-lingual understanding tasks of the XTREME benchmark.
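As a rough illustration of what plugging a cross-attention module into a Transformer encoder layer could look like, here is a minimal PyTorch sketch in which one language's token states also attend to the paired language's states; the wiring, dimensions, and layer structure are assumptions for illustration, not VECO's actual architecture.
```python
# Sketch: encoder layer with an extra cross-attention block over the
# parallel sentence in the other language (assumed wiring).
import torch
import torch.nn as nn

class CrossLingualEncoderLayer(nn.Module):
    def __init__(self, d_model: int = 768, n_heads: int = 12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, other_lang: torch.Tensor) -> torch.Tensor:
        # Self-attention over the sentence in its own language.
        x = self.norm1(x + self.self_attn(x, x, x)[0])
        # Cross-attention over the parallel sentence in the other language,
        # so masked words are not predicted from own-language context alone.
        x = self.norm2(x + self.cross_attn(x, other_lang, other_lang)[0])
        return self.norm3(x + self.ffn(x))

layer = CrossLingualEncoderLayer()
out = layer(torch.randn(2, 10, 768), torch.randn(2, 12, 768))
```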
arXiv Detail & Related papers (2020-10-30T03:41:38Z)
- FILTER: An Enhanced Fusion Method for Cross-lingual Language
Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
We further propose an additional KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for the translated text in the target language.
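A minimal sketch of such a KL-divergence self-teaching loss is given below, assuming classification-style logits and soft pseudo-labels produced on the translated text; the tensor shapes and temperature are illustrative assumptions, not taken from the paper.
```python
# Sketch: KL self-teaching loss pulling student predictions toward soft pseudo-labels.
import torch
import torch.nn.functional as F

def self_teaching_kl_loss(student_logits: torch.Tensor,
                          pseudo_label_logits: torch.Tensor,
                          temperature: float = 1.0) -> torch.Tensor:
    """KL(pseudo-label distribution || student distribution), batch-averaged."""
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(pseudo_label_logits.detach() / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")

# Example with random logits of shape (batch, num_labels).
student = torch.randn(8, 5, requires_grad=True)
teacher = torch.randn(8, 5)
loss = self_teaching_kl_loss(student, teacher)
loss.backward()
```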
arXiv Detail & Related papers (2020-09-10T22:42:15Z)
- CoSDA-ML: Multi-Lingual Code-Switching Data Augmentation for Zero-Shot
Cross-Lingual NLP [68.2650714613869]
We propose a data augmentation framework to generate multi-lingual code-switching data to fine-tune mBERT.
Compared with the existing work, our method does not rely on bilingual sentences for training, and requires only one training process for multiple target languages.
arXiv Detail & Related papers (2020-06-11T13:15:59Z)