EMMA-X: An EM-like Multilingual Pre-training Algorithm for Cross-lingual
Representation Learning
- URL: http://arxiv.org/abs/2310.17233v1
- Date: Thu, 26 Oct 2023 08:31:00 GMT
- Title: EMMA-X: An EM-like Multilingual Pre-training Algorithm for Cross-lingual
Representation Learning
- Authors: Ping Guo, Xiangpeng Wei, Yue Hu, Baosong Yang, Dayiheng Liu, Fei
Huang, Jun Xie
- Abstract summary: We propose EMMA-X: an EM-like Multilingual pre-training Algorithm to learn (X)Cross-lingual universals.
EMMA-X unifies the cross-lingual representation learning task and an extra semantic relation prediction task within an EM framework.
- Score: 74.60554112841307
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Expressing universal semantics common to all languages helps in
understanding the meanings of complex and culture-specific sentences. The
research theme underlying this scenario focuses on learning universal
representations across languages using massive parallel corpora.
However, due to the sparsity and scarcity of parallel data, learning authentic
"universals" for any two languages remains a major challenge. In
this paper, we propose EMMA-X: an EM-like Multilingual pre-training Algorithm,
to learn (X)Cross-lingual universals with the aid of abundant multilingual
non-parallel data. EMMA-X unifies the cross-lingual representation learning
task and an extra semantic relation prediction task within an EM framework.
Both the extra semantic classifier and the cross-lingual sentence encoder
approximate the semantic relation of two sentences and supervise each other
until convergence. To evaluate EMMA-X, we conduct experiments on XRETE, a newly
introduced benchmark containing 12 widely studied cross-lingual tasks that
fully depend on sentence-level representations. Results reveal that EMMA-X
achieves state-of-the-art performance. A further geometric analysis of the built
representation space against three requirements demonstrates the superiority of
EMMA-X over advanced models.
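To make the mutual-supervision idea concrete, the following is a minimal PyTorch sketch of one EM-like training step, assuming a toy mean-pooled sentence encoder and a small relation classifier. The component names, the similarity-bucketing pseudo-labels, and the simple loss sum are illustrative assumptions, not EMMA-X's actual training objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins for the two EMMA-X components; all sizes and names are assumptions.
DIM, NUM_RELATIONS = 64, 4
encoder = nn.Embedding(1000, DIM)                      # stand-in sentence encoder (mean-pooled below)
relation_head = nn.Linear(3 * DIM, NUM_RELATIONS)      # the extra semantic-relation classifier
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(relation_head.parameters()), lr=1e-3)

def embed(ids):
    """Mean-pool token embeddings into a sentence vector of shape (batch, DIM)."""
    return encoder(ids).mean(dim=1)

def em_style_step(src_ids, tgt_ids):
    """One mutual-supervision step: each module produces pseudo-labels for the other."""
    z_src, z_tgt = embed(src_ids), embed(tgt_ids)
    logits = relation_head(torch.cat([z_src, z_tgt, (z_src - z_tgt).abs()], dim=-1))

    # Encoder's view of the relation: bucket cosine similarity into NUM_RELATIONS bins.
    sim = F.cosine_similarity(z_src, z_tgt, dim=-1)
    enc_labels = ((sim + 1) / 2 * NUM_RELATIONS).long().clamp(max=NUM_RELATIONS - 1)

    # Classifier's view of the relation: its argmax becomes a similarity target for the encoder.
    target_sim = logits.argmax(dim=-1).detach().float() / (NUM_RELATIONS - 1)

    # The two modules supervise each other: the classifier learns from the encoder's
    # pseudo-labels, while the encoder is pulled toward the classifier's prediction.
    loss = F.cross_entropy(logits, enc_labels) + F.mse_loss(sim, target_sim)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

# Non-parallel batches of token ids from two languages (random placeholders).
print(em_style_step(torch.randint(0, 1000, (8, 16)), torch.randint(0, 1000, (8, 16))))
```

In practice the two losses would alternate or be weighted until the two views of the semantic relation converge; this sketch only shows the circular supervision signal.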
Related papers
- VECO 2.0: Cross-lingual Language Model Pre-training with
Multi-granularity Contrastive Learning [56.47303426167584]
We propose VECO 2.0, a cross-lingual pre-trained model based on contrastive learning with multi-granularity alignments.
Specifically, sequence-to-sequence alignment is induced to maximize the similarity of parallel pairs and minimize that of non-parallel pairs.
Token-to-token alignment is integrated to bridge the gap between synonymous tokens, mined via a thesaurus dictionary, and the other unpaired tokens in a bilingual instance.
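As a rough illustration of the sequence-to-sequence alignment described above, the sketch below shows a generic in-batch contrastive (InfoNCE-style) loss over sentence embeddings. The temperature, the symmetric formulation, and the use of in-batch negatives are assumptions, and the token-to-token alignment term is omitted.

```python
import torch
import torch.nn.functional as F

def sequence_contrastive_loss(src_emb, tgt_emb, temperature=0.05):
    """In-batch contrastive loss over sentence embeddings: the i-th source/target
    rows form a parallel pair (positive); every other pairing in the batch is
    treated as non-parallel (negative)."""
    src = F.normalize(src_emb, dim=-1)
    tgt = F.normalize(tgt_emb, dim=-1)
    logits = src @ tgt.t() / temperature                     # (batch, batch) cosine similarities
    labels = torch.arange(src.size(0), device=src.device)    # positives sit on the diagonal
    # Symmetric formulation: align source-to-target and target-to-source.
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

# Usage with random placeholder embeddings standing in for encoder outputs.
loss = sequence_contrastive_loss(torch.randn(8, 64), torch.randn(8, 64))
```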
arXiv Detail & Related papers (2023-04-17T12:23:41Z)
- SimCSum: Joint Learning of Simplification and Cross-lingual Summarization for Cross-lingual Science Journalism [8.187718963808484]
Cross-lingual science journalism generates popular science stories from scientific articles, in a language different from the source, for a non-expert audience.
We improve cross-lingual summary generation by joint training of two high-level NLP tasks, simplification and cross-lingual summarization.
SimCSum demonstrates statistically significant improvements over the state-of-the-art on two non-synthetic cross-lingual scientific datasets.
arXiv Detail & Related papers (2023-04-04T08:24:22Z)
- Modeling Sequential Sentence Relation to Improve Cross-lingual Dense Retrieval [87.11836738011007]
We propose a multilingual language model called the masked sentence model (MSM).
MSM consists of a sentence encoder to generate the sentence representations, and a document encoder applied to a sequence of sentence vectors from a document.
To train the model, we propose a masked sentence prediction task, which masks and predicts the sentence vector via a hierarchical contrastive loss with sampled negatives.
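The masked sentence prediction task can be pictured with the sketch below, assuming a small Transformer document encoder over precomputed sentence vectors and a flat (non-hierarchical) contrastive loss with sampled negatives; all sizes and names are illustrative, not MSM's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM = 64  # sentence-vector size (an assumption)
# Document encoder applied to a sequence of sentence vectors.
doc_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True), num_layers=2)
mask_vec = nn.Parameter(torch.zeros(DIM))  # learnable [MASK] sentence vector

def masked_sentence_loss(sent_vecs, mask_pos, negatives, temperature=0.05):
    """Mask one sentence vector per document, reconstruct it from context, and
    score it against the true vector plus sampled negatives (contrastive)."""
    masked = sent_vecs.clone()
    masked[:, mask_pos] = mask_vec                    # replace the masked position
    pred = doc_encoder(masked)[:, mask_pos]           # contextual prediction, (batch, DIM)
    target = sent_vecs[:, mask_pos]                   # the original sentence vector
    candidates = torch.cat([target.unsqueeze(1), negatives], dim=1)       # (batch, 1+K, DIM)
    logits = F.cosine_similarity(pred.unsqueeze(1), candidates, dim=-1) / temperature
    labels = torch.zeros(sent_vecs.size(0), dtype=torch.long)             # true vector is index 0
    return F.cross_entropy(logits, labels)

# Usage: 8 documents of 6 sentence vectors each, with 5 sampled negatives per document.
loss = masked_sentence_loss(torch.randn(8, 6, DIM), mask_pos=2, negatives=torch.randn(8, 5, DIM))
```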
arXiv Detail & Related papers (2023-02-03T09:54:27Z)
- Multi-level Distillation of Semantic Knowledge for Pre-training Multilingual Language Model [15.839724725094916]
Multi-level Multilingual Knowledge Distillation (MMKD) is a novel method for improving multilingual language models.
We employ a teacher-student framework to transfer the rich semantic representation knowledge of English BERT.
We conduct experiments on cross-lingual evaluation benchmarks including XNLI, PAWS-X, and XQuAD.
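A minimal sketch of the teacher-student idea follows, assuming just two levels of supervision: a representation-level MSE term and a prediction-level KL term. MMKD's actual levels, alignment objectives, and weighting may differ.

```python
import torch
import torch.nn.functional as F

def multi_level_distillation_loss(student_hidden, teacher_hidden,
                                  student_logits, teacher_logits, temperature=2.0):
    """Two-level distillation: a representation-level term pulls the student's
    sentence embeddings toward the English teacher's, and a prediction-level KL
    term matches the softened output distributions."""
    rep_loss = F.mse_loss(student_hidden, teacher_hidden.detach())
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits.detach() / temperature, dim=-1),
        reduction="batchmean") * temperature ** 2
    return rep_loss + kd_loss

# Usage with random placeholders for the student/teacher outputs.
loss = multi_level_distillation_loss(torch.randn(8, 64), torch.randn(8, 64),
                                     torch.randn(8, 10), torch.randn(8, 10))
```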
arXiv Detail & Related papers (2022-11-02T15:23:13Z)
- Learning Multilingual Representation for Natural Language Understanding with Enhanced Cross-Lingual Supervision [42.724921817550516]
We propose a network named decomposed attention (DA) as a replacement for MA.
The DA consists of an intra-lingual attention (IA) and a cross-lingual attention (CA), which model intra-lingual and cross-lingual supervision, respectively.
Experiments on various cross-lingual natural language understanding tasks show that the proposed architecture and learning strategy significantly improve the model's cross-lingual transferability.
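Below is a minimal sketch of how intra-lingual and cross-lingual attention could be decomposed, assuming standard multi-head attention modules shared across the two language segments and a simple additive fusion of the two outputs; the fusion and parameter sharing are assumptions, not the paper's exact DA layer.

```python
import torch
import torch.nn as nn

DIM = 64
intra_attn = nn.MultiheadAttention(DIM, num_heads=4, batch_first=True)   # IA
cross_attn = nn.MultiheadAttention(DIM, num_heads=4, batch_first=True)   # CA

def decomposed_attention(x_src, x_tgt):
    """IA: each language segment attends within itself.
    CA: each segment attends to the other segment."""
    ia_src, _ = intra_attn(x_src, x_src, x_src)
    ia_tgt, _ = intra_attn(x_tgt, x_tgt, x_tgt)
    ca_src, _ = cross_attn(x_src, x_tgt, x_tgt)     # source queries attend to target states
    ca_tgt, _ = cross_attn(x_tgt, x_src, x_src)     # target queries attend to source states
    return ia_src + ca_src, ia_tgt + ca_tgt

# Usage: a bilingual instance with a 10-token source segment and a 12-token target segment.
out_src, out_tgt = decomposed_attention(torch.randn(2, 10, DIM), torch.randn(2, 12, DIM))
```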
arXiv Detail & Related papers (2021-06-09T16:12:13Z)
- AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z)
- Evaluating Multilingual Text Encoders for Unsupervised Cross-Lingual Retrieval [51.60862829942932]
We present a systematic empirical study focused on the suitability of the state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks.
For sentence-level cross-lingual information retrieval (CLIR), we demonstrate that state-of-the-art performance can be achieved.
However, peak performance is not reached with general-purpose multilingual text encoders used off-the-shelf, but rather with their variants that have been further specialized for sentence understanding tasks.
arXiv Detail & Related papers (2021-01-21T00:15:38Z)
- On Learning Universal Representations Across Languages [37.555675157198145]
We extend existing approaches to learn sentence-level representations and show their effectiveness on cross-lingual understanding and generation.
Specifically, we propose a Hierarchical Contrastive Learning (HiCTL) method to learn universal representations for parallel sentences distributed in one or multiple languages.
We conduct evaluations on two challenging cross-lingual tasks, XTREME and machine translation.
arXiv Detail & Related papers (2020-07-31T10:58:39Z)
- XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization [128.37244072182506]
XTREME, the Cross-lingual TRansfer Evaluation of Multilingual Encoders benchmark, evaluates the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks.
We demonstrate that while models tested on English reach human performance on many tasks, there is still a sizable gap in the performance of cross-lingually transferred models.
arXiv Detail & Related papers (2020-03-24T19:09:37Z)